Dragon Palace: Xianyu's Product Understanding Data Analysis Platform

Introduction

Xianyu is a C2C platform, and its sellers differ from B-side sellers: when listing products they prefer the lightweight publishing model of image + description, and they often lack the diligence and professionalism to fill in supplementary structured product information, which makes product understanding very difficult for us. To obtain more structured product information at the publishing stage, we began adding optional fields for key product attributes to the original minimal image + text publishing flow. It turned out that well-chosen structured attribute options do not hurt the publishing experience and greatly enhance our ability to understand products. However, the following problems remained:
When configuring structured attribute options, operations staff relied heavily on industry experience and lacked real-time, multi-dimensional data analysis tools. Offline data reports can track certain key indicators, but their scalability and performance fall short of fine-grained, personalized data query requirements.
To address these problems, we built the Dragon Palace data analysis platform.

Positioning and overall framework of the Dragon Palace data analysis platform

Unlike offline data reports, the Dragon Palace data analysis platform was designed with the following considerations in mind:

  • Real-time requirements. When operations launches a new strategy, or online data fluctuates because of service issues, we want to analyze the coverage of structured category attributes in real time over that period to help operations make further decisions.
  • Multi-dimensional requirements. Xianyu currently has 8,000+ leaf categories, and different industries have different operating focuses, so the platform must support personalized data analysis.
  • Data management requirements. Xianyu's category attributes, SPU data, and operation strategies need to be managed in a unified way.

We hope to feed structured data back to operations and form a closed loop between the production and application of structured product data.

The overall layered framework is as follows:

The key to building the data analysis platform is constructing the data links. At Xianyu, structured data falls into two classes: online data (filled in directly by users through the publishing and editing entrances) and offline data (extracted from product images and text by algorithm models after publishing). Building the data links involves the following key difficulties:

• Large storage volume (2 billion+ rows in total), high access QPS (15,000+), and high service stability requirements.

• Many data sources (10+ kinds) with heterogeneous data, duplicates and conflicts across sources, and strict real-time requirements (second-level latency).

• Complex analysis scenarios (low QPS, but high SQL complexity) that ordinary database queries can hardly support.

Given the large data volume and high QPS, we chose TableStore as the database for storing structured product information. It is a typical column-oriented store with good scalability and high availability, and a single node can support tens of thousands of QPS, which makes it well suited as a big-data storage backend. Its availability reaches 99.99%, and it supports an active-standby dual-database setup.
Our online data lives in the MySQL product tables; a Java application listens for table changes and writes them into the data source table. Offline data is passed to the algorithm modules through ODPS + MQ, and the algorithm results are written into the data source table by Blink jobs. Since online and offline multi-source data may be duplicated or conflicting (for the same product, algorithm A may identify an iPhone 12 while algorithm B identifies an iPhone 11), we designed the system with a source table that stores all raw data and a fused table that stores data after merge processing; the fusion strategy is decided by product and operations.
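To make the two-table design concrete, here is a minimal sketch of what a source-table row might look like; the class and field names are illustrative, not Xianyu's actual schema.

```java
import java.util.Map;

public class SourceTableDemo {
    // One raw row per (item, source); conflicting rows coexist in the source
    // table and are only resolved later by the fusion step.
    record SourceRecord(String itemId, String source, Map<String, String> attributes) {}

    public static void main(String[] args) {
        SourceRecord a = new SourceRecord("item-1", "algorithm_A", Map.of("model", "iPhone 12"));
        SourceRecord b = new SourceRecord("item-1", "algorithm_B", Map.of("model", "iPhone 11"));
        System.out.println(a + "\n" + b); // both rows are kept; neither overwrites the other
    }
}
```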
We use the analytical database ADB for data analysis. ADB falls far short of TableStore in storage capacity and single-node query QPS, but for complex SQL operations, real-time index building, and hot/cold data isolation it offers performance that other databases cannot match, making it the better choice for the analysis database.
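As an illustration of the kind of query that lands on ADB, the sketch below computes real-time attribute coverage per leaf category. It assumes ADB here is AnalyticDB for MySQL (hence plain JDBC over the MySQL protocol); the table and column names are invented for illustration.

```java
import java.sql.*;

public class CoverageQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: one row per item, with a precomputed count of
        // its filled-in structured attributes.
        String sql =
            "SELECT category_id, " +
            "       SUM(CASE WHEN attr_cnt > 0 THEN 1 ELSE 0 END) / COUNT(*) AS coverage " +
            "FROM item_structured_attr " +
            "WHERE gmt_modified >= NOW() - INTERVAL 1 HOUR " + // real-time window
            "GROUP BY category_id " +
            "ORDER BY coverage ASC " +                         // worst-covered categories first
            "LIMIT 100";
        try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://adb-host:3306/longgong", "user", "pwd");
             Statement s = c.createStatement();
             ResultSet r = s.executeQuery(sql)) {
            while (r.next()) {
                System.out.printf("%s coverage=%.2f%n",
                        r.getString("category_id"), r.getDouble("coverage"));
            }
        }
    }
}
```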

Access to offline heterogeneous data sources

At Xianyu, structured data does not come only from what sellers fill in at publish time. As mentioned earlier, Xianyu's C-side sellers are far less professional and diligent about filling in structured attributes than sellers on Taobao and Tmall, so we use multi-modal image-and-text algorithms to supplement products with many structured attributes in the post-publishing links (this part of the CPV currently accounts for about half of overall coverage, varying by category). Accessing this offline data involves the following difficulties:

• Each source differs in data structure and output time and is large in volume, so it is hard to reuse a single model, and the cost of onboarding a new data source is high.

• Data synchronization tasks are scattered, making unified monitoring difficult.

To address these difficulties, we designed a unified access solution for offline heterogeneous data sources:

The offline output of each algorithm is stored in ODPS. Each algorithm's data format and data partitioning differ, so the data of every source is unified into a structured standard label table, idle_kgraph_std_source, through an ODPS synchronization task. The table structure is as follows:

The key column of the table holds the primary key information. Because the primary key differs from scenario to scenario, it is designed as an open primary key: the column contains JSON, where each JSON key is a primary-key column name and each value is the corresponding primary-key value. A Blink task synchronizes the standard table idle_kgraph_std_source in real time to the per-scenario data tables in TableStore. Inside the Blink task, data is sharded by the scene and source fields and routed to different columns of the TableStore table according to the keys in the data. At the same time, to improve efficiency and reduce the number of database writes, the Blink task first merges the data it receives, combining multiple sources' data for the same scenario (such as structured attribute data) into one record before performing the write.
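The sketch below shows the merge-then-write idea in plain Java, with the Blink/Flink plumbing and the actual TableStore client omitted; all names are illustrative.

```java
import java.util.*;

public class StdSourceRouter {
    // One row of idle_kgraph_std_source: an open JSON primary key plus payload.
    record StdRow(String scene, String source,
                  Map<String, String> key,    // open primary key: column name -> value
                  Map<String, String> data) {}

    public static void main(String[] args) {
        List<StdRow> batch = List.of(
            new StdRow("attr", "algorithm_A", Map.of("item_id", "1"), Map.of("model", "iPhone 12")),
            new StdRow("attr", "algorithm_B", Map.of("item_id", "1"), Map.of("color", "black")));

        // Merge rows that share a scene and primary key so that only one
        // TableStore write is issued per row instead of one per source.
        Map<String, Map<String, String>> merged = new HashMap<>();
        for (StdRow row : batch) {
            String routeKey = row.scene() + "|" + row.key();
            merged.computeIfAbsent(routeKey, k -> new HashMap<>()).putAll(row.data());
        }
        merged.forEach((k, cols) ->
            System.out.println("write " + k + " -> " + cols)); // stand-in for the TableStore write
    }
}
```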

With this solution, we solved the problems of scattered data collection and difficult unified monitoring when accessing multiple data sources. At the same time, the open data format of the standard table allows new data sources to be onboarded quickly, greatly reducing the cost of repeated development.

Data processing and fusion

After obtaining data from multiple sources, we need to process and fuse it. The fusion strategy is determined by product and operations, and when the strategy changes, existing product data must be processed and fused again, so the processing and fusion link must offer both stable incremental processing and fast full-volume processing. A sketch of one possible fusion strategy follows.
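The text says only that product and operations decide the fusion strategy; a priority ordering over sources is one natural reading, sketched here with invented source names and policy.

```java
import java.util.*;

public class AttributeFusion {
    // Earlier entries win on conflict; the ordering is purely illustrative.
    static final List<String> PRIORITY = List.of("seller_input", "algorithm_A", "algorithm_B");

    static Map<String, String> fuse(Map<String, Map<String, String>> bySource) {
        Map<String, String> fused = new HashMap<>();
        for (String source : PRIORITY) {                      // walk from highest priority down
            Map<String, String> attrs = bySource.getOrDefault(source, Map.of());
            attrs.forEach(fused::putIfAbsent);                // keep the highest-priority value
        }
        return fused;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> bySource = Map.of(
            "algorithm_A", Map.of("model", "iPhone 12"),
            "algorithm_B", Map.of("model", "iPhone 11", "color", "black"));
        System.out.println(fuse(bySource)); // {model=iPhone 12, color=black} (map order may vary)
    }
}
```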

For full-volume processing, we use a distributed task scheduling system: the master task node splits the full data set into multiple shards along the database's sharding key and sends each data index range to a subtask node, and each subtask pulls its own data. Data fetching and processing are therefore not constrained by the database's physical partitions and channels, which greatly improves performance; processing the full 600 million rows currently takes only 40 minutes. The task distribution strategy is as follows:
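A toy version of this master/subtask split, with thread-pool workers standing in for subtask nodes and made-up shard counts:

```java
import java.util.*;
import java.util.concurrent.*;

public class FullDataScheduler {
    record Shard(long fromId, long toId) {}

    public static void main(String[] args) throws Exception {
        long total = 600_000_000L;   // roughly the full-volume scale quoted above
        int shardCount = 64;
        long step = total / shardCount;

        ExecutorService workers = Executors.newFixedThreadPool(8); // stand-ins for subtask nodes
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            Shard shard = new Shard(i * step, (i + 1) * step);
            futures.add(workers.submit(() -> process(shard)));
        }
        for (Future<?> f : futures) f.get();  // wait for every subtask to finish
        workers.shutdown();
    }

    static void process(Shard s) {
        // Each subtask pulls rows in [fromId, toId) itself, so reads are not
        // tied to physical partitions; re-running a shard is safe as long as
        // the downstream write is an idempotent upsert.
        System.out.println("processing ids " + s.fromId() + " to " + s.toId());
    }
}
```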

Overall, this design addresses the following problems:

  • Distributed task dispatch: the full workload is split across nodes and completed cooperatively.
  • Idempotent operations: a task can be re-run without affecting the final result.
  • Full-volume and incremental processing are isolated from each other and do not affect online services.

Data analysis module design

In data analysis scenarios, many frequent queries involve sorting and filtering by some metric in ascending or descending order. Running a complete database query for every analysis request would put considerable pressure on the database.

Therefore, when designing the data analysis module, we divide a request's analysis conditions into two categories (a code sketch follows the list):

  • Dimension conditions: different dimensions require different query logic, so a Distributor routes the analysis request to the matching processor for execution.
  • Filter and sort conditions: these do not change the query logic; they only sort or filter the query's results. For these, we first fetch results from a cache and sort and filter them in memory, improving analysis performance.
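A minimal sketch of that split, with a cache keyed by the dimension conditions alone so that requests differing only in sort/filter reuse the same query result; all class and method names are invented.

```java
import java.util.*;
import java.util.stream.*;

public class AnalysisDistributor {
    record Row(String categoryId, double coverage) {}

    // Cache keyed by the dimension part of the request only.
    static final Map<String, List<Row>> CACHE = new HashMap<>();

    static List<Row> analyze(String dimensionKey, double minCoverage, boolean desc) {
        List<Row> rows = CACHE.computeIfAbsent(dimensionKey,
                k -> queryDatabase(k));                   // full query only on a cache miss
        Comparator<Row> cmp = Comparator.comparingDouble(Row::coverage);
        if (desc) cmp = cmp.reversed();
        return rows.stream()
                .filter(r -> r.coverage() >= minCoverage) // filter in memory
                .sorted(cmp)                              // sort in memory
                .collect(Collectors.toList());
    }

    static List<Row> queryDatabase(String dimensionKey) {
        // Stand-in for the processor that the Distributor would route to.
        return List.of(new Row("cat-1", 0.8), new Row("cat-2", 0.4));
    }

    public static void main(String[] args) {
        System.out.println(analyze("coverage_by_category", 0.5, true));
    }
}
```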


With this design, the cost of analysis queries is effectively reduced, and average query efficiency improves by more than 50%.

Results

The Dragon Palace platform has now been extended to scenarios such as industry operations, Xianyu search, and Xianyu homepage recommendation, and has achieved staged results:

• It provides industry operations with data analysis across the attribute dimensions of 8,000+ leaf categories, informing decisions about which structured options to surface in Xianyu's main publishing flow; this contributes 80% of the category coverage of Xianyu's structured data and half of the core CPV coverage.

• It provides fast query paths for search, recommendation, and other scenarios, helping developers and algorithm engineers locate online problems in real time with second-level latency, greatly improving the efficiency of bad-case attribution.

Outlook

We are committed to building Dragon Palace into a comprehensive, flexible, and accurate product understanding data platform. Next, we will continue to optimize the following aspects:

• Connect with product publishing and the trading market, integrate product diagnosis capabilities, provide data analysis in more dimensions, expand scenario coverage, and help more product and operations staff make decisions quickly.

• Add more, and more intuitive, data visualizations, and optimize the interface and UI design.

• Add user-dimension data analysis capabilities, connect with the algorithms, and feed analysis results back to the algorithm models so that they can predict accurate, personalized category attributes.