With the explosive growth of Internet data, mainstream e-commerce platforms carry more and more product categories and items, and it is increasingly difficult for users to find the products they need.
The core function of the e-commerce search recommendation system is to retrieve suitable products from a large number of products and display them to users according to the user's search intent and preferences. In this process, the system needs to calculate the similarity between the product and the user's search intent and preference, so as to recommend the TopK products with the highest similarity to the user.
Commodity data, user search intent, and user preferences are all unstructured data. We first tried the CosineSimilarity function of the search engine Elasticsearch (ES) 7.x to calculate the similarity of such data, but this approach has the following disadvantages:
- Long response time: the average delay for searching millions of products and recalling TopK results is about 300 ms.
- High ES index maintenance cost: commodity vector data and other related data share the same set of indexes, which makes index construction inconvenient and causes the data scale to become too large.
We also developed our own locality-sensitive hashing plug-in to accelerate ES's CosineSimilarity calculation. Although performance and throughput improved significantly, a delay of more than 100 ms still could not meet the needs of online product retrieval.
After research and comparison, we decided to use the open source vector database Milvus. Compared with the stand-alone version of Faiss commonly used in the industry, Milvus offers distributed deployment, SDKs in multiple languages, read-write separation, and more.
We use various deep learning models to convert massive unstructured data into feature vectors and import them into Milvus. With the excellent performance of Milvus, the e-commerce search recommendation system we built can efficiently query TopK vectors that are similar to the target vector.
As shown in the figure, our overall architecture is mainly divided into two parts:
- Writing process: Normalize the item vector generated by the deep learning model and write it into MySQL. The data synchronization tool (ETL) reads the item vector in MySQL and imports it into the vector database Milvus.
- Reading process: The search service obtains the user vector from the user's query keywords and user portrait, queries Milvus for similar vectors, and recalls the TopK item vectors.
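The normalization step in the writing process can be illustrated with a minimal sketch. It assumes (the article does not say so explicitly) that vectors are L2-normalized so that the inner product used at query time behaves like cosine similarity. The function names and sample vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (norm_a * norm_b)

# Hypothetical item and user vectors.
item = [3.0, 4.0]
user = [1.0, 0.0]

# After normalizing both sides, the inner product equals the cosine similarity.
ip = inner_product(l2_normalize(item), l2_normalize(user))
cos = cosine_similarity(item, user)
```

If only the item vectors are normalized, the inner product differs from the cosine similarity by a constant factor (the norm of the user vector), so the ranking for a given query is unchanged.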
Milvus supports two update methods: incremental update and full update. An incremental update must delete the existing item vector and insert a new one, so the index has to be rebuilt every time the collection is updated; that approach is better suited to scenarios with many reads and few writes. We therefore chose the full update method. Writing millions of vectors in batches across multiple partitions takes only a few minutes, which is close to real-time updating.
The Milvus write node is responsible for all write operations, including creating collections, building indexes, and inserting vectors, and serves external callers through a write domain name. The Milvus read nodes are responsible for all read operations and serve external callers through a read-only domain name.
Since Milvus currently does not support alias switching of collections, we introduced Redis to achieve seamless switching of aliases between multiple full data collections.
The read nodes only need to read existing metadata, vector data, and index files from MySQL, Milvus, and the GlusterFS distributed file system, so read capacity can be scaled horizontally by deploying more instances.
The data update service includes not only writing vector data, but also vector data volume detection, index construction, query warm-up (loading index files into memory), alias control, etc. The overall process is as follows:
- Assume that before the full data set is built, the collection CollectionA serves external requests, and the alias for the full data set in use points to CollectionA (redis key1 = CollectionA). The purpose of the full build is to create a new collection, CollectionB.
- Commodity data verification: check the number of commodity records in the database table, compare it with the existing CollectionA data, and apply alarm thresholds based on quantity and percentage. If the threshold is not reached, the full data set is not built, the build is regarded as failed, and an alarm is raised. Once the threshold is reached, the full build starts.
- Start the full build: initialize the alias of the full data set being built and update Redis. After the update, the alias of the data set being built points to CollectionB (redis key2 = CollectionB).
- Create a new full collection: check whether CollectionB exists. If it does, delete it and create it again.
- Batch write: take the product ID modulo the number of partitions to get the partitionId, and write the data for multiple partitions into the newly created collection in batches.
- Build index and warm up: call createIndex() on the new collection; the index file is stored on the GlusterFS distributed storage server. Then automatically send simulated query requests to the new collection to load the index into memory and warm it up.
- Collection data verification: verify the data in the new collection, compare it with the existing collection, and apply alarm thresholds based on quantity and percentage. If the threshold is not reached, the collection is not switched, the build is regarded as failed, and an alarm is raised.
- Switch collection (alias control): update Redis so that the alias for the full data set in use points to CollectionB (redis key1 = CollectionB), and delete Redis key2. The build is now complete.
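The update flow above can be sketched as follows. This is a simplified illustration, not the production service: an in-memory `FakeKV` class stands in for Redis, a plain dictionary stands in for Milvus, and the key names, partition count, and 90% threshold are assumptions chosen for the example. Index building and warm-up are only marked by a comment.

```python
class FakeKV:
    """In-memory stand-in for Redis, supporting only get/set/delete."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value
    def delete(self, key):
        self._data.pop(key, None)

IN_USE_KEY = "key1"      # alias of the collection serving reads
BUILDING_KEY = "key2"    # alias of the collection being built
NUM_PARTITIONS = 4       # assumed partition count
MIN_RATIO = 0.9          # assumed threshold: new count must be >= 90% of old

def partition_id(item_id, num_partitions=NUM_PARTITIONS):
    """Pick the partition by taking the item ID modulo the partition count."""
    return item_id % num_partitions

def count_ok(new_count, old_count, min_ratio=MIN_RATIO):
    """Percentage check used before building and before switching."""
    if old_count == 0:
        return new_count > 0
    return new_count / old_count >= min_ratio

def collection_size(store, name):
    return sum(len(rows) for rows in store.get(name, {}).values())

def full_update(kv, store, items, new_name):
    """Build new_name from items and switch the in-use alias on success.

    store maps collection name -> {partition id: [(item_id, vector), ...]}.
    Returns False (where an alarm would be raised) if a check fails.
    """
    old_count = collection_size(store, kv.get(IN_USE_KEY))
    if not count_ok(len(items), old_count):      # commodity data verification
        return False
    kv.set(BUILDING_KEY, new_name)               # start the full build
    store.pop(new_name, None)                    # drop a stale collection
    store[new_name] = {p: [] for p in range(NUM_PARTITIONS)}
    for item_id, vector in items:                # batch write by partition
        store[new_name][partition_id(item_id)].append((item_id, vector))
    # Index build and query warm-up would happen here.
    if not count_ok(collection_size(store, new_name), old_count):
        return False                             # collection data verification
    kv.set(IN_USE_KEY, new_name)                 # switch the alias
    kv.delete(BUILDING_KEY)
    return True

kv = FakeKV()
store = {"CollectionA": {0: [(0, [1.0])], 1: [(1, [1.0])]}}
kv.set(IN_USE_KEY, "CollectionA")
items = [(i, [float(i)]) for i in range(10)]
ok = full_update(kv, store, items, "CollectionB")
```

Readers keep using the old collection until the final `set`, so a failed build never affects online traffic.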
The recall service obtains the user vector from the user's query keywords and user portrait, queries each Milvus partition to calculate the similarity between the user vector and the item vectors, and returns the TopK item vectors after merging the results. The overall schematic diagram is as follows:
The following table lists the main services involved in this process. It can be seen that the average delay of recalling TopK vectors is about 30 ms.
- According to the user query keywords and user portrait information, the user vector is calculated through the deep learning model.
- Read the alias of the full data set currently in use from Redis (currentInUseKeyRef) to get the Milvus collection name. The data synchronization service writes this alias to Redis after each full update completes.
- Use the user vector to concurrently and asynchronously call Milvus to obtain data from different partitions of the same collection. Milvus calculates the similarity between the user vector and the item vector, and returns similar TopK items in each partition.
- Merge the TopK products returned by each partition, sort the results in descending order of similarity (with the inner product (IP) metric, a larger value means more similar), and return the final TopK products.
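The fan-out and merge steps above can be sketched as follows. The `search_partition` coroutine is a hypothetical stand-in for an asynchronous Milvus partition search; here it simply scores every vector by inner product. The partition layout and sample vectors are made up for the example.

```python
import asyncio
import heapq

async def search_partition(partition, query, top_k):
    """Hypothetical stand-in for a Milvus partition search (brute force)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    scored = [
        (sum(q * x for q, x in zip(query, vector)), item_id)
        for item_id, vector in partition
    ]
    return heapq.nlargest(top_k, scored)

async def recall_top_k(partitions, query, top_k):
    """Query all partitions concurrently, then merge into a global TopK."""
    per_partition = await asyncio.gather(
        *(search_partition(p, query, top_k) for p in partitions)
    )
    merged = [hit for hits in per_partition for hit in hits]
    # Larger inner product means more similar, so keep the largest scores.
    best = heapq.nlargest(top_k, merged)
    return [(item_id, score) for score, item_id in best]

# Two hypothetical partitions of (item_id, normalized vector) pairs.
partitions = [
    [(0, [1.0, 0.0]), (4, [0.6, 0.8])],
    [(1, [0.0, 1.0]), (5, [0.8, 0.6])],
]
result = asyncio.run(recall_top_k(partitions, [1.0, 0.0], top_k=2))
```

Each partition must return its own TopK, because in the worst case all of the global TopK items live in a single partition.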
At present, Milvus-based vector recall runs stably in our search recommendation scenarios. Its high performance gives us more room to experiment with model dimensions and algorithm selection.
There will be more scenarios that use vector similarity calculations in the future, including the recall of the main site search and full-scene recommendation, etc. Milvus will play a vital middleware role in it.
The three most anticipated functions of Milvus in the future are as follows:
- Collection alias switching: no need to coordinate switching between multiple collections through external components.
- Filtering mechanism: Milvus v0.11.0 supports an ES-style DSL filtering mechanism only in the stand-alone version. We hope a filtering mechanism that works with read-write separation is introduced soon for vector correlation checking.
- Hadoop Distributed File System (HDFS) storage support: the 0.10.6 version we use only supports POSIX file interfaces, so we deployed GlusterFS, which supports FUSE, as the storage backend. However, HDFS would be a better choice in terms of performance and scalability.
Lessons learned and best practices
- For read-heavy applications, deploying reads and writes separately can greatly increase processing capacity and improve performance.
- The Milvus Java client does not have a reconnection mechanism. Because the Milvus client used by the recall service stays resident in memory, we had to build our own connection pool and use heartbeat checks to keep the connection between the Java client and the server available.
- Milvus occasionally produced slow queries. Our investigation traced these to insufficient warm-up of the new collection. Sending simulated query requests to the new collection loads the index into the cache and warms it up.
- nlist is an indexing parameter and nprobe is a query parameter. Reasonable values need to be determined through stress testing for your own business scenario, balancing retrieval performance against retrieval accuracy.
- For static data scenarios, it is more efficient to import all the data into the collection first and then build the index.
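The heartbeat idea from the reconnection lesson can be sketched as follows. The production service uses the Milvus Java client; this illustration is in Python, manages a single connection rather than a full pool, and takes hypothetical `connect` and `ping` callables instead of any real Milvus API.

```python
import time

class ManagedConnection:
    """Wraps a client and reconnects when a periodic heartbeat check fails."""

    def __init__(self, connect, ping, interval_s=5.0, clock=time.monotonic):
        self._connect = connect      # hypothetical factory returning a client
        self._ping = ping            # hypothetical health check: client -> bool
        self._interval = interval_s
        self._clock = clock
        self._client = connect()
        self._last_check = clock()
        self.reconnects = 0

    def _healthy(self):
        """Treat a failed or raising ping as an unhealthy connection."""
        try:
            return bool(self._ping(self._client))
        except Exception:
            return False

    def get(self):
        """Return a usable client, running the heartbeat if it is due."""
        now = self._clock()
        if now - self._last_check >= self._interval:
            self._last_check = now
            if not self._healthy():
                self._client = self._connect()
                self.reconnects += 1
        return self._client

class FakeClient:
    """Test double whose liveness can be toggled."""
    def __init__(self):
        self.alive = True

now = [0.0]  # fake clock so the example does not need to sleep
conn = ManagedConnection(FakeClient, lambda c: c.alive,
                         interval_s=5.0, clock=lambda: now[0])
first = conn.get()
first.alive = False   # simulate a dropped connection
now[0] = 6.0          # advance past the heartbeat interval
second = conn.get()
```

A real pool would hold several such connections and typically run the heartbeat on a background thread rather than on the request path.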
With the vision of redefining data science, Zilliz is committed to building a world-leading open source technology innovation company and unlocking the hidden value of unstructured data for enterprises through open source and cloud native solutions.
Zilliz built the Milvus vector similarity search engine to accelerate the development of the next-generation data platform. Milvus is currently an incubation project of the LF AI & Data Foundation and is capable of managing large unstructured data sets. Our technology has a wide range of applications in new drug discovery, computer vision, recommendation engines, chatbots, and more.