Introductory Notes on Retrieval Technology



Introduction to Search Technology


These notes mainly follow a column on retrieval technology (referenced below by lecture number), so that I can gain a preliminary understanding of the field.

1. Basic knowledge

1.1 Why is retrieval difficult to learn?
On the one hand, most classic textbooks are too theoretical and not closely tied to real-world work, which makes them hard to learn from. For example, "Modern Information Retrieval" and "Introduction to Information Retrieval" are two classics, yet most readers find them difficult: the way the material is organized, and the examples used, differ considerably from day-to-day engineering work.
On the other hand, textbooks that are tied to real-world work usually start from the perspective of a particular industry and cover every aspect of that industry, rather than focusing on retrieval as a basic technology. Database textbooks, for example, focus on data models, SQL, and transaction processing, while search-engine textbooks cover crawlers, text mining, natural language processing, and web link analysis.
The entry barrier is therefore relatively high: for retrieval technology there are plenty of theoretical treatments and plenty of application-oriented ones, but few that combine both.

1.2 How to learn?
First understand the basic data structures of retrieval, then study some industrial solutions, and finally look systematically at how retrieval technology is applied in storage systems, search engines, advertising systems, and recommendation systems.

1.3 The overall framework of search


Retrieval can be divided into four layers: hardware; data structures and algorithms; domain-specific search expertise; and the business application layer.

2. Basic technology

  1. For unordered arrays we can only scan linearly; for sorted arrays we can use binary search. Linked lists are easy to modify, which makes them suitable when data changes frequently, and they also avoid the need for one contiguous block of memory. Combining the two, several sorted arrays can be chained together with linked lists, giving an efficient retrieval structure.
  2. For non-linear structures and frequently changing data, such as a file directory, a binary search tree or a skip list works better than an array.
  3. The methods above are O(log n); can we go faster? Hashing can, with O(1) lookups. But once the number of hashed elements exceeds the array's capacity, collisions are guaranteed. Open addressing can resolve collisions: linear probing is the simplest, but it clusters entries and hurts later insertions; quadratic probing avoids this by using quadratic step sizes for the probe distance i; double hashing is another option. Separate chaining also works. Note that although a hash table's direct subscript access by hash value gives O(1) queries, it gives up the "ordered storage" property.
  4. For state (membership) retrieval, bitmaps are better and save space: one bit per element, tested with bit arithmetic. To compress further we can shrink the array length, but that makes hash collisions more likely. Bloom filters address this by using multiple hash functions to reduce collisions: instead of one bit per object, a Bloom filter uses k bits per object, so the probability that two different objects share all k bits is greatly reduced. A Bloom filter can still report false positives when its bits are covered by other objects, but the probability is small, and a final authoritative check can be delegated to disk. Although the principles and implementations of bitmaps and Bloom filters are very simple, they appear in many complex large-scale systems.
  5. What we covered so far are forward indexes. The core of the inverted index is not complicated: a concrete implementation is simply a hash table, but instead of using the document ID or title as the key, it does the reverse, using a term as the key and storing the corresponding list of documents as the value, so a lookup completes in O(1).
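The k-bits-per-object idea from point 4 can be sketched in a few lines. This is a minimal illustration, not a production filter: the bit-array size `m=1024`, `k=3` hash functions, and the salted-SHA-1 hashing scheme are all arbitrary choices for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set k bits per element."""
    def __init__(self, m=1024, k=3):
        self.m = m      # number of bits in the "array"
        self.k = k      # number of hash functions
        self.bits = 0   # a Python int doubles as a bit array

    def _positions(self, item):
        # Derive k bit positions from k salted SHA-1 digests.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("url-1")
bf.add("url-2")
print(bf.might_contain("url-1"))   # True (added items always test True)
print(bf.might_contain("url-99"))  # False unless all 3 bits happen to collide
```

The "hand the real judgment to disk" step from the text would go exactly where `might_contain` returns True: only then do we pay for the authoritative lookup.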
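The forward-vs-inverted contrast in point 5 can be shown directly. A minimal sketch, with made-up documents and naive whitespace tokenization standing in for real word segmentation:

```python
from collections import defaultdict

# Forward index: document ID -> content.
docs = {
    1: "search engines build inverted indexes",
    2: "databases build b+ tree indexes",
    3: "inverted indexes map terms to documents",
}

# Inverted index: term -> list of doc IDs (a posting list).
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):  # set() so each doc is listed once per term
        inverted[term].append(doc_id)

# One O(1) hash lookup fetches the posting list, instead of scanning every doc.
print(sorted(inverted["inverted"]))  # [1, 3]
print(sorted(inverted["indexes"]))   # [1, 2, 3]
```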

3. Advanced combat

So far everything was retrieval in memory. Next, let's discuss storage and retrieval backed by disk. The relational databases we are familiar with, such as MySQL and Oracle, are typical such systems.

  1. Relational-database retrieval is built on the B+ tree. Non-relational databases often use the LSM tree, which is designed around this idea: writes are not flushed to disk immediately; instead, data is kept in an in-memory tree for normal storage and queries, and when that tree grows past a threshold, it is written out in batches, block by block, to the tree on disk.
  2. Search engines index trillions of web pages using the inverted index. At that scale the generated inverted index is far too large to fit in memory, so a dictionary (term directory) is placed in front of it; if the dictionary itself is also very large, a B+ tree is added to manage the dictionary.

  3. Distributed retrieval: with distributed techniques we can split the inverted index. Splitting has two advantages: on the one hand, more of the index can be loaded into memory, reducing disk accesses and greatly improving retrieval efficiency; on the other hand, with document-based splitting, a query request can be copied to multiple index servers that work in parallel, shortening the latency of a single search.
  4. Exact Top K retrieval: mainstream search engines score documents with hundreds of factors. Once we want to account for that many factors and parameters, BM25 alone no longer suffices, and machine learning comes in: learned scoring is an active research area and the mechanism many search engines use today, and with the development of deep learning, more complex scoring models such as deep neural networks (DNNs) and their variants have appeared. After scoring, the sorting stage must also be efficient: for exact Top K we can replace a full sort with heap selection and return only the k most important results, at a cost of O(n) + O(k log n), which at very large scale performs far better than O(n log n).
  5. Approximate Top K retrieval: high-quality results do not have to be ranked exactly; we only need the high-quality results to appear somewhere in the final Top K. There are three main ways to accelerate approximate Top K retrieval: truncation after ordering posting lists by static quality score, together with a winner table (champion list); truncation using term frequency as a cheap relevance judgment; and tiered indexing, which answers one query request with a two-layer retrieval. The core idea of all three is to move as much computation as possible from the online path to the offline path, so that online, during the inverted-index lookup, only a small amount of judging is needed to cut off the Top K quickly, greatly improving search efficiency.
  6. Nearest-neighbor retrieval: if near-duplicate articles are shown unfiltered in search or recommendation results, users may see almost identical content across the first page, which makes for a very poor experience. Deduplicating similar articles is therefore an important step in search engines and recommendation engines.
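The LSM write path from point 1 can be sketched as a toy: a plain dict stands in for the in-memory tree (memtable), a list of sorted blocks stands in for the on-disk segments, and the flush threshold of 3 is arbitrary. Real LSM engines add write-ahead logs, compaction, and sorted on-disk formats, none of which appear here.

```python
class ToyLSM:
    """Toy LSM-tree write path: buffer writes in memory, flush a sorted
    block to 'disk' (here just a list) when a threshold is reached."""
    def __init__(self, flush_threshold=3):
        self.memtable = {}              # in-memory map for fresh writes
        self.segments = []              # flushed, sorted, immutable blocks
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value      # writes never touch "disk" directly
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Write the whole memtable out as one sorted block, then reset it.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then segments newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for block in reversed(self.segments):
            for k, v in block:
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(5):
    db.put(f"k{i}", i)
print(len(db.segments), db.get("k0"), db.get("k4"))  # 1 0 4
```

After five writes, the first three were flushed as one segment and the last two still sit in the memtable, which is exactly the "delay writing to disk" behavior the text describes.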
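The O(n) + O(k log n) figure in point 4 corresponds to heap selection: heapify all n scores in O(n), then pop the top result k times at O(log n) each. A minimal sketch with made-up scores, using Python's `heapq` (a min-heap, so we negate to simulate a max-heap):

```python
import heapq

scores = [0.3, 0.9, 0.1, 0.7, 0.5, 0.8, 0.2]  # hypothetical relevance scores

heap = [-s for s in scores]   # negate: heapq is a min-heap
heapq.heapify(heap)           # O(n)

k = 3
top_k = [-heapq.heappop(heap) for _ in range(k)]  # k pops, O(k log n)
print(top_k)  # [0.9, 0.8, 0.7]
```

A full `sorted(scores, reverse=True)[:k]` gives the same answer at O(n log n); the heap version wins when n is large and k is small, which is exactly the search-engine setting.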

4. System case

search engine:

  1. Search engines should be very familiar: they are essential tools in study and work, and their defining feature is finding the information we need among trillions of web pages quickly. The retrieval techniques that search engines represent can be studied and borrowed by any retrieval system based on text and keywords.
  2. The core of a search engine is divided into three subsystems: the crawler system, the index system, and the retrieval system.

Crawler system: a high-performance crawler continuously fetches web pages and stores them on a storage platform. Typically, crawled pages can be stored in HBase, which is built on the LSM tree, to support efficient reads and writes.

Index system: after the crawler fetches pages, they must go through a series of processing steps before they become usable indexes. Processing falls into two stages. First, preprocessing: near-duplicate page removal, page-quality analysis, word segmentation, and so on. Then, index generation, in three main steps: 1) split the index (along two dimensions: by document and by page tier); 2) build the index, generating the inverted index files, where each file is an index segment that can be loaded onto an online server to serve retrieval; 3) update the index (which I have not studied in depth). With indexes built this way, a search engine can serve efficient retrieval over trillions of web pages.

Retrieval system: when a user searches for a keyword, the retrieval system first performs query analysis, inferring the user's real intent from the query terms themselves and the user's behavioral features. If the query looks misspelled or returns few results, the engine also offers spelling correction or related-query suggestions, then retrieves with the rewritten query. In the retrieval service, the engine sends the query terms to the corresponding index shards, and each shard returns its partial results through the inverted-index lookup. For the merged results, the engine applies a machine-learned scoring model (trained offline) based on relevance analysis and quality analysis, and selects the Top K results (Lecture 11) to complete the search. The biggest characteristic of search engines is that the user's intent is explicit; this differs from advertising and recommendation.
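The fan-out-and-merge flow just described can be sketched end to end. Everything here is illustrative: the shard contents, the (doc_id, score) pairs, and the `retrieve` helper are all made up to show the shape of document-split retrieval, not any real engine's API.

```python
import heapq

# Hypothetical document-split shards: each holds its own small inverted index
# mapping a term to a posting list of (doc_id, score) pairs.
shards = [
    {"search": [(1, 0.9), (4, 0.4)]},
    {"search": [(7, 0.8)], "engine": [(7, 0.5)]},
    {"search": [(2, 0.6), (9, 0.3)]},
]

def retrieve(query, k=3):
    # Fan the query out to every shard and gather partial results...
    partials = []
    for shard in shards:
        partials.extend(shard.get(query, []))
    # ...then merge and keep only the global Top K by score.
    return heapq.nlargest(k, partials, key=lambda hit: hit[1])

print(retrieve("search"))  # [(1, 0.9), (7, 0.8), (2, 0.6)]
```

In a real system each shard would rank its own partial list in parallel on a separate server; here the "shards" are just dicts queried in a loop.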

After reading

  1. I learned a lot here, especially the key ideas behind indexing. Before studying MySQL I did not know how indexes actually evolved: from arrays, linked lists, hash tables, and binary trees to red-black trees, B+ trees, and inverted indexes (which look up keys by value). With this, I understand where MySQL's indexes fit.
  2. The guiding overview shows how broad and deep the author's summary is; this is the level of a transferable expert who has long specialized in one technical point. With that kind of depth, why fear cut-throat competition (内卷)? This field involves real algorithms and presumably demands strong academic credentials; for someone pursuing a Ph.D., it is a genuinely suitable area.
  3. How do you build search from scratch? At the beginning, MySQL's full-text search is enough. When users reach the hundreds of thousands, adopt the industry's mature standard, Elasticsearch (which everyone uses); it solves most problems. But if the data volume is too large or the features must be customized, self-development is required, for example a dedicated service written in C++.
  4. This is also my first contact with such broad applications of deep learning. What I did before was also a classification task, but text classification is new to me; feature extraction should be similar, so I hope what I learned before carries over. Still, the algorithm side and the engineering side seem fairly decoupled: engineering only needs the feature parameters for ranking.
  5. Through this column I now understand how search works and am no longer a complete outsider to the field. Next I can pair it with a book dedicated to search, read search-engine internals in depth, learn the industry's professional vocabulary, and practice with Elasticsearch along the way.