Understanding of the inverted index

Inverted index is the main method of full-text search. The content is divided into words
by word segmentation , and each word is taken out as the key, and the value is the article ID where the word is located.

For example, we have a MySQL database as follows:

Insert picture description here


If we want to search, keywords are generally used to search for content, such as we have to search for articles within hotthis word article URL, then you need to use select url from article where content like '%hot%'such a statement to search.
But because the B+ tree index can only support the prefix search, for example hot%, the prefix search can use the index for fast search. Since it is not a prefix search in general, the index cannot be used, so it becomes a full table search, which is very slow in efficiency.

Because the B+ tree can't complete the search operation well, then full-text search appears. Full-text search is a technology that finds out any content information in an entire book or an entire article stored in a database.

Inverted index

Full-text search is implemented using an inverted index. The inverted index, like the B+ tree index, is also an index structure. It stores the mapping between the word and the position of the word itself in one or more documents in the auxiliary table ( inverted table ): {word, ID of the document where the word is located}
For the article data table above, then its auxiliary table (inverted list) is:

Insert picture description here


Figure, content field, all the words are taken out as the Textvalue of the field, and then documentsis stored in the document ID, such as the above coldappeared in the 1 and 4 of these two articles.

Tokenizer

To extract words from a paragraph in the content field, you need to use a tokenizer, which can divide a paragraph into words

Analyzer:

effect:

  1. Divide a piece of text into independent entries suitable for inverted indexing
  2. Unify these terms into a standard format to improve their 可搜索性

The analyzer actually encapsulates three functions into one package:

  • Character filter:
    ​ First, the string passes through each character filter in order. The purpose is to sort the string before word segmentation. A character filter can be used to remove HTML, or convert & to and
  • The tokenizer : ​ The
    string is divided into individual terms by the tokenizer. When a simple tokenizer encounters spaces and punctuation, it may split the text into terms
  • Token filter: ​ The
    entry passes through each token filter in order. This process may change entries (for example, lowercase words), delete entries (such as useless words such as a, and, the), or add entries
The above inverted table is completely based on the content, if we use an analyzer to segment words, it may be inconsistent

to sum up

Once you have inverted list, we can add to the inverted table B + tree index to quickly find keywords to use when searching for the index, and then get articleIDalso faster to find the URL after the article, which use a full table to find than direct Much faster. So search engines use inverted index for full-text search