Technical deep dive | How to implement full-text search of IM SDK chat messages on Electron

Introduction: in IM client scenarios, full-text search over local data plays an important role. This article explains how NetEase Yunxin implements it.

Article|Li Ning, Senior Front-end Development Engineer, NetEase Yunxin

Full-text search is a technique for finding the occurrences of a word within a large collection of documents. In traditional relational databases it can only be implemented with LIKE, which has several drawbacks:

It cannot use database indexes, so every query scans the whole table, which performs poorly.

Search quality is poor: it only supports fuzzy matching at the beginning or end of the text, so complex search requirements cannot be met.

It cannot return the relevance between a document and the search query.

NetEase Yunxin IM has already implemented full-text search over local data based on libraries such as SQLite on iOS, Android, and desktop, but this capability has been missing on the Web and on Electron. On the Web, browser restrictions mean the only local database available is IndexedDB, which is out of scope for now. On Electron, although the Chromium kernel is also built in, the ability to use Node.js gives us more options.

Let's first look at how full-text search is implemented.

1 Basic technical concepts

Full-text search cannot be implemented without the following two concepts:

Inverted index

Word segmentation

These two techniques are both the core and the difficult part of full-text search, and their implementation is relatively complex. Before discussing how the full-text index is built, let's look at how each of them is implemented.

Inverted index

First, a brief introduction to the inverted index. Its concept differs from that of a forward index:

Forward index: the document's unique ID is the key, and the record stores the document's content.

Inverted index: each word in the document content is the key, and the record stores the IDs of the documents containing that word.

Take the inverted index library search-index as a practical example. In NetEase Yunxin IM, every message object has an idClient field as its unique ID. Suppose we input "今天天气真好" ("the weather is really nice today") and split it into single characters (word segmentation is covered in detail below), so the input becomes "今", "天", "天", "气", "真", "好". We then write it into the library through search-index's PUT method, and finally look at the structure of the stored content:

As shown in the figure, you can see the structure of the inverted index: the key is a single character produced by segmentation, and the value is an array of the idClients of the message objects containing that character. Besides this, search-index also stores other data such as weights, counts, and forward-index data, which support sorting, pagination, per-field search, and other features; this article will not go into them.
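As a concrete illustration, here is a minimal sketch of writing a segmented message into search-index and querying it back. The field names and options are illustrative, and the exact initialization and query API depends on the search-index version in use:

const si = require('search-index')

async function demo () {
  // Open (or create) a local inverted index; on Electron it is backed by LevelDB.
  const index = await si({ name: 'message-index' })

  // The message's idClient becomes the document _id; the segmented characters
  // become the indexed field, mirroring the structure described above.
  await index.PUT([
    { _id: 'idClient-123', msg: ['今', '天', '天', '气', '真', '好'] }
  ])

  // Querying a single character returns the documents (idClients) containing it.
  const result = await index.QUERY({ AND: ['天'] })
  console.log(result)
}

demo()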

Word segmentation

Word segmentation means splitting the original message content into words or phrases based on semantics. Considering the quality of Chinese word segmentation and the need to run on Node.js, we chose Nodejieba as the base segmentation library. The following is a flowchart of the jieba segmentation process:

Take "Go to Peking University to play" as an example, let's select the most important modules to analyze:

Load dictionary

When jieba is initialized, it first loads the dictionary, whose general content looks like this:

Building a prefix dictionary

Next, a prefix dictionary is built from the dictionary, with the following structure:

Among them, "Peking University" is the prefix of "Peking University", and its word frequency is 0, which is to facilitate the subsequent construction of DAG graphs.

Build a DAG graph

DAG is short for Directed Acyclic Graph.

Based on the prefix dictionary, the input is segmented. "去" does not start any longer dictionary word, so it has only one segmentation; for "北" there are three candidates: "北", "北京", and "北京大学"; for "京" there is again only one; for "大" there are two: "大" and "大学"; for "学" and "玩" there is also only one each. In this way we obtain the candidate segmentations starting at each character, and the resulting DAG is shown in the following figure:
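The DAG construction can be sketched as follows, assuming `prefixDict` maps words and their prefixes to frequencies as described above; for each start position it records the end positions of every dictionary word beginning there:

function buildDAG (sentence, prefixDict) {
  const dag = {}
  const n = sentence.length
  for (let i = 0; i < n; i++) {
    dag[i] = []
    let j = i
    let fragment = sentence[i]
    // Extend the fragment while it is still a known prefix.
    while (j < n && fragment in prefixDict) {
      if (prefixDict[fragment] > 0) dag[i].push(j) // a real dictionary word ends at j
      j += 1
      fragment = sentence.slice(i, j + 1)
    }
    if (dag[i].length === 0) dag[i].push(i) // fall back to the single character
  }
  return dag
}

// With jieba's full dictionary, dag[1] for '去北京大学玩' contains the end positions
// of '北', '北京' and '北京大学', matching the three candidates described above.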

Maximum probability path calculation

All the paths of the above DAG graph are as follows:

去 / 北 / 京 / 大 / 学 / 玩

去 / 北京 / 大 / 学 / 玩

去 / 北京 / 大学 / 玩

去 / 北京大学 / 玩

Each node carries a weight; for words in the prefix dictionary, the weight is the word frequency. The problem therefore becomes finding the path that maximizes the total weight of the whole sentence.

This is a typical dynamic programming problem. First of all, we confirm the two conditions of dynamic programming:

Overlapping sub-problems: for a node i and its possible successors j and k:

The weight of any path that reaches j through i = the weight of that path up to i + the weight of j, i.e. R(i -> j) = R(i) + W(j)

The weight of any path that reaches k through i = the weight of that path up to i + the weight of k, i.e. R(i -> k) = R(i) + W(k)

That is, for j and k sharing the common predecessor node i, the weight of the path up to i would be computed repeatedly.

Optimal substructure: suppose the weight of the optimal path for the whole sentence is Rmax, its end node is x, and the possible predecessor nodes of x are i, j, and k. The formula is:

Rmax = max(Rmax_i, Rmax_j, Rmax_k) + W(x)

The problem thus reduces to solving Rmax_i, Rmax_j, and Rmax_k: the optimal solution of each substructure is part of the global optimal solution.
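A sketch of the corresponding route calculation, walking the DAG backwards exactly as the recurrence above suggests; `wordLogFreq` is an assumed helper returning the log frequency of a word (with a very small value for unseen or 0-frequency entries):

function calcRoute (sentence, dag, wordLogFreq) {
  const n = sentence.length
  const route = new Array(n + 1)
  route[n] = { score: 0, end: n } // base case: the empty suffix contributes nothing
  // Work backwards: best total log frequency for the suffix starting at i.
  for (let i = n - 1; i >= 0; i--) {
    let best = { score: -Infinity, end: i }
    for (const end of dag[i]) {
      const word = sentence.slice(i, end + 1)
      const score = wordLogFreq(word) + route[end + 1].score
      if (score > best.score) best = { score, end }
    }
    route[i] = best
  }
  // Recover the segmentation from the route table.
  const words = []
  let i = 0
  while (i < n) {
    words.push(sentence.slice(i, route[i].end + 1))
    i = route[i].end + 1
  }
  return words
}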

Following this approach, the optimal path finally comes out as "去 / 北京大学 / 玩".
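In practice we do not implement any of this ourselves; Nodejieba exposes the whole pipeline. A minimal check, assuming nodejieba is installed:

const nodejieba = require('nodejieba')

console.log(nodejieba.cut('去北京大学玩'))
// expected output: [ '去', '北京大学', '玩' ], the max-probability path computed above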

HMM (Hidden Markov Model)

For out-of-vocabulary words, jieba uses an HMM (Hidden Markov Model) for segmentation. It treats segmentation as a sequence labeling problem: the sentence is the observation sequence and the segmentation result is the state sequence. The author of jieba mentioned in an issue that the HMM parameters were estimated from the 1998 People's Daily segmentation corpus available online, an MSR corpus, and TXT novels collected by the author, segmented with ICTCLAS, with word frequencies finally counted using Python scripts.

The model consists of a five-tuple and has two basic assumptions.

The five-tuple:

State value set

Observation set

Initial state probability

State transition probability

State emission probability

Basic assumptions:

Homogeneity assumption: the state of the hidden Markov chain at any time t depends only on its state at the previous time t-1, and is independent of the states and observations at other times as well as of t itself.

Observation independence assumption: the observation at any time depends only on the state of the Markov chain at that time, and is independent of other observations and states.

The state value set is {B: begin, E: end, M: middle, S: single}, describing each character's position within a word: B is the beginning of a word, E the end, M the middle, and S a character that forms a word by itself.

The observation set is the set of characters in the input sentence.

The initial state probability gives the probability that the first character of the sentence belongs to each of the four states B, M, E, and S. The probabilities of E and M are both 0, because the first character can only be B or S, which matches reality.

The state transition probability is the probability of moving from one state to the next, and satisfies the homogeneity assumption. The structure can be represented as a nested object:

P = {
    B: { E: -0.510825623765990, M: -0.916290731874155 },
    E: { B: -0.5897149736854513, S: -0.8085250474669937 },
    M: { E: -0.33344856811948514, M: -1.2603623820268226 },
    S: { B: -0.7211965654669841, S: -0.6658631448798212 },
}

P['B']['E'] is the probability of transitioning from state B to state E (the structure stores the logarithm of the probability for ease of computation), here about 0.6. Similarly, P['B']['M'] corresponds to a probability of about 0.4, meaning that when a character starts a word, the next character is more likely to end the word than to sit in its middle. This is intuitive, since two-character words are more common than longer ones.

The state emission probability is the probability of observing a particular character given the current state, and satisfies the observation independence assumption. The structure is the same as above and can also be represented as a nested object:

P = {
    B: { '突': -2.70366861046, '肃': -10.2782270947, '适': -5.57547658034 },
    M: { '要': -4.26625051239, '合': -2.1517176509, '成': -5.11354837278 },
    S: { …… },
    E: { …… },
}

P['B']['突'] means that, given state B, the log probability of observing the character "突" is -2.70366861046.

Finally, the Viterbi algorithm takes the observation sequence as input, uses the initial state probabilities, transition probabilities, and emission probabilities as parameters, and outputs the most probable state sequence, i.e. the maximum-probability segmentation. This article does not cover the Viterbi algorithm in detail; interested readers can look it up on their own.
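For readers who want a concrete picture anyway, here is a minimal, illustrative Viterbi sketch over the BEMS states. The `start`, `trans`, and `emit` parameters hold log-probability tables shaped like the structures shown above; they are inputs here, not reproduced:

const MIN_LOG = -3.14e100 // stands in for log(0)

function viterbi (observations, states, start, trans, emit) {
  const V = [{}]   // V[t][state]: best log probability of any path ending in `state` at step t
  const path = {}  // path[state]: best state sequence ending in `state`

  for (const s of states) {
    V[0][s] = (start[s] ?? MIN_LOG) + ((emit[s] && emit[s][observations[0]]) ?? MIN_LOG)
    path[s] = [s]
  }

  for (let t = 1; t < observations.length; t++) {
    V.push({})
    const newPath = {}
    for (const s of states) {
      let best = -Infinity
      let bestPrev = states[0]
      for (const prev of states) {
        const score = V[t - 1][prev]
          + ((trans[prev] && trans[prev][s]) ?? MIN_LOG)
          + ((emit[s] && emit[s][observations[t]]) ?? MIN_LOG)
        if (score > best) { best = score; bestPrev = prev }
      }
      V[t][s] = best
      newPath[s] = path[bestPrev].concat(s)
    }
    Object.assign(path, newPath)
  }

  const last = observations.length - 1
  const finalState = states.reduce((a, b) => (V[last][a] >= V[last][b] ? a : b))
  return path[finalState] // a sequence like ['B', 'E', 'S', ...]; words end at E or S positions
}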

These two techniques are the technical core of our solution. Building on them, we improved the Electron-side architecture of NetEase Yunxin IM.

2 NetEase Yunxin IM architecture on Electron

Detailed architecture diagram

Since full-text search is only one feature of IM, and in order not to affect other IM functions while allowing faster iteration, we adopted the following architecture:

On the right is the previous architecture. The underlying storage is IndexedDB, with a read module and a write module on top:

When the user sends a message, synchronizes messages, deletes a message, or receives a message, the message object is synchronized to IndexedDB;

When the user queries a keyword, all message objects in IndexedDB are traversed and indexOf is used to check whether each message contains the keyword (similar to LIKE).

As a result, queries become very slow once the data volume is large.

On the left is the new architecture, which adds word segmentation and an inverted index library. It has no impact on the previous scheme; it simply adds a layer in front of it:

When the user sends a message, synchronizes messages, deletes a message, or receives a message, the content of each message object is segmented and then synchronized to the inverted index library;

When the user queries a keyword, the idClients of the matching messages are first looked up in the inverted index library, and the corresponding message objects are then fetched from IndexedDB by idClient and returned to the user (see the sketch below).
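To make the two new paths concrete, here is a minimal sketch. The search-index calls use its documented PUT/QUERY methods, while `getMsgsFromIndexedDB`, `msg.text`, and the exact result shape are illustrative assumptions rather than the SDK's real API:

const nodejieba = require('nodejieba')

// Write path: segment the message content and index it under its idClient.
async function indexMessage (invertedIndex, msg) {
  const tokens = nodejieba.cut(msg.text) // msg.text is an assumed field name
  await invertedIndex.PUT([{ _id: msg.idClient, tokens }])
}

// Read path: resolve the keyword to idClients via the inverted index,
// then load the full message objects from IndexedDB and return them.
async function queryMessages (invertedIndex, getMsgsFromIndexedDB, keyword) {
  const { RESULT } = await invertedIndex.QUERY({ AND: nodejieba.cut(keyword) })
  // RESULT items carry the document _id (the shape may differ across search-index versions)
  const idClients = RESULT.map(item => item._id)
  return getMsgsFromIndexedDB(idClients)
}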

Architecture advantages

This scheme has the following four advantages:

Speed: the inverted index implemented with search-index greatly improves search speed.

Cross-platform: since search-index and IndexedDB are both based on LevelDB, and search-index also supports the browser environment, this opens up the possibility of implementing full-text search on the Web as well.

Independence: the inverted index library is separated from IndexedDB, the main IM business database. When IndexedDB writes data, it automatically notifies the write module of the inverted index library, which segments the message content, pushes it onto a write queue, and writes it into the inverted index library in order. When a full-text search is needed, the read module of the inverted index library quickly finds the idClients of the message objects matching the keyword, then fetches the corresponding message objects from IndexedDB by idClient and returns them.

Flexibility: full-text search is provided as a plug-in. It exposes a higher-order function that wraps IM and returns a new IM that inherits from and extends the original. Thanks to JS's prototype mechanism, any method missing on the new IM is automatically looked up along the prototype chain (i.e. on the original IM), so the plug-in can focus on implementing its own methods without caring about the specific IM version. The plug-in also accepts a custom word segmentation function to cover different users' segmentation needs (see the sketch after this list).
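The plug-in mechanism can be sketched as a higher-order function. All identifiers here (`withFullTextSearch`, `invertedIndex`, `queryByKeyword`, `getLocalMsgsByIdClients`) are illustrative, not the SDK's real API; the point is the prototype-chain lookup described above:

function withFullTextSearch (IM, { tokenize } = {}) {
  return class FullTextIM extends IM {
    constructor (options) {
      super(options)
      // Callers may inject their own segmentation function.
      this.tokenize = tokenize || (text => text.split(''))
    }

    // Capability added by the plug-in; every other method is resolved through
    // the prototype chain, i.e. on the original IM.
    async queryByKeyword (keyword) {
      const idClients = await this.invertedIndex.search(this.tokenize(keyword)) // hypothetical module
      return this.getLocalMsgsByIdClients(idClients)                            // hypothetical method
    }
  }
}

// Usage: const ExtendedIM = withFullTextSearch(IM); const im = new ExtendedIM(options)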

Effect

With the new architecture, our tests show that at a data volume of about 200,000 messages, search time dropped from over ten seconds to under one second, roughly a 20x speedup.

3 Summary

Above, we implemented full-text search of NetEase Yunxin IM SDK chat messages on Electron based on Nodejieba and search-index, greatly speeding up chat history search. We will keep optimizing in areas such as:

Write performance: in practice we found that when the data volume is large, LevelDB, the underlying database that search-index depends on, becomes a write bottleneck and consumes a lot of CPU and memory. Our investigation showed that SQLite's write performance is noticeably better: write time grows roughly in proportion to the amount of data, and CPU and memory usage stay relatively stable. We may therefore consider compiling SQLite into a native Node module to replace search-index in the future.

Scalability: business logic is not yet fully decoupled, and some business fields are currently stored in the inverted index library. In the future, the inverted index library could be used only to look up message idClients by keyword, while searches involving business attributes are handled in IndexedDB, completely decoupling the inverted index library from the main business database.

That's all for this article. Follow us for more technical content, and I hope this sharing is helpful to you.

About the author

Li Ning is a senior front-end development engineer at NetEase Yunxin, responsible for application, component, and solution development for NetEase Yunxin's audio/video and IM SDKs. He has extensive hands-on experience with React, PaaS component design, and multi-platform development and build tooling. If you have any questions, feel free to leave a comment.

Further reading

NetEase hands-on sharing | Yunxin IM SDK interface design practice

[NetEase hands-on guide] How to implement IM chats with thousands of participants?