Elasticsearch (advanced) document processing

Document conflict

When we update a document using the index API, we read the original document, make our modifications, and then re-index the entire document in one go. The most recent index request wins: whichever document is indexed last is the one stored in Elasticsearch. If other people change the document at the same time, their changes are lost.

In many cases this is not a problem. Perhaps our primary data store is a relational database, and we simply copy the data into Elasticsearch to make it searchable. Perhaps the chance of two people changing the same document at the same time is very small. Or perhaps occasionally losing changes is not a serious problem for our business.

But sometimes losing a change is very serious. Imagine that we use Elasticsearch to store the stock levels of the products in our online store. Every time we sell a product, we decrement its stock count in Elasticsearch. One day, management decides to run a promotion, and suddenly we are selling several products every second. Suppose two web processes, web_1 and web_2, run in parallel and both handle sales of the same product at the same time: each reads the current stock_count, decrements it, and re-indexes the document.

The change that web_1 made to stock_count is lost because web_2 does not know that its copy of stock_count is out of date. As a result, we believe we have more stock than we actually do, and we disappoint customers by selling them goods that do not exist.
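
The lost update can be reproduced without a cluster. The following is a minimal sketch that uses a plain Python dictionary as a stand-in for the index; the helper names read_stock and index_doc and the starting value of 100 are assumptions made for illustration, not part of any Elasticsearch client.

```python
# In-memory stand-in for the stored document; not an Elasticsearch client.
store = {"stock_count": 100}

def read_stock():
    """Return a private copy of the document, as a GET request would."""
    return dict(store)

def index_doc(doc):
    """Overwrite the stored document unconditionally: the last write wins."""
    store.clear()
    store.update(doc)

# web_1 and web_2 both read the document before either of them writes.
doc_web_1 = read_stock()
doc_web_2 = read_stock()

# Each process sells one item and re-indexes its own copy.
doc_web_1["stock_count"] -= 1
index_doc(doc_web_1)  # stock_count is now 99

doc_web_2["stock_count"] -= 1
index_doc(doc_web_2)  # still 99: web_1's decrement has been overwritten

print(store["stock_count"])  # prints 99, although two items were sold
```

Two items were sold, but the stored count dropped by only one, which is exactly the overselling situation described above.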

The more frequently changes occur, or the longer the gap between reading the data and updating it, the more likely we are to lose changes.

In the database world, two approaches are commonly used to ensure that changes are not lost during concurrent updates:

Pessimistic concurrency control:
This approach is widely used by relational databases. It assumes that conflicting changes are likely to occur, so it blocks access to a resource in order to prevent them. A typical example is locking a row before reading it, which ensures that only the thread that placed the lock can modify that row.

Optimistic concurrency control:
This approach, which Elasticsearch uses, assumes that conflicts are unlikely and does not block the operation being attempted. However, if the underlying data has been modified between the read and the write, the update fails. The application then decides how to resolve the conflict: for example, it can retry the update with fresh data, or it can report the situation to the user.
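
To make the optimistic approach concrete, here is a minimal sketch of a check-on-write store with a retry loop. The Store class, the VersionConflict exception, and the sell_one helper are hypothetical names invented for this example; they only imitate the behavior described above and are not how Elasticsearch is implemented.

```python
class VersionConflict(Exception):
    """Raised when the document changed after it was read."""

class Store:
    """Hypothetical in-memory store that checks a version number on write."""
    def __init__(self, doc):
        self.doc = dict(doc)
        self.version = 1

    def get(self):
        return dict(self.doc), self.version

    def put(self, doc, expected_version):
        # Nothing is locked; the write is rejected only if it is stale.
        if expected_version != self.version:
            raise VersionConflict(
                f"expected {expected_version}, current {self.version}")
        self.doc = dict(doc)
        self.version += 1

def sell_one(store, max_retries=3):
    """Decrement stock_count, re-reading and retrying after a conflict."""
    for _ in range(max_retries):
        doc, version = store.get()
        doc["stock_count"] -= 1
        try:
            store.put(doc, expected_version=version)
            return doc["stock_count"]
        except VersionConflict:
            continue  # someone else wrote first; retry with fresh data
    raise VersionConflict("gave up after retries")

store = Store({"stock_count": 100})

# web_2 reads a copy, then web_1 completes a sale before web_2 writes.
stale_doc, stale_version = store.get()
sell_one(store)  # web_1: 100 -> 99
stale_doc["stock_count"] -= 1
try:
    store.put(stale_doc, expected_version=stale_version)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True  # web_2's stale write is rejected

remaining = sell_one(store)  # web_2 retries with fresh data: 99 -> 98
print(conflict_detected, remaining)
```

Retrying is only one possible policy. An application could equally surface the conflict to the user instead of calling sell_one again.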

Optimistic concurrency control

Elasticsearch is distributed. When a document is created, updated, or deleted, the new version of the document must be replicated to other nodes in the cluster. Elasticsearch is also asynchronous and concurrent, which means that these replication requests are sent in parallel and may arrive at their destination out of order. Elasticsearch needs a way to ensure that an older version of a document never overwrites a newer one.

When we discussed index, GET, and delete requests earlier, we pointed out that each document has a _version number that is incremented whenever the document is modified. Elasticsearch uses this version number to ensure that changes are applied in the correct order. If an older version of a document arrives after a newer version, it can simply be ignored.
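
The "ignore the older version" rule can be sketched in a few lines. This is a deliberately simplified model under our own assumptions: the apply_replication function and the dictionary-based replica are illustrative names, and real replication in Elasticsearch involves far more than this comparison.

```python
def apply_replication(replica, incoming):
    """Apply an incoming copy only if it is newer than what the replica holds."""
    if incoming["_version"] > replica.get("_version", 0):
        replica.clear()
        replica.update(incoming)
        return True
    return False  # a stale copy arrived late, so it is ignored

replica = {}
v1 = {"_version": 1, "stock_count": 100}
v2 = {"_version": 2, "stock_count": 99}

# The requests were sent in parallel and version 2 happens to arrive first.
applied_v2 = apply_replication(replica, v2)
applied_v1 = apply_replication(replica, v1)

print(applied_v2, applied_v1, replica["stock_count"])
```

Whatever the arrival order, the replica ends up holding version 2.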

We can use the version number to ensure that conflicting changes in the application will not cause data loss. We do this by specifying the version number of the document we want to modify. If the version is not the current version number, our request will fail.

Suppose the current version number of a document is 1. After we modify the document, its version number changes from 1 to 2. We can then test the check by sending an update that specifies a version number.

Older versions of Elasticsearch accepted the version parameter for this check, but newer versions no longer support it with internal versioning. They return the following error, which tells us to use if_seq_no and if_primary_term instead:

{
    "error": {
        "root_cause": [
            {
                "type": "action_request_validation_exception",
                "reason": "Validation Failed: 1: internal versioning can not be used for optimistic concurrency control. Please use `if_seq_no` and `if_primary_term` instead;"
            }
        ],
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: internal versioning can not be used for optimistic concurrency control. Please use `if_seq_no` and `if_primary_term` instead;"
    },
    "status": 400
}

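Solution:

First retrieve the document and note the _seq_no and _primary_term values in the response. Then pass those values as the if_seq_no and if_primary_term parameters of the update request. The update succeeds only if both values still match the document's current values.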
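
As a sketch of that flow, the snippet below takes the _seq_no and _primary_term from a GET response and builds the path for a conditional index request. It only constructs the URL path and sends nothing. The index name products, the document ID, the response values, and the helper name conditional_index_path are assumptions for illustration; if_seq_no and if_primary_term are the parameters named in the error message above.

```python
from urllib.parse import urlencode

def conditional_index_path(index, doc_id, get_response):
    """Build the path for an index request that should succeed only if the
    document is unchanged since get_response was fetched."""
    params = {
        "if_seq_no": get_response["_seq_no"],
        "if_primary_term": get_response["_primary_term"],
    }
    return f"/{index}/_doc/{doc_id}?{urlencode(params)}"

# Abbreviated shape of a GET response; the values are illustrative.
get_response = {"_id": "1", "_version": 2, "_seq_no": 1, "_primary_term": 1}

path = conditional_index_path("products", "1", get_response)
print(path)  # /products/_doc/1?if_seq_no=1&if_primary_term=1
```

The resulting path would be used with a PUT request whose body is the new document. If another writer has changed the document in the meantime, the values no longer match and the request is rejected with a version conflict.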

External system version control

A common setup is to use another database as the primary data store and Elasticsearch for search. This means that every change to the primary database must be replicated to Elasticsearch as it happens. If multiple processes are responsible for this synchronization, you may run into a concurrency problem similar to the one described earlier.

If your primary database already has version numbers, or a field whose value can serve as a version number, such as a timestamp, then you can reuse those same version numbers in Elasticsearch by adding version_type=external to the query string. Version numbers must be integers greater than zero and less than about 9.2E+18, that is, a positive long value in Java.

The handling of the external version number is somewhat different from the internal version number that we discussed earlier. Elasticsearch does not check whether the current _version is the same as the version number specified in the request, but checks whether the current _version is less than the specified version number. If the request is successful, the external version number is stored as the new _version of the document.
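
The comparison rule and the corresponding request can be sketched as follows. The function names, the index name products, and the document ID are hypothetical; the snippet only models the "current _version is less than the specified version" check described above and builds a URL path without contacting a cluster.

```python
from urllib.parse import urlencode

def external_version_accepts(current_version, external_version):
    """Rule described above: accept the write only if the stored _version
    is lower than the version number supplied with the request."""
    return current_version < external_version

def external_index_path(index, doc_id, external_version):
    """Build the path for an index request that uses external versioning."""
    params = {"version": external_version, "version_type": "external"}
    return f"/{index}/_doc/{doc_id}?{urlencode(params)}"

current = 2
accepts_newer = external_version_accepts(current, 3)  # 3 would become the new _version
accepts_same = external_version_accepts(current, 2)   # not strictly greater, so rejected

print(accepts_newer, accepts_same)
print(external_index_path("products", "1", 3))
# /products/_doc/1?version=3&version_type=external
```

Because the check is a "less than" comparison rather than an equality test, a synchronization process can replay the primary database's own version numbers, and any update that arrives late is rejected.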

A quick test:

First, retrieve the document and note that its version number is 2.

Next, index the document with version_type=external and an external version number of 3. Because 3 is greater than the current version number, the request succeeds.

Finally, query the document again. The version number is now 3, and the new data is returned.