[2021 latest edition] Elasticsearch interview questions summary (24 questions with answer analysis)


Many readers have been interviewing recently. For them, I have compiled a Java interview question manual covering: basic knowledge, Java OOP, Java collections/generics, Java exceptions, IO and NIO in Java, Java reflection, Java serialization, Java annotations, multithreading & concurrency, JVM, MySQL, Redis, Memcached, MongoDB, Spring, Spring Boot, Spring Cloud, RabbitMQ, Dubbo, MyBatis, ZooKeeper, data structures, algorithms, Elasticsearch, Kafka, microservices, Linux, and more, to share with everyone. [Continuously updated]

Full version of Java interview questions address: [2021 latest version] Summary of real Java interview questions

1. [2021 latest edition] JavaOOP interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115288673
2. [2021 latest edition] Java basics interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115485109
3. [2021 latest edition] Multithreading & concurrency interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115489616
4. [2021 latest edition] JVM interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115555086
5. [2021 latest edition] MySQL interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115561030
6. [2021 latest edition] Redis interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115642129
7. [2021 latest edition] Memcached interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115664662
8. [2021 latest edition] MongoDB interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115672336
9. [2021 latest edition] Spring interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115738909
10. [2021 latest edition] Spring Boot interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115771307
11. [2021 latest edition] Spring Cloud interview questions summary: https://blog.csdn.net/m0_48795607/article/details/115917190
12. [2021 latest edition] RabbitMQ interview questions summary: https://blog.csdn.net/m0_48795607/article/details/116064045
13. [2021 latest edition] Dubbo interview questions summary: https://blog.csdn.net/m0_48795607/article/details/116237861
14. [2021 latest edition] MyBatis interview questions summary: https://blog.csdn.net/m0_48795607/article/details/116427170
15. [2021 latest edition] ZooKeeper interview questions summary: https://blog.csdn.net/m0_48795607/article/details/116458096
16. [2021 latest edition] Data structure interview questions summary: https://blog.csdn.net/m0_48795607/article/details/116461620
17. [2021 latest edition] Algorithm interview questions summary: https://blog.csdn.net/m0_48795607/article/details/116461620
18. [2021 latest edition] Kafka interview questions summary: not yet updated
19. [2021 latest edition] Microservice interview questions summary: not yet updated
20. [2021 latest edition] Linux interview questions summary: not yet updated

1. How much do you know about Elasticsearch? Talk about your company's ES cluster architecture, index data size, number of shards, and some tuning methods.

answer:

Interviewer: I want to know the ES usage scenarios and scale the candidate has been exposed to, and whether they have done relatively large-scale index design, planning, and tuning.

answer:

Answer truthfully in combination with your own practical scenarios.

For example: the ES cluster architecture has 13 nodes; the data is split into 20+ indexes by channel, and 20+ new indexes are created per day by date.

Each index has 10 shards, with a daily increment of 100+ million documents; the daily index size for each channel is kept within 150 GB.

Tuning methods only at the index level:

1.1 Tuning at the design stage

1. According to incremental business needs, create indexes based on a date template and roll them over through the rollover API (a minimal sketch follows this list);

2. Use aliases for index management;

3. Perform force_merge operations on indexes in the early morning every day to free up space;

4. Adopt a hot-cold separation mechanism: store hot data on SSD to improve retrieval efficiency, and periodically run shrink on cold indexes to reduce storage;

5. Use curator to manage the life cycle of the index;

6. Set a reasonable analyzer only for the fields that actually need to be analyzed (tokenized);

7. At the mapping stage, fully consider the attributes of each field: whether it needs to be indexed for retrieval, whether it needs to be stored, and so on.
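To make items 1 to 3 above concrete, here is a minimal sketch using the REST API. The index name blog_index-000001, the write alias blog_write, and the rollover conditions are hypothetical examples, and the exact syntax can differ slightly between ES versions:

PUT blog_index-000001
{
  "aliases": { "blog_write": {} }
}

POST blog_write/_rollover
{
  "conditions": { "max_age": "1d", "max_docs": 100000000 }
}

POST blog_index-000001/_forcemerge?max_num_segments=1

The rollover call is typically run from a scheduler (or handled by a lifecycle tool such as curator), and force_merge is scheduled in the early morning against indexes that are no longer being written to.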

1.2 Tuning at the write stage

1. Set the number of replicas to 0 before writing (see the sketch after this list);

2. Set refresh_interval to -1 before writing to disable the refresh mechanism;

3. Use bulk writes during the writing process;

4. Restore the number of replicas and the refresh interval after writing;

5. Use automatically generated document IDs whenever possible.
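A sketch of steps 1, 2, and 4 using the index settings API (the index name is hypothetical): the replica count and refresh interval are lowered before the bulk import and restored afterwards:

PUT blog_index-000001/_settings
{ "index": { "number_of_replicas": 0, "refresh_interval": "-1" } }

(run the bulk writes here)

PUT blog_index-000001/_settings
{ "index": { "number_of_replicas": 1, "refresh_interval": "1s" } }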

1.3 Query tuning

1. Disable wildcard;

2. Avoid large batch terms queries (scenarios with hundreds or thousands of terms);

3. Make full use of the inverted index mechanism: use the keyword type wherever possible;

4. When the data volume is large, narrow down the target indexes based on time before searching;

5. Set up a reasonable routing mechanism (a sketch combining these points follows).
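As an illustration of items 3 to 5, a sketch of a query that filters on a keyword field, restricts the time range, and passes a routing value. The index pattern, field names, and routing key are hypothetical:

GET blog_index-*/_search?routing=channel_01
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "channel": "channel_01" } },
        { "range": { "publish_time": { "gte": "now-7d" } } }
      ]
    }
  }
}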

1.4 Other tuning

Deployment tuning, business tuning, etc.

By mentioning part of the above, you show the interviewer that you have genuine prior practice or operation and maintenance experience.

2. What is the inverted index of Elasticsearch?

answer:

Interviewer: I want to know your understanding of basic concepts.

Answer: Just a simple explanation.

Traditional retrieval finds the positions of a keyword by traversing articles one by one.

An inverted index uses a word segmentation (analysis) strategy to build a mapping table between terms and the articles that contain them. This dictionary plus mapping table is the inverted index.

With an inverted index, articles can be retrieved with close to O(1) time complexity, which greatly improves retrieval efficiency.


Academic answering methods:

In contrast to a forward index, which records which words an article contains, an inverted index starts from the word and records which documents the word appears in. It consists of two parts: a term dictionary and an inverted (postings) table.

Bonus item: The underlying implementation of the inverted index is based on: FST (Finite State Transducer) data structure.

Lucene has made extensive use of the FST data structure since version 4. FST has two advantages:

1. Small space usage: by reusing shared word prefixes and suffixes in the dictionary, storage space is compressed;

2. Fast queries: O(len(str)) query time complexity.

3. What should I do if there is too much Elasticsearch index data, how to tune and deploy?

answer:

Interviewer: I want to understand the operation and maintenance capabilities of large amounts of data.

Answer: Index data should be planned at an early stage, as the saying goes, "design first, coding second", so as to effectively avoid a sudden surge of data that exceeds the cluster's processing capability and affects online retrieval or other business.

How to tune, as mentioned in question 1, here is a detailed description:

3.1 Dynamic index level

Create rolling indexes based on template + time + rollover API. For example, at the design stage define the template of the blog index as blog_index_<timestamp>, with incremental data every day.

The advantage of this is that a single index does not become excessively large as the data volume surges, approaching the per-shard document limit (on the order of 2^31, about 2.1 billion documents), with index storage reaching TB+ or even larger.
Once a single index grows too large, storage and other risks follow, so plan ahead and avoid it early.

3.2 Storage level

Store cold and hot data separately: hot data (for example, the last 3 days or the last week of data) stays on fast nodes, and the rest is treated as cold data. For cold indexes that no longer receive new writes, consider periodic force_merge plus shrink operations to save storage space and improve retrieval efficiency (a sketch follows).
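A sketch of the cold-data handling; the index names and the target node name are hypothetical. Shrink requires the index to be made read-only and a copy of every shard to be relocated onto one node first:

PUT cold_blog_index/_settings
{
  "index.routing.allocation.require._name": "cold_node_01",
  "index.blocks.write": true
}

POST cold_blog_index/_shrink/cold_blog_index_shrunk
{ "settings": { "index.number_of_shards": 1 } }

POST cold_blog_index_shrunk/_forcemerge?max_num_segments=1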

3.3 Deployment level

If no planning was done beforehand, this is an emergency strategy. Taking advantage of ES's support for dynamic scaling, dynamically adding machines can relieve cluster pressure. Note: if the master nodes and other planning are reasonable, dynamic scaling can be completed without restarting the cluster.

4. How does Elasticsearch implement master election?

answer:

Interviewer: If you want to understand the underlying principles of ES clusters, you no longer only focus on the business level.

answer:

Prerequisites:

1. Only a master-eligible node (node.master: true) can become the master node.

2. The purpose of the minimum number of master nodes (discovery.zen.minimum_master_nodes) is to prevent split brain.

I have read various online analyses and source-code walkthroughs. Checking the code, the core entry point is findMaster: if a master node is successfully elected, the corresponding Master is returned, otherwise null is returned. The election process is roughly as follows:

Step 1: Confirm that the number of master-eligible candidate nodes meets the threshold, i.e. the value of discovery.zen.minimum_master_nodes set in elasticsearch.yml;
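For the pre-7.x Zen discovery described here, this value can also be adjusted dynamically through the cluster settings API instead of only in elasticsearch.yml; 2 is an example value for a cluster with 3 master-eligible nodes (in 7.x and later the setting was removed and quorum handling is automatic):

PUT _cluster/settings
{ "persistent": { "discovery.zen.minimum_master_nodes": 2 } }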

Step 2: Comparison: first check master eligibility; master-eligible candidates are ranked first. If both nodes are master-eligible candidates, the one with the smaller id becomes the master node. Note that the id here is of string type.

Digression: How to get the node id.

GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax,id,name

ip port heapPercent heapMax id name

5. Describe in detail the process of Elasticsearch indexing documents.

answer:

Interviewer: If you want to understand the underlying principles of ES, you no longer only focus on the business level.

answer:

Indexing a document here should be understood as the process of writing the document into ES and creating the index.

Document writing includes single-document writes and bulk writes; here only the single-document write process is explained.

Remember this picture in the official document.


Step 1: The client sends a write request to one node in the cluster. (If no routing/coordinating node is specified, the node that receives the request acts as the coordinating node.)

Step 2: After node 1 receives the request, it uses the document _id to determine that the document belongs to shard 0. The request is forwarded to another node, say node 3, because the primary of shard 0 is allocated to node 3.

Step 3: Node 3 executes the write operation on the primary shard. If it succeeds, it forwards the request in parallel to the replica shards on node 1 and node 2 and waits for the results. Once all replica shards report success, node 3 reports success to the coordinating node (node 1), and node 1 reports write success to the requesting client.

If the interviewer asks further: how is the target shard determined in step 2?

Answer: with the help of a routing algorithm. The routing algorithm calculates the target shard id from the routing value and the document id:

shard = hash(_routing) % num_of_primary_shards
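A small illustration of custom routing on a write, shown in 7.x-style syntax; the index name, document fields, and routing value are hypothetical. Documents written with the same routing value land on the same shard, and queries for that data should then pass the same routing parameter:

PUT blog_index/_doc/1?routing=channel_01
{ "title": "hello es", "channel": "channel_01" }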

6. Describe the process of Elasticsearch search in detail?

answer:

Interviewer: If you want to understand the underlying principles of ES search, you no longer only focus on the business level.

answer:

The search is disassembled into two phases of "query then fetch".

The purpose of the query phase: locate the matching documents, but do not fetch them.

The steps are disassembled as follows:

1. Assume an index has 5 primary shards, each with 1 replica, 10 shard copies in total; a request will hit one copy of each shard (either the primary or a replica).

2. Each shard runs the query locally and puts its results into a local sorted priority queue.

3. The results of step 2 are sent to the coordinating node, which merges them into a globally sorted list. The purpose of the fetch phase: fetch the data.

The coordinating node fetches the required documents and returns them to the client.
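A minimal search sketch (the index and field names are hypothetical): in the query phase each shard builds a local priority queue of from + size entries (here 0 + 10), and in the fetch phase only the globally top 10 documents are actually retrieved:

GET blog_index/_search
{
  "from": 0,
  "size": 10,
  "query": { "match": { "title": "elasticsearch" } }
}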

7. When Elasticsearch is deployed, what are the optimization methods for Linux settings?

answer:

Interviewer: I want to know the operation and maintenance capabilities of ES clusters.

answer:

1. Disable memory swapping (swap)

2. Set the heap memory to min(node memory / 2, 32 GB)

3. Set the maximum number of file handles

4. Thread pool + queue size is adjusted according to business needs

5. Disk RAID mode: if conditions allow, use RAID 10 to increase single-node performance and guard against single-node storage failures

8. What is the internal structure of Lucene?

answer:

Interviewer: I want to know the breadth and depth of your knowledge.

Answer:


Lucene involves two processes, indexing and searching, covering three main points: index creation, indexing, and search. You can expand along this outline.

I recently interviewed at some companies and was asked questions related to Elasticsearch and search engines; the following are my own summarized answers.

9. How does Elasticsearch implement the Master election?

answer:

1. Master election in Elasticsearch is the responsibility of the ZenDiscovery module, which mainly includes Ping (nodes discover each other through this RPC) and Unicast (the unicast module contains a host list that controls which nodes need to be pinged);

2. Sort all master-eligible nodes (node.master: true) by nodeId in dictionary order; in each election, every node ranks the nodes it knows about and selects the first (0th) one, tentatively treating it as the master node.

3. If a node receives enough votes (more than half of the master-eligible nodes, n/2 + 1) and the node also votes for itself, then it becomes the master. Otherwise, a new election is held until the above conditions are met.

4. Supplement: the master node's responsibilities mainly include managing the cluster, nodes, and indexes; it is not responsible for document-level management. Data nodes can have the HTTP function turned off.

10. Among the nodes in an Elasticsearch cluster (say, 20 in total), 10 of them elect one master and the other 10 elect another master. What should be done?

answer:

1. When the number of master-eligible candidates is at least 3, the split-brain problem can be solved by setting the minimum number of votes (discovery.zen.minimum_master_nodes) to more than half of all candidate nodes (n/2 + 1);

2. When there are only two candidates, change the configuration so that only one of them is master-eligible and the other serves as a data node, thereby avoiding the split-brain problem.

11. How does the client select a specific node to execute the request when connecting to the cluster?

answer:

1. The TransportClient uses the transport module to connect to an Elasticsearch cluster remotely. It does not join the cluster; it simply obtains one or more initial transport addresses and communicates with them in a round-robin fashion.

12. Describe in detail the process of Elasticsearch indexing documents.

answer:

By default, the coordinating node uses the document ID in the calculation (custom routing is also supported) in order to route the document to the appropriate shard:

shard = hash(document_id) % (num_of_primary_shards)

1. When the node holding the shard receives the request from the coordinating node, it writes the request into the Memory Buffer, which is then written to the Filesystem Cache at regular intervals (every 1 second by default). This process from Memory Buffer to Filesystem Cache is called refresh;

2. Of course, in some cases the data in the Memory Buffer and Filesystem Cache may be lost. ES uses the translog mechanism to ensure data reliability: after a request is received, it is also written to the translog; when the data in the Filesystem Cache is flushed to disk, the translog is cleared. This process is called flush;

3. During flush, the in-memory buffer is cleared and its content is written into a new segment; the segments are fsynced, a new commit point is created, and the content is flushed to disk. The old translog is deleted and a new translog is started.

4. Flush is triggered periodically (every 30 minutes by default) or when the translog becomes too large (512 MB by default).
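The refresh and flush behaviors described above can also be triggered manually, and the related settings adjusted, through the REST API; a sketch with a hypothetical index name and the commonly cited defaults:

POST blog_index/_refresh
(makes documents still in the memory buffer searchable immediately)

POST blog_index/_flush
(fsyncs segments, writes a commit point, and clears the translog)

PUT blog_index/_settings
{ "index": { "refresh_interval": "1s", "translog.flush_threshold_size": "512mb" } }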


Supplement: Regarding Lucene segment:

1. The Lucene index is composed of multiple segments, and the segment itself is a fully functional inverted index.

2. The segment is immutable, allowing Lucene to incrementally add new documents to the index without having to rebuild the index from scratch.

3. For each search request, all segments in the index will be searched, and each segment will consume CPU clock cycles, file handles and memory. This means that the greater the number of segments, the lower the search performance.

4. In order to solve this problem, Elasticsearch will merge small segments into a larger segment, submit the new merged segment to disk, and delete the old small segments.

13. Describe in detail the process of Elasticsearch updating and deleting documents.

answer:

1. Deletes and updates are also write operations, but documents in Elasticsearch are immutable, so they cannot be deleted or modified in place;

2. Each segment on disk has a corresponding .del file. When a delete request is sent, the document is not actually deleted but is marked as deleted in the .del file. The document can still match queries, but it is filtered out of the results. When segments are merged, documents marked as deleted in the .del file are not written into the new segment.

3. When a new document is created, Elasticsearch assigns it a version number. When an update is performed, the old version of the document is marked as deleted in the .del file, and the new version is indexed into a new segment. The old version can still match queries, but it is filtered out of the results.

14. Describe in detail the process of Elasticsearch search.

answer:

1. The search is executed as a two-stage process, which we call Query Then Fetch;

2. In the initial query phase, the query will be broadcast to every shard copy (primary shard or replica shard) in the index. Each shard performs a search locally and builds a priority queue of matching documents with a size of from + size.

PS: searches query the Filesystem Cache, but some data may still be in the Memory Buffer, so search is near real-time.

3. Each shard returns the IDs and sort values of all documents in its priority queue to the coordinating node, which merges these values into its own priority queue to produce a globally sorted result list.

4. Next comes the fetch phase. The coordinating node determines which documents need to be fetched and issues multiple GET requests to the relevant shards. Each shard loads and enriches the documents, then returns them to the coordinating node. Once all documents have been fetched, the coordinating node returns the results to the client.

5. Supplement: with the Query Then Fetch search type, document relevance is scored using only the data of the local shard, which may not be accurate enough when the number of documents is small. DFS Query Then Fetch adds a pre-query step that gathers term and document frequencies across shards; scoring is more accurate, but performance is worse.
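A sketch of requesting the more accurate (but slower) scoring variant mentioned in the supplement; the index and field names are hypothetical:

GET blog_index/_search?search_type=dfs_query_then_fetch
{ "query": { "match": { "title": "elasticsearch" } } }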

15. In Elasticsearch, how do you find the corresponding inverted index based on a word?

answer:

SEE:

Lucene index file format (1)

Lucene index file format (2)

16. When Elasticsearch is deployed, what are the optimization methods for Linux settings?

answer:

1. A 64GB memory machine is ideal, but 32GB and 16GB machines are also very common. Less than 8GB will backfire.

2. If you have to choose between faster CPUs and more cores, it is better to choose more cores. The additional concurrency provided by multiple cores far outweighs the slightly faster clock frequency.

3. If you can afford SSDs, they far outperform any rotating media; SSD-based nodes see improved query and indexing performance. If you can afford it, SSD is a good choice.

4. Even if the data centers are close at hand, avoid clusters spanning multiple data centers. It is absolutely necessary to avoid clusters spanning large geographic distances.

5. Make sure that the JVM running your application is exactly the same as the JVM on the server. In several places Elasticsearch uses Java's native serialization.

6. By setting gateway.recover_after_nodes, gateway.expected_nodes, and gateway.recover_after_time, you can avoid excessive shard swapping when the cluster restarts, which may shorten data recovery from hours to seconds.

7. Elasticsearch is configured to use unicast discovery by default to prevent nodes from accidentally joining the cluster. Only nodes running on the same machine will automatically form a cluster. It is better to use unicast instead of multicast.

8. Don't arbitrarily modify the size of the garbage collector (CMS) and each thread pool.

9. Give half (or less) of your memory to the Elasticsearch heap (but no more than 32 GB!), set through the ES_HEAP_SIZE environment variable; the remaining memory is left to Lucene via the filesystem cache.

10. Swapping memory to disk is fatal to server performance. If memory is swapped to disk, a 100-microsecond operation can become 10 milliseconds; add up all those extra latencies and it is not hard to see how terrible swapping is for performance.

11. Lucene uses a lot of files. At the same time, Elasticsearch also uses a large number of sockets for communication between nodes and HTTP clients. All of this requires sufficient file descriptors. You should increase your file descriptor and set a large value, such as 64,000.

Supplement: performance improvement methods in the indexing phase

1. Use batch requests and adjust their size: around 5-15 MB of data per batch is a good starting point (a bulk sketch follows this list).

2. Storage: use SSD

3. Segments and merging: Elasticsearch's default merge throttle is 20 MB/s, which should be a good setting for mechanical disks. If you are using SSDs, consider increasing it to 100-200 MB/s.

If you are doing a bulk import and do not care about search at all, you can turn off merge throttling completely. In addition, you can increase the index.translog.flush_threshold_size setting from the default 512 MB to a larger value, such as 1 GB, so that larger segments can accumulate in the translog before a flush is triggered.

4. If your search results do not require near real-time accuracy, consider changing the index.refresh_interval of each index to 30s.

5. If you are doing bulk import, consider turning off replicas by setting index.number_of_replicas: 0.
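A sketch of the batch requests from point 1 (the index name, IDs, and fields are hypothetical; shown without a document type, as in 7.x). In practice, batches of roughly 5-15 MB are submitted repeatedly:

POST _bulk
{ "index": { "_index": "blog_index", "_id": "1" } }
{ "title": "doc one", "channel": "channel_01" }
{ "index": { "_index": "blog_index", "_id": "2" } }
{ "title": "doc two", "channel": "channel_02" }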

17. Regarding GC, what should I pay attention to when using Elasticsearch?

answer:

1. SEE: https://elasticsearch.cn/article/32

2. The term dictionary of the inverted index needs to reside in memory permanently and cannot be GC'd; the growth trend of segment memory on data nodes needs to be monitored.

3. All kinds of caches, such as the field cache, filter cache, indexing cache, and bulk queue, should be set to reasonable sizes. Check whether the heap is sufficient in the worst case, that is, when all the caches are full, is there still heap space available for other tasks? Avoid "self-deceiving" ways of freeing memory such as clearing the caches.

4. Avoid searches and aggregations that return very large result sets. Scenarios that genuinely need to pull a lot of data can use the scan & scroll API (a sketch follows this list).

5. Cluster stats reside in memory and cannot scale horizontally; super-large clusters can be split into multiple clusters connected through tribe nodes.

6. If you want to know if the heap is enough, you must combine the actual application scenarios and continuously monitor the heap usage of the cluster.
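A sketch of the scroll approach from point 4 (in recent versions the plain scroll API replaces the older scan mode; the index name and keep-alive value are hypothetical): the first request opens a scroll context, and subsequent requests page through it using the returned _scroll_id:

GET blog_index/_search?scroll=1m
{ "size": 1000, "query": { "match_all": {} } }

GET _search/scroll
{ "scroll": "1m", "scroll_id": "<_scroll_id from the previous response>" }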

18. How does Elasticsearch implement aggregations on large amounts of data (on the order of hundreds of millions)?

answer:

The first approximate aggregation provided by Elasticsearch is the cardinality metric. It provides the cardinality of a field, that is, the number of distinct (unique) values of that field. It is based on the HLL (HyperLogLog) algorithm: HLL first hashes the input, then estimates the cardinality from the bits of the hash result. Its characteristics are: configurable precision, used to control memory usage (more precise = more memory); very high accuracy on small data sets; and fixed memory usage that can be set through configuration parameters. Whether there are thousands or billions of unique values, memory usage depends only on the configured precision.
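A sketch of a cardinality aggregation with an explicit precision setting (the index and field names are hypothetical; precision_threshold trades memory for accuracy and has an upper limit of 40000):

GET blog_index/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": { "field": "user_id", "precision_threshold": 10000 }
    }
  }
}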

19. In the case of concurrency, how does Elasticsearch guarantee consistent reading and writing?

answer:

1. Optimistic concurrency control via version numbers can be used to ensure that a new version is not overwritten by an old one; specific conflicts are handled by the application layer;

2. In addition, for write operations the consistency level supports quorum/one/all and defaults to quorum, that is, writes are allowed only when a majority of shard copies are available. Even when a majority is available, a write to a replica may still fail for reasons such as network problems; the replica is then considered failed, and the shard is rebuilt on a different node.

3. For read operations, replication defaults to sync, which makes the operation return only after both the primary and replica shards have completed; if replication is set to async, you can also set the request parameter _preference to primary to query the primary shard and ensure the document is the latest version.
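A sketch of optimistic concurrency control via versioning on a write; the index, ID, and version value are hypothetical, shown with external versioning (internal-version checks moved to if_seq_no/if_primary_term in 7.x):

PUT blog_index/_doc/1?version=3&version_type=external
{ "title": "updated title" }

If the stored version is already 3 or higher, the request fails with a version conflict instead of silently overwriting the newer document.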

20. How to monitor Elasticsearch cluster status?

answer:

Marvel allows you to easily monitor Elasticsearch through Kibana. You can view the health and performance of your cluster in real time, as well as analyze past cluster, index, and node metrics.
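Besides Marvel/Kibana, cluster status can also be checked directly through a few read-only monitoring endpoints:

GET _cluster/health
GET _cat/nodes?v
GET _cat/indices?v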


22. Tell me about your personalized search plan?

answer:

SEE: implementing personalized search based on word2vec and Elasticsearch.

23. Do you understand the dictionary tree?

answer:

The commonly used dictionary data structure is shown below.


The core idea of a Trie is to trade space for time, using common string prefixes to reduce query time and thereby improve efficiency. It has 3 basic properties:

1. The root node does not contain characters, and each node except the root node contains only one character.

2. From the root node to a node, the characters passing through the path are connected to form the character string corresponding to the node.

3. All child nodes of each node contain different characters.


1. It can be seen that the number of nodes at level i of the trie can be as large as 26^i. To save space, we can use dynamic linked lists or arrays to simulate dynamic allocation; the space cost will not exceed the number of words multiplied by the word length.

2. Implementation options: allocate an array the size of the alphabet for each node; hang a linked list off each node; or record the tree with the left-child right-sibling representation;

3. For a Chinese dictionary tree, the child nodes of each node are stored in a hash table, so that not too much space is wasted while keeping the hash table's O(1) query complexity.

24. How does spelling error correction work?

answer:

1. Spelling correction is implemented based on edit distance. Edit distance is a standard method that represents the minimum number of insert, delete, and replace operations required to convert one string into another;

2. The calculation process of the edit distance: for example, to compute the edit distance between batyu and beauty, first create a 7×8 table (batyu has length 5, beauty has length 6, each plus 2), and then fill in the numbers at the known positions. The value of each remaining cell is the minimum of the following three values:

If the character at the top of the column equals the character at the left of the row, take the number diagonally up and to the left; otherwise, take that number plus 1 (for cell (3,3) this is 0).

The number to the left plus 1 (for cell (3,3) this is 2);

The number above plus 1 (for cell (3,3) this is 2).

Finally, the value in the lower-right corner is the edit distance: 3.
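The three rules above are the standard dynamic-programming recurrence for edit distance. Written out for strings a and b (1-indexed), with d(i, j) being the distance between the first i characters of a and the first j characters of b:

d(i, 0) = i,  d(0, j) = j
d(i, j) = min( d(i-1, j) + 1,  d(i, j-1) + 1,  d(i-1, j-1) + cost ),  where cost = 0 if a[i] = b[j], else 1

The value d(len(a), len(b)) in the lower-right corner of the table is the edit distance.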


For spelling error correction, we consider constructing a metric space (Metric Space), in which any relationship meets the following three basic conditions:

d(x, y) = 0: if the distance between x and y is 0, then x = y

d(x, y) = d(y, x): the distance from x to y equals the distance from y to x

d(x, y) + d(y, z) >= d(x, z): the triangle inequality

1. According to the triangle inequality, for another word B whose distance to the query word is within n: if the distance from the query word to A is d, then the distance from B to A is at most d + n and at least d - n.

2. The construction process of the BK tree is as follows:

Each node can have any number of child nodes, and each edge carries a value representing an edit distance. The label n on the edge from a child node to its parent means the edit distance between them is exactly n.
For example, take a tree whose root node is "book" with two child nodes "cake" and "books": the edge from "book" to "books" is labeled 1, and the edge from "book" to "cake" is labeled 4.

After constructing the tree from a dictionary, whenever you want to insert a new word, calculate the edit distance between the word and the root node and follow the edge labeled d(new word, root). Compare recursively with each child node until there is no matching child node; then create a new child node and store the new word there.

For example, to insert "boo" into the tree in the example above, we first check the root node, find d("book", "boo") = 1, and follow the edge labeled 1 to the child word "books". We then calculate d("books", "boo") = 2 and insert the new word under "books" with the edge labeled 2.

3. Querying similar words works as follows: calculate the edit distance d between the word and the root node, then recursively follow the edges of each child node labeled d - n to d + n (inclusive). If the distance between an examined node and the search word is at most n, return that node and continue the query.

For example, if you query "cape" with a maximum tolerated distance of 1, first calculate the edit distance to the root, d("book", "cape") = 4, and then follow the edges from the root labeled 3 to 5. This leads to the "cake" node; d("cake", "cape") = 1, so "cake" satisfies the condition and is returned. Then follow the edges from the "cake" node labeled 0 to 2, which leads to the "cape" and "cart" nodes, and "cape" satisfies the condition and is added to the results.

Summary

The interview questions above come with answer analysis; for the complete document, see: Elasticsearch interview question summary.