What is ZooKeeper

What is ZooKeeper?

ZooKeeper is a distributed, open source distributed application coordination service. It is an open source implementation of Google’s Chubby. It is the manager of the cluster.

Monitor the status of each node in the cluster and proceed with the next reasonable operation based on the feedback submitted by the node. In the end, a simple and easy-to-use interface and a system with high performance and stable function

Provided to users.

The client's read request can be processed by any machine in the cluster. If the read request has a listener registered on the node, the listener is also controlled by the connected zookeeper

The machine handles it. For write requests, these requests will be sent to other zookeeper machines at the same time and after they reach an agreement, the request will return success. Therefore, with

As the number of zookeeper cluster machines increases, the throughput of read requests will increase but the throughput of write requests will decrease.

Orderliness is a very important feature in zookeeper. All updates are globally ordered. Each update has a unique timestamp. This timestamp is called

zxid (Zookeeper Transaction Id). The read request will only be ordered relative to the update, that is, the return result of the read request will contain the latest zookeeper

So in essence,  ZooKeeper is a third party, also known as an intermediary. It builds a platform through which all other processes can communicate indirectly.

What does ZooKeeper provide

1. File system

2. Notification mechanism

Data model of ZooKeeper     Computers are actually used to process or store data, and so are the software that runs on them. As the coordinator of multiple processes, zookeeper It must be impossible to run. The storage of data and the placement of objects are the same, and they cannot be thrown around at will. This way, it occupies a lot of space, makes it unsightly, and difficult to find. So there must be a certain hierarchical structure. This is the data structure of computer professional courses. The simplest data structure is an array or linked list. They are called linear tables. They are one-dimensional and have a linear relationship, that is, the order of the front and back. The advantages are simple, but the disadvantages The function is not strong enough. Then there is the tree. It can be considered as a two-dimensional relationship. The left and right are brothers, and the upper and lower are sub-relationships, so it has a subordinate relationship. It is  a structure that takes into account both function and complexity . All kinds of organizational structures in real life are tree-shaped. The more complex is the graph. It is a file-like structure, which can be considered multi-dimensional. Since any node can be connected, it expresses a multilateral relationship. Although meritorious It is powerful but also very complicated. In reality, railways and interpersonal relations are all syllables. Of course, these are three types of data structures, each of which can be divided into many types. There are many variants of trees, although they are all called trees, but some of the differences are still very different. Large. ZooKeeper chose the tree as its self-storage data structure. In fact, it is very similar to the file system, as shown in the figure below:

When it comes to data, you cannot do without adding, deleting, modifying, and checking. For the tree,   adding is adding new nodes to the tree, deleting is deleting a node from the tree, and changing is to modify the storage on a node in the tree. For data, search is to find a node in the tree to read the data stored on it.    In white, the tree representation is a kind of structure. The real data is placed on the nodes, either leaf sub-nodes or non-leaf sub-nodes.      ZooKeeper capabilities     We start with the most common scenarios to understand how zookeeper is used at a macro level, and what capabilities it should have.       Scene one:       There are two application processes A and B. A processes the data first, notifies B after the processing is complete, and B then processes the data. How should we use zookeeper to accomplish this? Let's analyze it together.      First, process A connects to zookeeper and creates a node on it to represent the existence of itself. Let's say the node name is foo.       Then set a data called    doing    on the node , which indicates that the data is being processed. After processing for a while, update the data on the node to    done    .      In this way, the work of process A is finished. But how does this affect process B? We know that what zookeeper accomplishes is    indirect communication between processes, that is, there is no touch between processes.    Therefore, only the nodes in this tree can be used.    Process B should also connect to zookeeper, and then find the foo node, and see if the data on it has changed from doing to done, if it is self, start      Process the data, if not then continue to wait.       The problem is that process B can't stare at the foo node by itself. This is too tired and hurting, and it has to do other things. Who should do this? Obviously it is zookeeper.      So process B said to zookeeper, you stare at the foo node. When it becomes done, notify me and I will start.       Therefore,    zookeeper needs to have the ability to track and notify other processes    . This corresponds to a professional term in zookeeper, called    Wat ch    .      The function and usage of Wat ch are the same as described above. That is, process B finds the foo node and puts a Wat ch on it.      In this way, zookeeper knows that process B pays more attention to the foo node, so zookeeper stares at the foo node, and there is a stir, and then the process B is notified.         It should be noted that    this Wat ch is one-time, that is, it can only be used once. In other words, after zookeeper notifies process B, Wat ch is used up, and it will not be notified again in the future.      What if process B still needs to be notified? Very simple, it is on the foo node    recapture causes a new Wat ch can be. If you continue in this way, you can guarantee that you will be notified immediately.      I think the reason why this Wat ch is designed to be one-time is that zookeeper does not want to make yourself too tired. If you keep your eyes open and stare at too many things for too long, you will be really tired.      In addition, when zookeeper notifies process B, it can send the data stored in the foo node.       Close friends may have discovered that zookeeper can actively send notifications or push data to process B, indicating that the connection between zookeeper and process B needs to be maintained.       Because the location of process B is more random, it is originally a business process. Once the connection is disconnected, like a broken zheng, zookeeper can no longer find process B.      However, the location of zookeeper is fixed. Once the connection is disconnected, process B can initiate a connection request to zookeeper again.       If it is short enough, process B should be able to retrieve everything it once owned on zookeeper.       This involves the session, so zookeeper also has a certain session continuation capability, so that the original session can be retrieved when the disconnection time is running out.       Therefore, zookeeper should have the    ability to monitor nodes, notify processes, maintain connections, and continue sessions    .      Scene:       Sometimes for high usability or high performance, one application is usually run in multiple copies. If four copies are run, there are four processes, namely A, B, C, and D.      When a call comes over, and it is found that A, B, C, and D can all be called, then one call can be selected according to the configured load balancing strategy.       Assuming that the machine where the process D is located is unfortunately powered off, it is actually D hung up. Then if you call it again at this time, you will find that only A, B, and C can be adjusted, and D auto does not exist.       This is actually part of Dubbo's function, so how to implement it based on zookeeper? Let's analyze it as usual.      Since zookeeper is based on a tree-shaped data structure, it is still necessary to talk about nodes. When process A starts, it needs to connect to zookeeper, and then create a node to represent self.      The name of the node and the data stored on the node can be determined according to the actual situation, at least including the IP and port information of the process running. Processes B, C, and D do the same thing.      If the nodes of processes A, B, C, and D are all located under the same node, then after a call is over, as long as the node is found and all its sub-nodes are read out, all the available nodes will be obtained. Called process information.       If at a certain moment, process D hangs up, then the node corresponding to process D under the node should be automatically deleted by zookeeper. This professional terms zookeeper ⾥, called    temporary node    (    Ephemeral the Node    ). Then the corresponding one is naturally a permanent node.      In fact, the working process is like this. After the business process is started, a connection is established with zookeeper, and then a temporary node is created in zookeeper and the relevant information is written. Then maintain the connection with zookeeper through periodic hops.      Once the business process is down, zookeeper will not accept a jump. After a certain period of time, zookeeper will delete the corresponding temporary node, indicating that the business process is no longer available.       Dubbo's approach is to integrate the access name and IP port information and the information we set into a string similar to a URL, and then use this string as the name to create a temporary node.       Temporary nodes are not allowed to have child nodes, only permanent nodes can.           .        Zookeeper file system 3        Zookeeper provides a multi-level node namespace (nodes are called znodes). Different from the file system, these nodes can all set the associated data. In the file system, only the file node can store the data. The directory node is not. In order to ensure high throughput and low latency, Zookeeper maintains this tree-like directory structure in memory. This feature prevents Zookeeper from being used to store large amounts of data. The upper limit of data storage for each node is 1M.        4. Four types of znode 4        1. PERSISTENT-Persistent Directory Node         After the client disconnects from zookeeper, the node still exists         2. PERSISTENT_SEQUENTIAL-Persistent sequence number directory node         After the client is disconnected from zookeeper, the node still exists, but Zookeeper gives the node name a sequential number

3. EPHEMERAL-temporary directory node

After the client disconnects from zookeeper, the node is deleted

4. EPHEMERAL_SEQUENTIAL-temporary sequence number directory node

After the client disconnects from zookeeper, the node is deleted, but Zookeeper gives the node name a sequential number

5. Zookeeper notification mechanism

The client will create a watcher event for a znode. When the znode changes, these clients will be notified by zk, and then the client can follow

Znode changes to make business changes and so on.

6. What does Zookeeper do?

Naming service

Configuration management

Cluster management

Distributed lock

Queue management

7.zk naming service (file system)

Naming service refers to obtaining the address of a resource or service through a specified name, and using zk to create a global path, that is, the only path, this path can be used as

A name, pointing to the cluster in the cluster, the address of the service provided, or a remote object, etc.

8.zk configuration management (file system, notification mechanism )

The program is deployed on different machines in a distributed manner, and the configuration information of the program is placed under the znode of zk. When the configuration changes, that is, when the znode changes, you can

In order to change the content of a directory node in zk, use watcher to notify each client to change the configuration.

9. Zookeeper cluster management (file system, notification mechanism)

The so-called cluster management does not care about two points: whether there are machines to withdraw and join, and to elect the master.

For the first point, all machines agree to create a temporary directory node in the directory, and then monitor the subnode change message of the directory node. Once a machine hangs up, the machine and

Zookeeper's connection is disconnected, the temporary directory node it created is deleted, and all other machines are notified: a brother directory is deleted, so everyone knows

Dao: It's on board.

The addition of new machines is similar. All machines receive notifications: the new brother’s directory has been added, and the highcount has been added. For the second point, we slightly change it. All machines

Create a temporary sequential numbering directory node, and select the machine with the smallest number as the master each time.

10. Zookeeper distributed lock (file system, notification mechanism)

With the consistent file system of zookeeper, the problem of locking becomes easy. Lock services can be divided into two categories, one is to maintain exclusiveness, and the other is to control timing.

For the first category, we regard a znode on zookeeper as a lock, which is implemented by the method of createznode. All clients create

The /distribute_lock node, the client that is finally successfully created also owns this lock. After you use it, delete the distribute_lock node you created and release it


For the second category, /distribute_lock already exists in advance, and all clients create temporary sequential numbering directory nodes under it, like the master, with the smallest number

Get the lock, delete it when you use it up, and do it in turn.

11. The process of acquiring distributed locks

Create a temporary sequence node under the locker node when acquiring a distributed lock, and delete the temporary node when releasing the lock. The client calls the createNode method in the locker

Create a temporary sequence node, and then call getChildren("locker") to get all the sub-nodes under the locker. Note that no Watcher is set at this time.

After the client has obtained all the subnode paths, if it finds that the self-created node has the smallest serial number among all the created subnodes, then it is considered that the client has obtained it.

lock. If you find that the node created by yourself is not the smallest among all the subnodes of the locker, it means that you have not obtained the lock yet, and the client needs to find the smaller one.

Node, then call exist() method to it, and register event listener to it at the same time.

After that, let the watched node be deleted, and the Watcher of the client will receive the corresponding notification. At this time, it will judge whether the self-created node is the locker sub-node in sequence again.

The smallest number, if it is, the lock is obtained, if not, repeat the above steps to continue to obtain a smaller node and register to listen. In the current process, we still need

Many logical judgments.

The implementation of the code is mainly based on mutual exclusion locks. The key logic for obtaining distributed locks lies in BaseDistributedLock, which implements the detailed implementation of distributed locks based on Zookeeper.


12. Zookeeper queue management (file system, notification mechanism)

Two types of queues:

Synchronous queue, when the members of a queue are gathered, this queue can be used, otherwise it will wait for all members to arrive.

The queue enters and dequeues operations in a FIFO manner.

The first category is to create a temporary directory node under the agreed directory, and monitor whether the number of nodes is the number we require.

The second category is consistent with the basic principle of the control sequence scenario in the distributed lock service. The columns are numbered, and the columns are numbered. Create under a specific directory

The PERSISTENT_SEQUENTIAL node, when the creation is successful, the Watcher notifies the waiting queue, and the queue deletes the node with the smallest sequence number for consumption.

In this scenario, Zookeeper’s znode is used for message storage, the data stored by znode is the message content in the message queue, and the SEQUENTIAL sequence number is the message’s code.

No., just take out in order. Since the created node is persistent, there is no need to worry about the loss of queue messages.

13.Zookeeper data replication

As a cluster, Zookeeper provides consistent data services. Naturally, it needs to replicate data among all machines. Benefits of data replication:

Fault tolerance: If one node fails, the entire system will not stop working, and other nodes can take over its work;

Improve the system's expansion capability: distribute the load to multiple nodes, or increase the number of nodes to improve the system's load capability;

Improve performance: allow clients to access nearby nodes locally to improve access speed.

From the perspective of the transparency of client-side read and write access, there are two types of data replication cluster systems:

Write Master (WriteMaster): Submit the modification to the data to the specified node. Without this restriction, any node can be read. In this case, the client needs to read

Different from writing, commonly known as separation of reading and writing;

Write Any: Modifications to data can be submitted to any node, just like reading. In this case, the client is transparent to the color and changes of the cluster nodes.

For zookeeper, the method it uses is to write arbitrary. By adding more machines, its read throughput and responsiveness are very scalable. Writing, as the number of machines increases

Throughput performance will definitely decrease (this is also the reason it built observer), but the response performance depends on the specific implementation, whether delayed replication maintains final consistency, or is it consistent

That is, copy and respond quickly.

14.Zookeeper operation principle

The core of Zookeeper is the original subcast. This mechanism ensures synchronization between servers. The protocol that implements this mechanism is called the Zab protocol.

The Zab protocol has two modes, which are the recovery mode (primary selection) and the broadcast mode (synchronization). When the service starts or after the leader crashes, Zab goes into recovery

Mode, when the leader is elected, and most of the servers have completed synchronization with the leader's state, the recovery mode ends. State synchronization ensures that the leader and

Server has the same system state.

15. How does zookeeper ensure the consistency of the transaction sequence?

Zookeeper uses an incremental transaction Id to identify, all proposals are added with zxid when they are proposed, zxid is actually a 64-bit number

The word, high 32-bit is epoch (epoch; epoch; world; new era) used to identify whether the leader has changed. If a new leader is produced, the epoch will increase and decrease.

32 bits are used to count up. When a new production proposal is made, it will first issue a transaction execution request to other servers according to the two-stage process of the database.

If more than half of the machines can perform and can be successful, then they will start to perform.

16. Server status under Zookeeper

Each server has three states during its operation:

LOOKING: The current server does not know who the leader is and is searching

LEADING: The current Server is the elected leader

FOLLOWING: The leader has been elected, and the current server is synchronized with it

17. How does zookeeper choose the master leader?

When the leader crashes or the leader loses most of its followers, zk enters the recovery mode. The recovery mode needs to re-elect a new leader so that all

Servers are restored to a correct state. There are two election algorithms for Zk: one is based on basic paxos and the other is based on fast paxos.

of. The default election algorithm of the system is fast paxos.

1. Zookeeper selection process (basic paxos)

(1) The election thread is held by the thread that the current server initiates the election. Its main function is to count the voting results and select the recommended server;

(2) The election thread first initiates an inquiry to all servers (including self);

(3) After the election thread receives the reply, it verifies whether it is a self-initiated query (verifies whether the zxid is consistent), and then obtains the corresponding id (myid) and stores it in the current query object

In the list, finally get the relevant information (id, zxid) of the leader proposed by the opposite party, and store this information in the voting record table of the current election;

(4) After receiving all the server replies, calculate the server with the largest zxid, and set the server related information as the next server to vote;

(5) The thread sets the current server with the largest zxid as the leader recommended by the current server. If the winning server gets n/2 + 1 server votes, set

The currently recommended leader is the winning server, and its own state will be set according to the relevant information of the winning server. Otherwise, continue this process until the leader is elected

Come. Through process analysis, we can conclude that: in order for Leader to be supported by most servers, the total number of servers must be an odd number 2n+1, and the number of surviving servers

No less than n+1. The above process will be repeated after each server is started. In the recovery mode, if the server has just recovered from a crash state or just started, it will also be from the disk

To restore data and session information in the snapshot, zk will record the transaction log and take a snapshot regularly, so that the state can be restored when it is restored.

2. Zookeeper selection process (basic paxos)

The fast paxos process is in the election process, a server first proposes to all servers to become the leader, and when other servers receive the proposal, the epoch is resolved

Conflict with zxid, accept the proposal of the opponent, and then send a message that the proposal is accepted to the opponent, repeat the process, and finally will be able to elect a leader.

18.Zookeeper synchronization process

After selecting the Leader, zk enters the state synchronization process.

Leader waits for server connection;

Follower connects to the leader and sends the largest zxid to the leader;

The leader determines the synchronization point according to the zxid of the follower;

After the synchronization is completed, notify the follower that it has become the uptodate state;

After the follower receives the uptodate message, it can re-accept the client's request for service.

19. Distributed notification and coordination

For system scheduling: the operator sends notifications to actually change the state of a node through the console, and then zk sends these changes to the registered node

All clients of the watcher.

Reporting on the execution status: Each work process creates a temporary node under a certain directory. And carry the progress data of the work, so that the summarized process can be monitored and recorded.

The change of the node obtains the real-time global situation of the work progress.

20. Why is there a leader in the machine?

In a distributed environment, some business logic only needs to be executed by one machine in the cluster, and other machines can share the result, which can greatly reduce duplication of calculations.

Calculate, improve performance, so it is necessary to conduct leader election.

21. How to deal with zk node downtime?

Zookeeper itself is also a cluster, and it is recommended to configure no less than 3 servers. Zookeeper also needs to ensure that when one node goes down, other nodes will continue to provide services.


If one follower goes down, there are still 2 servers to provide access, because the data on Zookeeper has multiple copies, and the data will not be lost;

If a leader goes down, Zookeeper will elect a new leader.

The mechanism of ZK cluster is that as long as more than half of the nodes are normal, the cluster can provide services normally. Only when the ZK node hangs too much, only half or less than half of the nodes can work,

The cluster fails.

and so

A cluster of 3 nodes can suspend 1 node (the leader can get 2 votes> 1.5)

A cluster of 2 nodes cannot suspend any node (the leader can get 1 vote<=1)

22. The difference between zookeeper load balancing and nginx load balancing

The load balance of zk can be adjusted, nginx can only adjust the weight, other controllable need to write plug-ins; but the throughput of nginx is much larger than zk, it should be said

Which way to choose the business.

23.zookeeper watch mechanism

The Watch mechanism official stated: A Watch event is a one-time trigger. When the data set with Watch changes, the server will change this

The changes are sent to clients who have set up Watch to notify them.

Features of Zookeeper mechanism:

1. When a data change is triggered once, a watcher event will be sent to the client, but the client will only receive such information once.

2. Watcher event asynchronously sends watcher notification events from server to client. It is asynchronous. There is a problem. Different clients and servers are different.

Communication is carried out through sockets. Due to network delay or other factors, the client listens to the event at the time when the connection is unavailable. Because of the ordering provided by Zookeeper itself

Guarantee, that is, after the client listens to the event, it will perceive changes in the znode it monitors. Therefore, we cannot expect to be able to monitor every node when using Zookeeper.

Times of change. Zookeeper can only guarantee final consistency, but cannot guarantee consistency.

3. Data monitoring Zookeeper has data monitoring and sub-data monitoring getdata() and exists() to set up data monitoring, getchildren() to set up sub-node monitoring.

4. Register watcher getData, exists, getChildren

5. Trigger watcher create, delete, setData

6. setData() will trigger the data watch set on the znode (if the set is successful). A successful create() operation will trigger the data on the created znode

According to watch, and child watch on other nodes. A successful delete() operation will trigger a znode's data watch and child at the same time

watch (because there are no subnodes), it will also trigger the child watch of its node.

7. When a client connects to a new server, watch will be triggered by any session event. When the connection with a server is lost, it is unable to receive

To watch. When the client reconnects, if necessary, all previously registered watches will be re-registered. Usually this is completely transparent. only at

In a special case, the watch may be lost: for an exist watch of an uncreated znode, if it is created during the disconnection of the client, and subsequently

It was deleted before the client was connected. In this case, the watch event may be lost.

8. Watch is lightweight, it is actually the callback of the local JVM. The server only stores whether there is a boolean type with Watcher set.

Zookeeper: Detailed explanation of distributed architecture, detailed explanation of distributed technology, distributed transaction

1. Detailed explanation of distributed architecture

1. Distributed development history

1.1 Single point centralized

Features: App, DB, and FileServer are all deployed on one machine. And fewer visit requests

1.2 Application service and data service split

Features: App, DB, and FileServer are separately deployed on separate servers. And fewer visit requests

1.3 Use cache to improve performance

Features: Frequently accessed data in the database is stored in the cache server, reducing the number of database visits and reducing the pressure on the database

1.4 Application server cluster

Features: Multiple application servers provide external services at the same time through load balancing, solving the problem of the maximum processing capacity of a single server

1.5 Database read and write separation

Features: The database is designed with read-write separation (master-slave) to solve the processing pressure of the database

1.6 Reverse proxy and CDN acceleration

Features: Use reverse proxy and CDN to speed up system access

1.7 Distributed file system and distributed database

Features: The database uses a distributed database, and the file system uses a distributed file system

With the development of business, the final separation of database reading and writing will not be able to meet the demand. Distributed database and distributed file system are needed to support it.

Distributed database is the last method after the database is split. It is only used when the scale of a single table is very large. The more commonly used database split segment is business sub-database, which will not be used.

The databases of the same business are deployed on different machines

⼆ Detailed explanation of distributed technology

1. Concurrency

2. Distribution

The large task is split into multiple tasks and deployed to multiple machines to provide external services

3. Lack of global clock

Time must be unified

4. Equivalence

A service deployed on multiple machines is the same, there is no difference

5. Malfunctions will definitely occur

The hard disk is broken and the CPU is burned...

Three, distributed transactions


Atomicity: All operations in a transaction are either completed or not completed at all, and will not end in some middle link.

If an error occurs during the execution of the transaction, it will be restored (Rollback) to the state before the transaction started, as if the transaction had never been executed.

Consistency: Before the start of the transaction and after the end of the transaction, the integrity of the database is not destroyed. The information written in this statement must be in full compliance with all

The default rules for the data, including the accuracy and continuity of the data, and the subsequent database can automatically complete the predetermined tasks.

If A has 500 yuan, B has 300 yuan, and A transfers 100 to B, no matter what, the sum of A and B is always 800 yuan.

Isolation: The database allows multiple concurrent transactions to read, write and modify its data at the same time. Isolation can prevent concurrent execution of multiple transactions.

Cross-execution results in inconsistent data. Transaction isolation is divided into different levels, including read uncommitted (Read uncommitted), read commit (read

committed), repeatable read and serializable.

Durability: After the transaction is completed, the modification of the data is permanent, and it will not be lost even if the system fails.

2. 2P/3P

2P = Two Phase commit (RDBMS (Relational Database Management System) is often this mechanism to ensure consistency)

3P = Three Phase commit

Note: 2P/3P is to ensure the ACID of the transaction (original, consistency, isolation, durability)

2.1 Two stages of 2P

Phase 1: Submit a transaction request (voting phase) to ask whether the transaction can be submitted

Phase 2: Perform transaction commit (commit, rollback) The real commit transaction

2.2 The three stages of 3P

Phase 1: Whether to submit-ask if you can do transaction submission

Phase 2: Pre-commit-pre-commit transaction

Phase 3: Perform transaction commit (commit, rollback) to actually commit the transaction

Explanation: 3P splits the 2P stage into the first two stages

3. CAP theory

Consistency: The data of the distributed database remains consistent

Availability (Availability): any node is down, other nodes can continue to provide services to the outside world

Partition tolerance (network partition) Partition tolerance: The machine where the database is located is broken. For example, if the hard disk is broken and the data is lost, a new machine can be added.

Then synchronize the backup data from other normal machines

The characteristics of CAP theory: CAP can only satisfy 2 of them

CA (abandon P): put all data on a node. Satisfying consistency and usability.

AP (Abandon C): Give up the consistency and use the final consistency to ensure.

CP (Abandon A): Once the system encounters a failure, the affected server needs to wait for a period of time and cannot provide external services during the recovery period.

To illustrate the CAP theory:

There are 3 machines, 3 databases, 2 tables, and the data is the same

Machine1-db1-tbl_person, tbl_order

Machine2-db2-tbl_person, tbl_order

Machine3-db3-tbl_person, tbl_order

1) When inserting data into the tables tbl_person and tbl_order of db1 of machine1, the inserted data should be synchronized to machine2 and machine3 at the same time.

Is consistent

2) When one of the machines is down, you can continue to provide services to the outside world, and restart the down machine to continue the service. This is the availability

3) When the machine of machine1 is broken and all the data is lost, there will be no problem, because there are still data on machine2 and machine3, add another machine

Machine4, synchronize the backup data of one of machine2 and machine3. This is the partition fault tolerance

4. BASE theory

Basically available (bascially available), soft state (soft state), eventually consistent (Eventually consistent)

Basic usability: When the distributed system fails, part of the usability is allowed to be lost (service degradation, page degradation)

Soft state: Allow intermediate states in distributed systems. And the intermediate state does not affect the usability of the system.

1. This intermediate state refers to the final consistency that the data update between different data replications can be delayed.

2. As in the example of CAP theory, when inserting data into tables tbl_person and tbl_order of db1 of machine1, the inserted data should be synchronized to

Machine2, machine3, when there is a problem with the network of machine3, the synchronization fails, but after a while the network is restored, the synchronization is successful, and the synchronization fails.

It is called the soft state, because the synchronization is successful in the end.

Ultimate consistency: data replications reach consistency over a period of time.

5. Paxos algorithm

5.1 Before introducing the Paxos algorithm, let's take a look at a small story

Byzantine Generals Question

The Byzantine Empire was the Eastern Romanian Empire from the 5th to 15th centuries, and Byzantium was now Istanbul. We can imagine that the Byzantine army has many divisions, stationed in the enemy

Outside the city, every minute is commanded by its own generals. Suppose there are 11 generals, and the generals can only communicate with correspondents. After observing the enemy, loyal generals must control

Make a unified plan of action-attack or retreat. However, these generals have traitors, and they don’t want loyal generals to reach an agreement, because it affects the unification

Formulation and dissemination of plans.

The problem is: the generals must have an agreement so that all loyal generals can reach an agreement, and a small number of traitors cannot make the loyal generals make the wrong plan

-Make some generals attack and others retreat.

Suppose there are 9 loyal generals, 5 judges to attack, 4 judges to retreat, and 2 spies maliciously judge to retreat. Although the result is a wrong retreat, this situation is completely allowed.

Xu's. Because these 11 generals still maintain consistency.

to sum up:

1) 11 generals attacked the city

2) Simultaneous attack (proposal, resolution), simultaneous retreat (proposal, resolution)

3) Regardless of retreat or offensive, half of the generals must be unanimous before they can execute

4) The general has a traitor, which will interfere with the resolution

5.2 Let's introduce the Paxos algorithm below

Mike Burrows, the author of Google Chubby, said that there is only one consistent algorithm in the world, and that is Paxos. Other algorithms are defective.

Paxos: Majority resolution (final solution to consistency issues)

Paxos algorithm has three colors: Proposer, Acceptor, Learner

Proposer: Submitter (proposal submitter)

Submit the bill (judge whether it is more than half), submit the approval bill (judge whether it is more than half)

Acceptor: Recipient (Recipient of the proposal)

Accept the proposal or reject the proposal and respond to the proposer (promise)

Learner: learner (soy sauce)

If the bill is produced, study the bill.

Setting 1: If the Acceptor does not accept the proposal, then he must accept the first proposal

Setting 2: Each proposal must have a number, and the number can only be increased and cannot be repeated. It gets bigger as you go back.

Setting 3: Accept the proposal with a large number, if the proposal number is lower than before, then it will not be accepted

Setting 4: There are two kinds of bills (submitted bills, approved bills)

1) Prepare stage (proposal submission)

a) Proposer wants proposal V. First issue a Prepare request to most Acceptors. The content of the Prepare request is sequence number K

b) After the Acceptor receives the Prepare request with the number K, it checks whether the Prepare request has been processed by itself.

c) If the Acceptor has not accepted any Prepare request, then use OK to reply to the Proposer, which means that the Acceptor must accept the first proposal received (set


d) Otherwise, if the Acceptor has previously accepted any Prepare request (such as MaxN), then compare the bill number, if K<MaxN, use reject or error

Reply to Proposer

e) If K>=MaxN, then check whether there is an approved proposal before, if not, use OK to reply to Proposer and record K

f) If K>=MaxN, then check whether there is an approved motion before, and if there is, reply the approved motion number and motion content (such as: <AcceptN, AcceptV>,

AcceptN is the number of the approved motion, AcceptV is the content of the approved motion)

2) Accept phase (approval phase)

a) Proposer received more than half of the replies from Acceptor, and the replies were all OK, and there were no approved motion numbers and motion content attached. Then Proposer continues

Submit the request for approval, but at this time, the proposal number K and the proposal content V will be submitted together (<K, V> this data format)

b) Proposer received more than half of the replies from Acceptor, and the replies were all OK with the approved proposal number and proposal content (<pok, proposal number, proposal content)

>). Then Proposer finds the one with more than half of all replies (assumed to be <pok, AcceptNx, AcceptVx>) as the submission approval request (the request is

<K, AcceptVx>) is sent to Acceptor.

c) Proposer did not receive more than half of the response from Acceptor, then modify the proposal number K to K+1, and resend the number to Acceptors (repeat the Prepare stage

the process of)

d) Acceptor receives the Accept request from Proposer, if the number K<MaxN, it will not respond or reject.

e) Acceptor receives the Accept request from Proposer, if the number K>=MaxN, approve the proposal, and set the approved proposal as <K, accept the proposal

No., accept the content of the proposal>, reply to Proposer.

f) After a period of time, Proposer responds to the Accept response received by the user. If more than half of the accepted response is received, the process will be terminated (representative proposal is approved), and Leaner will be notified at the same time.

Study the motion.

g) After a period of time, the Proposer responds to the Accept response received by the user. If it does not exceed half of the number, the proposal number will be modified to enter the Prepare stage again.

tips: Welcome to follow WeChat official account: Java backend for more push notifications.

5.3 Paxos example

Example 1: The proposed scenario


proposer: Staff 1, Staff 2

Acceptor: General 1, General 2, General 3 (decision maker)

1) Staff 1 initiates a proposal and sends a signal corps to 3 generals with the content (No. 1);

2) The 3 generals received the suggestion of Staff 1, because they have not saved any serial number before, so save (No. 1) to avoid forgetting; at the same time, let the signal soldier bring the letter back.

The content is (ok);

3) Staff 1 received replies from at least 2 generals, and once again sent signal soldiers to bring letters to 3 generals, with the content (No. 1, attack time 1);

4) When the 3 generals received the time of staff 1, save (No. 1, attack time 1) to avoid forgetting; at the same time, let the correspondents take the letter back with the content (Accepted);

5) Staff 1 received at least 2 generals (Accepted) content, confirming that the attack time has been accepted by the generals;

6) Staff 2 initiated a proposal and sent a signal corps to 3 generals with the content (No. 2);

7) 3 generals received the proposal of Staff 2, because (No. 2) is larger than (No. 1), so save (No. 2) to avoid forgetting; because they have accepted the staff before

The proposal of 1, so let the signal soldier take the letter back, the content is (number 1, attack time 1);

8) Staff 2 received replies from at least 2 generals. Since the reply brought the content of the proposal of Staff 1 that had been accepted, Staff 2 did not propose a new attack time and accepted the staff.

1 the time of proposal;

Example 2: Crossing scene


proposer: Staff 1, Staff 2

Acceptor: General 1, General 2, General 3 (decision maker)

1) Staff 1 initiates a proposal and sends a signal corps to 3 generals with the content (No. 1);

2) The situation of the 3 generals is as follows

a) General 1 and General 2 received the proposal of Staff 1, and General 1 and General 2 recorded (No. 1). If other staff members propose a smaller number, they will be rejected; at the same time, let

The messenger takes the letter back, and the content is (ok);

b) The signal soldier responsible for notifying General 3 was arrested, so General 3 did not receive the proposal of Staff 1;

3) Staff 2 also initiated a proposal at the same time, sending signal soldiers to bring letters to 3 generals, with the content (No. 2);

4) The situation of the 3 generals is as follows

a) General 2 and General 3 received the proposal of Staff 2, and General 2 and General 3 recorded (No. 2). If other staff members propose a smaller number, they will be rejected;

The messenger takes the letter back, and the content is (ok);

b) The signal officer responsible for notifying General 1 was arrested, so General 1 did not receive the proposal of Staff 2;

5) Staff 1 received replies from at least 2 generals, and once again sent a messenger to the 2 generals who responded, with the content (No. 1, attack time 1);

6) The situation of the 2 generals is as follows

a) General 1 received (No. 1, attack time 1), which is the same as the number saved by itself, so save (No. 1, attack time 1); at the same time, let the communicator carry the message

Go back, the content is (Accepted);

b) General 2 received (No. 1, attack time 1), because (No. 1) is smaller than the saved (No. 2), so let the correspondent take the letter back, the content is

(Rejected, No. 2);

7) Staff 2 received replies from at least 2 generals, and once again sent a messenger to bring a letter to the 2 generals who responded with the content (No. 2, attack time 2);

8) General 2 and General 3 received (No. 2, attack time 2), which is the same as the number saved by themselves, so save (No. 2, attack time 2) and let the communication

Bing took a letter back, and the content was (Accepted);

9) Staff 2 received at least 2 generals (Accepted) content, confirming that the attack time has been accepted by the majority;

10) Staff 1 only received the content of 1 general (Accepted), and at the same time received one (Rejected, No. 2); Staff 1 re-initiated the proposal and sent a messenger to bring a letter to

3 generals, the content is (No. 3);

11) The situation of the 3 generals is as follows

a) General 1 received the proposal of Staff 1, because (No. 3) is larger than the previously saved (No. 1), so save (No. 3); because General 1 has accepted Staff 1

The previous proposal, so let the signal soldier take the letter back, the content is (No. 1, attack time 1);

b) General 2 received the proposal of Staff 1, because (No. 3) is larger than the previously saved (No. 2), so save (No. 3); because General 2 has accepted Staff 2

The proposal, so let the signal soldier take the letter back, the content is (No. 2, attack time 2);

c) The signal soldier responsible for notifying General 3 was arrested, so General 3 did not receive the proposal of Staff 1;

12) Staff 1 received replies from at least 2 generals. Compared with the numbers of the two replies, choose the time of attack corresponding to the number as the latest proposal; Staff 1 dispatched again

The letter soldier brought a letter to the 2 generals who responded, with the content (No. 3, attack time 2);

13) General 1 and General 2 received (No. 3, attack time 2), which is the same as the number saved by themselves, so save (No. 3, attack time 2), and let the communicator bring

Letter back, the content is (Accepted);

14) Staff 1 received at least 2 generals (accepted) content, confirming that the attack time has been accepted by the majority.

4. Zookeeper ZAB protocol

Zookeeper Automic Broadcast (ZAB), the original Zookeeper broadcast, is the classic implementation of Paxos

the term:

quorum: a collection of more than half of the cluster

1. The nodes in ZAB (zookeeper) are divided into four states

looking: The state of the election leader (under crash recovery state)

following: follower (follower) status, obey Leader command

leading: The current node is Leader, responsible for coordinating the work.

observing: observer (observer), does not participate in the election, read-only node.

2. Two modes in ZAB (how does ZK conduct elections)

Crash recovery, news broadcast

1) Crash recovery

The leader is dead, and a new leader needs to be elected

a. Each server has a ballot, such as (3,9), and the ballot is voted by itself.

b. After each server has finished voting, it is then voted separately to other available servers. For example, vote (3, 9) of Server3 to Server4 and Server5, and so on.

c. Compared with voting, it is more logical: Zxid is preferred, and myid is compared when Zxid is the same. When compared to Zxid, be the leader; when compared to myid, be the leader

d. Change the server status (crash recovery->data synchronization, or crash recovery->message broadcast)

Supplementary explanation of related concepts:

epoch period value

acceptedEpoch (synonym: year number): The follower has accepted the leader's proposal to change the year number (newepoch).

currentEpoch (anaphora: the current era): the current era

lastZxid: The zxid of the most recently received proposal in history (the highest value)

history: The log of the transaction proposal received by the current node

Zxid data structure description:

cZxid = 0x10000001b

64-bit data structure

High 32 bits: 10000

Leader's cycle number + myid combination

Low 32 bits: 001b

The self-increasing sequence of the transaction (monotonically increasing sequence) as long as the client has a request, +1

When a new leader is produced, the most significant transaction Zxid in the local log is taken from the leader server, and epoch+1 is read from the bottom as a new epoch, and

Set the low 32 position to 0 (guarantee id is absolutely incremented)

2) News broadcast (similar to 2P submission)

a. After the Leader accepts the request, it assigns the request to the global unique 64-bit auto-incremented Id (zxid).

b. Send zxid as a proposal to all followers.

c. After all the followers accept the proposal and want to write the proposal to the hard disk, they will reply to the leader with an ACK (OK).

d. When the leader receives the legal number (more than half) of Acks, the leader sends a commit command to all followers.

e.follower executes the commit command.

Note: At this stage, the ZK cluster officially provides services to the outside world, and the leader can broadcast messages. If a new node is added, it needs to be synchronized.

3) Data synchronization

a. Take out the leader lastZxid (from the local log)

b. Find the data corresponding to zxid and synchronize (the data synchronization process ensures that all followers are consistent)

c. Only when the full quorum is completed synchronously, can the quasi-leader become the real leader