Interviewer: How do you diagnose a bottleneck on the Kafka message-sending side? (Targeted tuning is the right way to start performance optimization)

Mastering one or two mainstream Java middleware products is a must-have skill for large tech companies such as BAT. Here is a Java middleware learning path to help you move forward in your career.

Java advancement ladder: a growth path and learning materials to help you break into the middleware field

When the message-sending side runs into a performance bottleneck, is there a way to determine correctly where the bottleneck is, and how can it be tuned in a targeted manner?

1. Monitoring metrics of the Kafka message sender

In fact, Kafka thought of this for us long ago. Kafka provides a rich set of monitoring metrics and exposes them through JMX. The metrics provided on the client side are shown in the following figure:

[Figure: client-side monitoring metrics exposed through JMX]

The main monitoring metrics fall into the following categories:

  • producer-metrics
    Metrics of the message sender. Its child nodes are all the producers in the process.
  • producer-node-metrics
    Metrics of each sender, broken down by broker node.
  • producer-topic-metrics
    Metrics of the sender, broken down by topic.

There are many metrics related to the Kafka producer, and this article does not list them all.
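As a rough illustration of how these three groups can be located over JMX, here is a minimal sketch that uses only the JDK's `javax.management` API. It assumes the Java client registers its MBeans under the `kafka.producer` domain with a `type` key equal to the group name; the class and method names (`ProducerMBeanLookup`, `patternFor`, `find`, `matches`) are hypothetical and not part of Kafka.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Sketch: locate Kafka producer metric MBeans registered in the current JVM. */
final class ProducerMBeanLookup {

    /** Builds a pattern such as kafka.producer:type=producer-metrics,* (assumed naming). */
    static ObjectName patternFor(String metricGroup) {
        try {
            return new ObjectName("kafka.producer:type=" + metricGroup + ",*");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid metric group: " + metricGroup, e);
        }
    }

    /** Returns the names of all registered MBeans that belong to the given group. */
    static Set<ObjectName> find(MBeanServer server, String metricGroup) {
        return server.queryNames(patternFor(metricGroup), null);
    }

    /** True if the given MBean name belongs to the given metric group. */
    static boolean matches(String metricGroup, String mbeanName) {
        try {
            return patternFor(metricGroup).apply(new ObjectName(mbeanName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid MBean name: " + mbeanName, e);
        }
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        String[] groups = {"producer-metrics", "producer-node-metrics", "producer-topic-metrics"};
        for (String group : groups) {
            // Prints nothing unless a producer is running inside this JVM.
            for (ObjectName name : find(server, group)) {
                System.out.println(group + " -> " + name);
            }
        }
    }
}
```

Run inside the same JVM as a producer, this lists one MBean per producer, per producer and broker node, and per producer and topic. Querying a remote process would go through a JMX connector instead of the platform MBean server.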

1.1 producer-metrics

producer-metrics is a very important monitoring item on the sender side, as shown in the figure below:

[Figure: the producer-metrics monitoring items]

Its key items are explained as follows:

batch-size-avg
The average size of a batch (ProducerBatch) when the Sender thread actually sends messages.

batch-size-max
The maximum size of a batch when the Sender thread sends messages.

Practical guidance: I think both of these metrics are well worth collecting. If the value is much smaller than the configured batch.size and throughput falls short of expectations, you can increase the relevant batching setting appropriately.
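The comparison in this guidance can be sketched as a small check. The 50% threshold and the names `BatchFillCheck`, `fillRatio` and `batchesUnderfilled` are assumptions made for illustration; what counts as "much smaller" depends on the workload.

```java
/** Sketch: compare the observed average batch size with the configured batch.size. */
final class BatchFillCheck {

    /** Hypothetical threshold: an average batch below 50% of batch.size counts as underfilled. */
    static final double LOW_FILL_RATIO = 0.5;

    /** Fraction of batch.size that an average batch actually uses. */
    static double fillRatio(double batchSizeAvg, int configuredBatchSize) {
        if (configuredBatchSize <= 0) {
            throw new IllegalArgumentException("batch.size must be positive");
        }
        return batchSizeAvg / configuredBatchSize;
    }

    /** True when batches are, on average, sent well before they fill up. */
    static boolean batchesUnderfilled(double batchSizeAvg, int configuredBatchSize) {
        // An average with no samples may be reported as NaN: give no verdict in that case.
        if (Double.isNaN(batchSizeAvg)) {
            return false;
        }
        return fillRatio(batchSizeAvg, configuredBatchSize) < LOW_FILL_RATIO;
    }
}
```

For example, an average of 4,096 bytes against a batch.size of 16,384 gives a fill ratio of 0.25, which this sketch flags as underfilled.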

batch-split-rate
Kafka provides a mechanism to split a large ProducerBatch into smaller ones: if a client's ProducerBatch exceeds the maximum message size allowed by the server, the client splits the batch and resends it. This metric records the number of splits per second.

batch-split-total
The total number of batch splits.

Reminder: Based on my reading of this part of the source code, I think splitting a ProducerBatch is of little practical value, because the capacity of each newly allocated ProducerBatch equals batch.size, and a batch that does not exceed that size is not split. I suspect this feature may not actually achieve its intended splitting.

Practical guidance: If the value is not 0, the message sizes configured on the server and on the client do not fit together. The batch.size set on the client should be smaller than the max.message.bytes set on the server, whose default is about 1,000,000 bytes (roughly 1 MB).

buffer-available-bytes
The number of bytes currently available in the sender's buffer.

buffer-total-bytes
The total size of the sender's buffer. The default is 32 MB (33,554,432 bytes).

Practical guidance: If the number of bytes remaining in the buffer stays low, evaluate whether the buffer size is appropriate and whether the Sender thread has hit a bottleneck, and from there consider whether the network or the broker has hit a bottleneck.
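Because the guidance is about a value that stays low rather than a single reading, a check over several consecutive samples is more useful than one snapshot. The following is a minimal sketch; the class name `BufferHeadroomCheck`, its methods, and the idea of a minimum free ratio are assumptions for illustration.

```java
/** Sketch: judge buffer pressure from buffer-available-bytes and buffer-total-bytes. */
final class BufferHeadroomCheck {

    /** Fraction of the buffer that is currently in use. */
    static double usedRatio(long availableBytes, long totalBytes) {
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        return 1.0 - (double) availableBytes / totalBytes;
    }

    /**
     * True only if every sample in the series is below the given free ratio,
     * that is, the remaining space stays low instead of dipping once.
     */
    static boolean persistentlyLow(long[] availableSamples, long totalBytes, double minFreeRatio) {
        if (availableSamples.length == 0) {
            return false; // no data, no verdict
        }
        for (long available : availableSamples) {
            if ((double) available / totalBytes >= minFreeRatio) {
                return false;
            }
        }
        return true;
    }
}
```

With the default 33,554,432-byte buffer and 8,388,608 bytes available, `usedRatio` returns 0.75.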


bufferpool-wait-time-total
The total time the client has been blocked while requesting memory from the buffer to create a ProducerBatch.

Practical guidance: If the value stays above 0, sending has hit a bottleneck. You can appropriately reduce the value of the relevant parameter so that messages get a chance to be processed in a more timely manner.

produce-throttle-time-avg
The average time that message sending was throttled by the broker.

produce-throttle-time-max
The maximum time that message sending was throttled by the broker.

The total time the I/O thread spends processing I/O reads and writes.

io-time-ns-avg
The average time (in nanoseconds) spent on I/O per selector call.

io-wait-time-ns-avg
The average time (in nanoseconds) the I/O thread spends waiting for a socket to become ready for reads or writes.

iotime-total
The total I/O processing time.

network-io-rate
The number of network operations (reads or writes) per second across all of the client's connections.

network-io-total
The total number of network operations (reads or writes) across all of the client's connections.

1.2 General metrics

In addition to the sender metrics above, Kafka also has some general monitoring metrics. They are collected along three dimensions: message sender, node, and topic.


The main dimensions are explained as follows:

  • producer-metrics
    The sender dimension.
  • producer-node-metrics
    The sender and broker node dimension.
  • producer-topic-metrics
    The sender and topic dimension.

The indicators described below are counted in different dimensions, but their meanings are the same, so they will be explained in a unified way.

incoming-byte-rate
The inbound traffic rate, that is, the number of bytes received per second.

incoming-byte-total
The total number of bytes received.

outgoing-byte-total
The total number of bytes sent.

request-latency-avg
The average latency of message sending.

request-latency-max
The maximum latency of message sending.

Practical guidance: request-latency-avg and request-latency-max reflect the latency of message sending. If the latency is too high, the Sender thread has hit a bottleneck when sending messages. It is recommended to compare this value with the corresponding configured value; if it is significantly smaller, batch.size can be adjusted appropriately to improve throughput.

request-rate
The number of requests sent per second (TPS).

request-size-avg
The average size of the requests sent.

request-size-max
The maximum size of a single request sent by the Sender thread.

Practical guidance: If this value is below max.request.size, few messages are backing up on the client. If bottlenecks show up in other dimensions, adjusting batch.size appropriately can effectively improve throughput.

The total number of bytes sent by the request

response-rate
The number of server responses received per second (TPS).

response-total
The total number of responses received from the server.
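Since the same metric is reported per broker node, comparing nodes is a quick way to see whether one broker is dragging the sender down. The sketch below reads one attribute from every node-level MBean and picks the highest value. It assumes MBeans named `kafka.producer:type=producer-node-metrics,client-id=...,node-id=...`; the class `NodeLatencyReport` and its methods are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Sketch: compare a node-level producer metric across broker nodes. */
final class NodeLatencyReport {

    /** Reads one numeric attribute (for example request-latency-avg) from every node MBean. */
    static Map<String, Double> readPerNode(MBeanServer server, String attribute) throws JMException {
        Map<String, Double> result = new TreeMap<>();
        ObjectName pattern = new ObjectName("kafka.producer:type=producer-node-metrics,*");
        for (ObjectName name : server.queryNames(pattern, null)) {
            Object value = server.getAttribute(name, attribute);
            if (value instanceof Number) {
                String key = name.getKeyProperty("client-id") + "/" + name.getKeyProperty("node-id");
                result.put(key, ((Number) value).doubleValue());
            }
        }
        return result;
    }

    /** Returns the key with the highest value, ignoring NaN readings. */
    static Optional<String> slowest(Map<String, Double> latencyByNode) {
        String worst = null;
        double worstValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : latencyByNode.entrySet()) {
            double value = entry.getValue();
            if (!Double.isNaN(value) && value > worstValue) {
                worst = entry.getKey();
                worstValue = value;
            }
        }
        return Optional.ofNullable(worst);
    }
}
```

A typical call would be `slowest(readPerNode(ManagementFactory.getPlatformMBeanServer(), "request-latency-avg"))` from inside the producer's JVM.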

2. Collecting the monitoring metrics

Although Kafka has many built-in monitoring metrics, they are kept in memory by default. To prevent monitoring data from growing without bound and causing an out-of-memory error, it is usually stored in sliding windows: only the data from the most recent period is kept, and older data is overwritten as the window rolls forward.
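The rolling behaviour described above can be pictured with a small time-based window. This is an illustrative sketch only, not Kafka's implementation (the client's own windows are governed by settings such as `metrics.sample.window.ms` and `metrics.num.samples`); the class `SlidingWindowAverage` is hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: keep only the samples from the most recent window and average them. */
final class SlidingWindowAverage {

    private static final class Sample {
        final long timeMs;
        final double value;

        Sample(long timeMs, double value) {
            this.timeMs = timeMs;
            this.value = value;
        }
    }

    private final long windowMs;
    private final Deque<Sample> samples = new ArrayDeque<>();

    SlidingWindowAverage(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    /** Adds a sample and drops everything that has fallen out of the window. */
    void record(long nowMs, double value) {
        evict(nowMs);
        samples.addLast(new Sample(nowMs, value));
    }

    /** Average of the samples still inside the window, or NaN if there are none. */
    double average(long nowMs) {
        evict(nowMs);
        if (samples.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0;
        for (Sample sample : samples) {
            sum += sample.value;
        }
        return sum / samples.size();
    }

    /** Number of samples currently retained. */
    int size() {
        return samples.size();
    }

    private void evict(long nowMs) {
        while (!samples.isEmpty() && samples.peekFirst().timeMs <= nowMs - windowMs) {
            samples.removeFirst();
        }
    }
}
```

Memory use is bounded by the number of samples that fit in one window, which is exactly why anything older has to be exported elsewhere if it is to be kept.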

Therefore, to display these metrics more intuitively, the information needs to be collected periodically and written to persistent storage such as a database, so that curves can be drawn from the historical data. The desired effect is shown in the following figure:

[Figure: metric curves drawn from historical data]

The basic architecture of the monitoring collection system is shown in the figure below:

[Figure: architecture of the monitoring collection system]

mq-collect should be placed in the producer SDK. Through the mq-collect library, the collected information is uploaded asynchronously and periodically to the time-series database InfluxDB and then displayed through the mq-portal portal. Each producer client's metrics are shown visually, which makes the monitoring data visible and provides a basis for performance optimization.
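The internals of mq-collect are not shown in this article, so the following is only a sketch of what the periodic, asynchronous part might look like. It formats samples as InfluxDB line protocol and hands them to a caller-supplied sink. The class `MqCollectSketch`, the measurement name `kafka_producer`, and the `sampler` and `sink` parameters are all assumptions; the actual write to InfluxDB is left to the sink.

```java
import java.math.BigDecimal;
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Sketch: sample producer metrics on a timer and format them as InfluxDB line protocol. */
final class MqCollectSketch {

    /** Escapes commas, equals signs and spaces in tag keys, tag values and field keys. */
    static String escapeKey(String s) {
        return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    /** Builds one line: measurement,tags fields timestamp. Empty if no finite field exists. */
    static Optional<String> toLine(String measurement, Map<String, String> tags,
                                   Map<String, Double> fields, long timestampNanos) {
        StringBuilder line = new StringBuilder(measurement.replace(",", "\\,").replace(" ", "\\ "));
        for (Map.Entry<String, String> tag : new TreeMap<>(tags).entrySet()) {
            line.append(',').append(escapeKey(tag.getKey())).append('=').append(escapeKey(tag.getValue()));
        }
        StringJoiner fieldSet = new StringJoiner(",");
        for (Map.Entry<String, Double> field : new TreeMap<>(fields).entrySet()) {
            double value = field.getValue();
            if (Double.isNaN(value) || Double.isInfinite(value)) {
                continue; // skip readings that cannot be written as a number
            }
            fieldSet.add(escapeKey(field.getKey()) + "=" + BigDecimal.valueOf(value).toPlainString());
        }
        if (fieldSet.length() == 0) {
            return Optional.empty();
        }
        return Optional.of(line.append(' ').append(fieldSet).append(' ').append(timestampNanos).toString());
    }

    /** Samples on a daemon thread at a fixed rate, so the business threads are never blocked. */
    static ScheduledExecutorService start(Supplier<Map<String, Double>> sampler, Consumer<String> sink,
                                          String clientId, long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "mq-collect");
            thread.setDaemon(true);
            return thread;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                long nowNanos = TimeUnit.MILLISECONDS.toNanos(System.currentTimeMillis());
                toLine("kafka_producer", Collections.singletonMap("client-id", clientId),
                        sampler.get(), nowNanos).ifPresent(sink);
            } catch (RuntimeException e) {
                // Swallowed on purpose: an uncaught exception would cancel all future runs.
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Keeping the formatter separate from the scheduler makes the output easy to check on its own, and the sink can batch lines before sending them so that collection adds little overhead to the producer.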

Okay, that's the end of this article. A follow, a like, and a comment are my biggest encouragement.
