Summary of Linux TCP queue related parameters

When tuning the performance of network applications on Linux, we usually adjust TCP-related kernel parameters, especially those that control buffers and queues. Many articles tell you which parameters to change, but we often know what to change without knowing why. After copying the settings a few times, we quickly forget them or confuse their meanings.

From the server's point of view, I classify the parameters along three paths: connection establishment, packet reception, and packet transmission.

One, connection establishment

First, a quick look at how a connection is established. The client sends a SYN packet to the server; the server replies with SYN+ACK and places the connection, now in the SYN_RECV state, in the semi-connection queue (also known as the half-open or SYN queue). The client then returns an ACK to complete the three-way handshake, and the server moves the connection, now in the ESTABLISHED state, to the accept queue, where it waits for the application to call accept(). Two queues are therefore involved in establishing a connection:

  • Semi-connection queue: holds connections in the SYN_RECV state. Its length is set by net.ipv4.tcp_max_syn_backlog.
  • Accept queue: holds connections in the ESTABLISHED state. Its length is min(net.core.somaxconn, backlog), where backlog is the value we pass when creating a ServerSocket(int port, int backlog). It is eventually handed to the listen() call declared in <sys/socket.h>: int listen(int sockfd, int backlog). If the backlog we set is greater than net.core.somaxconn, the accept queue length is capped at net.core.somaxconn.
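The min(net.core.somaxconn, backlog) rule can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the fallback value of 128 is an assumed placeholder used when /proc/sys/net/core/somaxconn cannot be read (for example on a non-Linux machine), not a statement about any particular kernel's default.

```python
from pathlib import Path

SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(default: int = 128) -> int:
    """Return net.core.somaxconn, or `default` (an assumed placeholder) if unreadable."""
    try:
        return int(SOMAXCONN_PATH.read_text().strip())
    except (OSError, ValueError):
        return default


def accept_queue_length(backlog: int, somaxconn: int) -> int:
    """Effective accept queue length: the listen() backlog capped by somaxconn."""
    return min(backlog, somaxconn)


if __name__ == "__main__":
    limit = read_somaxconn()
    for requested in (50, 1024, 65535):
        print(requested, "->", accept_queue_length(requested, limit))
```

Asking listen() for a larger backlog than net.core.somaxconn therefore has no effect until the sysctl itself is raised.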

In addition, to cope with SYN flooding (a client sends only SYN packets to start handshakes and never replies with the ACK that would complete them, which fills the server's semi-connection queue so that it cannot process normal handshake requests), Linux implements a mechanism called SYN cookies, controlled by net.ipv4.tcp_syncookies; set it to 1 to enable it. Put simply, a SYN cookie encodes the connection information in the ISN (initial sequence number) returned to the client. The server then does not need to store the half-open connection in the queue; instead it recovers the connection information from the ISN carried back in the client's subsequent ACK and completes the connection. This prevents attack SYN packets from filling the semi-connection queue. Handshakes that the client never completes are simply ignored.
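To make the idea of "encoding connection information in the ISN" concrete, here is a toy sketch. It is not the Linux kernel's algorithm: the bit layout (5 bits of time slot, 3 bits of MSS index, 24 bits of keyed hash) is loosely modelled on the commonly described SYN cookie layout, and the secret, the MSS table, and all function names are assumptions made up for illustration. In a real handshake the client's ACK number would be the cookie plus one.

```python
import hashlib
import hmac

SECRET = b"demo-secret"                # hypothetical server-side secret
MSS_TABLE = (536, 1300, 1440, 1460)    # illustrative MSS values, selected by a 3-bit index


def _mac24(conn, client_isn, slot, mss_idx):
    """24-bit keyed hash over the 4-tuple, client ISN, time slot and MSS index."""
    msg = repr((conn, client_isn, slot, mss_idx)).encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big")


def make_cookie(conn, client_isn, mss, slot):
    """Build the 32-bit ISN the server would place in its SYN+ACK."""
    # Pick the largest table entry that does not exceed the client's MSS.
    mss_idx = max((i for i, m in enumerate(MSS_TABLE) if m <= mss), default=0)
    return ((slot % 32) << 27) | (mss_idx << 24) | _mac24(conn, client_isn, slot, mss_idx)


def check_cookie(cookie, conn, client_isn, now_slot):
    """Return the recovered MSS if the cookie is valid and recent, else None."""
    slot_bits = cookie >> 27
    mss_idx = (cookie >> 24) & 0x7
    mac = cookie & 0xFFFFFF
    if mss_idx >= len(MSS_TABLE):
        return None
    for slot in (now_slot, now_slot - 1):  # accept the current and the previous slot
        if slot % 32 == slot_bits and _mac24(conn, client_isn, slot, mss_idx) == mac:
            return MSS_TABLE[mss_idx]
    return None


conn = ("10.0.0.2", 40000, "10.0.0.1", 80)   # made-up client/server addresses
cookie = make_cookie(conn, client_isn=12345, mss=1460, slot=100)
```

The point of the sketch is that the server keeps no per-connection state between the SYN and the ACK: everything it needs is either recomputed from the packet or carried in the 32-bit cookie.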

Two, receiving data packets

Let's look at the path a received packet takes. It passes through three layers from bottom to top: the network card driver, kernel space, and finally the application in user space. The Linux kernel uses the sk_buff (socket kernel buffers) data structure to describe a packet. When a new packet arrives, the NIC (network interface controller) uses its DMA engine to place the packet into kernel memory by way of the RingBuffer. The RingBuffer has a fixed size; it does not contain the packet data itself, only descriptors that point to sk_buff structures. When the RingBuffer is full, new packets are discarded. Once a packet has been received successfully, the NIC raises an interrupt, and the kernel's interrupt handler passes the packet to the IP layer. After IP-layer processing, the packet is placed in a queue to await processing by the TCP layer. Each packet goes through a series of complex steps in the TCP layer, updates the TCP state machine, and finally arrives in the recvBuffer, where it waits for the application to read it.

One thing to note: when the packet arrives in the recvBuffer, TCP returns an ACK. A TCP ACK means only that the packet has been received by the operating system kernel; it does not guarantee that the application has received the data (the system might crash at that moment, for example). It is therefore generally recommended that the application protocol also implement its own acknowledgment mechanism.

The above is a fairly simplified description of the packet receive path. Let us look at the queue- and buffer-related parameters layer by layer.

1. NIC Bonding mode

When a host has more than one network card, Linux can bond several of them into a single virtual bonded interface. TCP/IP then sees only one bonded network card. Bonding multiple network cards can increase network throughput and can also improve network availability. Linux supports 7 bonding modes:

  • Mode 0 (balance-rr): round-robin policy; provides load balancing and fault tolerance
  • Mode 1 (active-backup): active-backup policy; only one network card in the bond is active, the others are on standby
  • Mode 2 (balance-xor): XOR policy; selects the slave network card by XORing the source and destination MAC addresses
  • Mode 3 (broadcast): broadcast policy; transmits every packet on all network cards
  • Mode 4 (802.3ad): IEEE 802.3ad dynamic link aggregation; creates aggregation groups that share the same speed and duplex settings
  • Mode 5 (balance-tlb): adaptive transmit load balancing
  • Mode 6 (balance-alb): adaptive load balancing

For detailed descriptions, refer to the kernel document Linux Ethernet Bonding Driver HOWTO. You can check the bonding mode of the local machine with cat /proc/net/bonding/bond0.

Developers rarely need to set the bonding mode themselves; refer to that document if you want to experiment.

2. NIC multi-queue and interrupt binding

As network bandwidth keeps increasing, a single CPU core can no longer keep up with the network card. With a multi-queue NIC driver, each queue can be bound to a different CPU core through its interrupt, making full use of multiple cores to increase packet-processing capacity.

First, check whether the network card supports multiple queues. Run lspci -vvv and find the Ethernet controller entry. If it shows MSI-X with Enable+ and Count > 1, the network card is a multi-queue card.

Next, check whether multi-queue is enabled. Run cat /proc/interrupts; if you see entries such as eth0-TxRx-0, multi-queue support is turned on.

Finally, confirm that each queue is bound to a different CPU. cat /proc/interrupts shows the interrupt number of each queue, and the file /proc/irq/${IRQ_NUM}/smp_affinity holds the CPU cores bound to interrupt IRQ_NUM. The value is a hexadecimal bitmask in which each bit represents one CPU core:

(00000001) stands for CPU0
(00000002) stands for CPU1
(00000003) stands for CPU0 and CPU1

If the binding is not balanced, you can set it manually, for example:

echo "1" > /proc/irq/99/smp_affinity
echo "2" > /proc/irq/100/smp_affinity
echo "4" > /proc/irq/101/smp_affinity
echo "8" > /proc/irq/102/smp_affinity
echo "10" > /proc/irq/103/smp_affinity
echo "20" > /proc/irq/104/smp_affinity
echo "40" > /proc/irq/105/smp_affinity
echo "80" > /proc/irq/106/smp_affinity
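The values written to smp_affinity in the commands above are hexadecimal bitmasks, which is why the fifth queue gets "10" (bit 4) rather than "5". A small sketch of the conversion in both directions, with hypothetical helper names:

```python
def cpus_to_mask(cpus):
    """Return the hexadecimal smp_affinity string for a list of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu          # one bit per CPU core
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Return the CPU numbers selected by a hexadecimal smp_affinity string."""
    # On machines with many CPUs the kernel prints comma-separated 32-bit groups.
    mask = int(mask_hex.replace(",", ""), 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    for cpu in range(8):
        print(f"CPU{cpu}: {cpus_to_mask([cpu])}")
```

For example, cpus_to_mask([0, 1]) gives "3", and mask_to_cpus("80") gives [7].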

3. RingBuffer

The RingBuffer sits between the NIC and the IP layer and is a typical FIFO (first in, first out) ring queue. It does not contain the packet data itself, only descriptors that point to sk_buff (socket kernel buffers) structures. You can use ethtool -g eth0 to view the current RingBuffer settings.

In this example, the receive queue is 4096 and the transmit queue is 256. You can observe the state of the receive and transmit queues with ifconfig:

  • RX errors: the total number of receive errors
  • RX dropped: the packet entered the RingBuffer but was discarded while being copied into memory, for system reasons such as insufficient memory
  • RX overruns: the packet was discarded at the NIC's physical layer before it reached the RingBuffer. One cause of a full RingBuffer is that the CPU cannot service interrupts in time, for example because interrupts are unevenly distributed.

When the dropped count keeps increasing, it is recommended to enlarge the RingBuffer with ethtool -G.
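The same receive counters can be read without ifconfig from /proc/net/dev. The sketch below parses that file's text; the sample content and the function name are made up for illustration, and the mapping of ifconfig's "overruns" to the "fifo" column is my understanding of how net-tools reports it, so treat it as an assumption.

```python
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0:    1000      10    1    2    3     0          0         0     2000      20    0    0    0     0       0          0
"""


def parse_rx_counters(proc_net_dev_text):
    """Map interface name -> RX errors / dropped / overruns from /proc/net/dev text."""
    counters = {}
    for line in proc_net_dev_text.splitlines()[2:]:   # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        counters[name.strip()] = {
            "errors": fields[2],     # "errs" column
            "dropped": fields[3],    # "drop" column
            "overruns": fields[4],   # "fifo" column (assumed to match ifconfig overruns)
        }
    return counters


if __name__ == "__main__":
    print(parse_rx_counters(SAMPLE))
```

On a Linux host, passing the contents of /proc/net/dev instead of SAMPLE and sampling it periodically shows whether the dropped counter is still growing.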

4. Input Packet Queue (packet receive queue)

When packets arrive faster than the kernel's TCP layer can process them, they are buffered in a queue in front of the TCP layer. The length of this receive queue is set by the parameter net.core.netdev_max_backlog.
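One way to judge whether net.core.netdev_max_backlog is too small is to look at /proc/net/softnet_stat, which has one line per CPU with hexadecimal columns; the first column counts processed packets and the second counts packets dropped because the input queue was full. The sketch below parses such text. The sample data and function name are illustrative assumptions.

```python
SAMPLE = """\
0000000a 00000002 00000001
000000ff 00000000 00000000
"""


def parse_softnet_stat(text):
    """Return a list of (processed, dropped) tuples, one per CPU line."""
    rows = []
    for line in text.splitlines():
        columns = line.split()
        if len(columns) < 2:
            continue
        processed = int(columns[0], 16)   # packets handled on this CPU
        dropped = int(columns[1], 16)     # packets dropped: input queue was full
        rows.append((processed, dropped))
    return rows


if __name__ == "__main__":
    for cpu, (processed, dropped) in enumerate(parse_softnet_stat(SAMPLE)):
        print(f"CPU{cpu}: processed={processed} dropped={dropped}")
```

A dropped column that keeps growing suggests raising net.core.netdev_max_backlog or spreading the interrupt load across more cores.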

5. recvBuffer

The receive buffer is a key parameter for tuning TCP performance. BDP (bandwidth-delay product) is the product of network bandwidth and RTT (round-trip time); it is the maximum amount of unacknowledged data that can be in flight at any moment. RTT is easy to obtain with the ping command. To reach maximum throughput, recvBuffer should be at least as large as the BDP, that is, recvBuffer >= bandwidth * RTT. Assuming a bandwidth of 100 Mbps and an RTT of 100 ms, the BDP is calculated as follows:

BDP = 100 Mbps * 100 ms = (100 / 8) MB/s * (100 / 1000) s = 1.25 MB
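The same arithmetic as a small helper, using integer math and decimal units (1 Mbps = 1,000,000 bits per second, 1 MB = 1,000,000 bytes) as the calculation above does. The function name is hypothetical.

```python
def bdp_bytes(bandwidth_mbps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes for a link speed in Mbps and an RTT in ms."""
    bits_in_flight = bandwidth_mbps * 1_000_000 * rtt_ms // 1000
    return bits_in_flight // 8


if __name__ == "__main__":
    print(bdp_bytes(100, 100))   # the 100 Mbps / 100 ms example from the text
```

For the example in the text, bdp_bytes(100, 100) returns 1250000 bytes, that is, 1.25 MB.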

Linux added automatic tuning of the receive buffer after 2.6.17. The actual size of recvBuffer floats automatically between a minimum and a maximum to balance performance against resource use, so in most cases it is not recommended to set recvBuffer manually to a fixed value.

When net.ipv4.tcp_moderate_rcvbuf is set to 1, automatic tuning is in effect, and the recvBuffer of each TCP connection is governed by the following 3-element array:

net.ipv4.tcp_rmem = <MIN> <DEFAULT> <MAX>

Initially recvBuffer is set to <DEFAULT>, and this default overrides net.core.rmem_default. recvBuffer is then adjusted dynamically between <MIN> and <MAX> according to actual conditions. With dynamic tuning enabled, set the maximum value of net.ipv4.tcp_rmem to the BDP.

When net.ipv4.tcp_moderate_rcvbuf is set to 0, or the socket option SO_RCVBUF is set, dynamic tuning is turned off. The default size of recvBuffer is then given by net.core.rmem_default, unless net.ipv4.tcp_rmem is set, in which case its default value takes precedence. Through the setsockopt() system call, recvBuffer can be set to at most net.core.rmem_max. With dynamic tuning turned off, it is recommended to set the default buffer size to the BDP.
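A minimal sketch of setting SO_RCVBUF from an application, using Python's standard socket module, which wraps setsockopt(). The helper names and the requested size are illustrative. On Linux the value read back is typically twice the requested one, because the kernel reserves room for bookkeeping overhead, and the request is capped by net.core.rmem_max; other systems may behave differently.

```python
import socket


def set_recv_buffer(sock: socket.socket, size: int) -> int:
    """Request a fixed receive buffer and return the size the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def demo(requested: int = 256 * 1024) -> int:
    """Create a TCP socket, pin its receive buffer, and return the effective size."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return set_recv_buffer(sock, requested)


if __name__ == "__main__":
    print("effective SO_RCVBUF:", demo())
```

Because setting SO_RCVBUF disables automatic tuning for that socket, it is usually worth doing only when the BDP of the path is known in advance.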

There is one more detail to note. Besides the received data itself, the buffer also needs space for additional information such as socket data structures. The optimal recvBuffer size discussed above is therefore not simply the BDP; the overhead of this additional information must be included. Linux computes the overhead from the parameter net.ipv4.tcp_adv_win_scale:

overhead = buffer / 2^tcp_adv_win_scale

If net.ipv4.tcp_adv_win_scale is 1, one half of the buffer space is used for overhead; if it is 2, one quarter is. The optimal recvBuffer size is therefore:

recvBuffer = BDP / (1 - 1 / 2^tcp_adv_win_scale)
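The sizing rule described above (a fraction 1/2^tcp_adv_win_scale of the buffer goes to overhead, and the remainder must hold the BDP) can be checked with a short calculation. This sketch covers only positive values of tcp_adv_win_scale, and the function name is hypothetical.

```python
def recv_buffer_for_bdp(bdp_bytes: int, adv_win_scale: int) -> int:
    """Smallest buffer whose non-overhead share is at least bdp_bytes."""
    if adv_win_scale <= 0:
        raise ValueError("this sketch only handles positive tcp_adv_win_scale values")
    parts = 1 << adv_win_scale                  # 2 ** tcp_adv_win_scale
    # buffer * (parts - 1) / parts >= bdp  =>  buffer >= bdp * parts / (parts - 1)
    return -(-bdp_bytes * parts // (parts - 1))  # integer ceiling division


if __name__ == "__main__":
    for scale in (1, 2):
        print(scale, recv_buffer_for_bdp(1_250_000, scale))
```

With the 1.25 MB BDP from the earlier example, a scale of 1 calls for a 2.5 MB buffer, and a scale of 2 for roughly 1.67 MB.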

Three, sending data packets

The path a packet takes when it is sent is the reverse of the receive path. A packet being sent passes through three layers from top to bottom: the application in user space, kernel space, and finally the network card driver. The application first writes data into the TCP sendBuffer; the TCP layer builds packets from the data in the sendBuffer and hands them to the IP layer. The IP layer places the packets to be sent in the QDisc (queueing discipline) queue. Once a packet has been queued in the QDisc, the sk_buff descriptor that points to it is placed in the RingBuffer output queue, and the network card driver then uses the DMA engine to send the data onto the network link.

Similarly, we sort out the parameters related to the queue buffer layer by layer.

1. sendBuffer

As with recvBuffer, the parameters related to sendBuffer are:

net.ipv4.tcp_wmem = <MIN> <DEFAULT> <MAX>
net.core.wmem_default
net.core.wmem_max

Automatic tuning of the send buffer was implemented long ago; it is always on and has no parameter to control it. If tcp_wmem is specified, its default value overrides net.core.wmem_default. sendBuffer is adjusted automatically between the minimum and maximum values of tcp_wmem. If the socket option SO_SNDBUF is set with setsockopt(), automatic tuning of the send buffer is turned off, tcp_wmem is ignored, and the value of SO_SNDBUF is capped by net.core.wmem_max.
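Both tcp_wmem and tcp_rmem are three-number settings (minimum, default, maximum). The sketch below parses such a value and shows how an automatically tuned size stays within the minimum and maximum. The type and function names are hypothetical, and the sample numbers are placeholders, not recommended settings.

```python
from typing import NamedTuple


class TcpMem(NamedTuple):
    minimum: int
    default: int
    maximum: int


def parse_tcp_mem(text: str) -> TcpMem:
    """Parse a 'min default max' sysctl value such as the content of tcp_wmem."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError(f"expected three numbers, got {text!r}")
    return TcpMem(*(int(part) for part in parts))


def clamp_autotuned(size: int, limits: TcpMem) -> int:
    """Keep a proposed buffer size within the [minimum, maximum] range."""
    return max(limits.minimum, min(size, limits.maximum))


limits = parse_tcp_mem("4096 16384 4194304")   # placeholder values for illustration

if __name__ == "__main__":
    print(limits, clamp_autotuned(8_000_000, limits))
```

On a Linux host the real values can be read from /proc/sys/net/ipv4/tcp_wmem and /proc/sys/net/ipv4/tcp_rmem and passed to parse_tcp_mem().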

2. QDisc

QDisc (queueing discipline) sits between the IP layer and the NIC's RingBuffer. As we have seen, the RingBuffer is a simple FIFO queue; this design keeps the NIC driver layer simple and fast. QDisc implements the more advanced traffic-management functions, including traffic classification, prioritization, and rate shaping. You can configure QDisc with the tc command.

The QDisc queue length is set by txqueuelen, whereas the receive-side queue length is controlled by the kernel parameter net.core.netdev_max_backlog. txqueuelen is a per-NIC setting. You can view its current value with the ifconfig command, and also use ifconfig to change it:

ifconfig eth0 txqueuelen 2000

3. RingBuffer

As on the receive path, outgoing packets also pass through the RingBuffer. Use ethtool -g eth0 to view it. The TX entry is the RingBuffer's transmit queue, that is, the length of the send queue. It is likewise set with ethtool -G.

4. TCP Segmentation and Checksum Offloading

The operating system can hand some TCP/IP functions over to the network card, in particular segmentation and checksum calculation. This saves CPU resources, and having hardware perform these operations instead of the OS improves performance. The MTU (maximum transmission unit) of Ethernet is generally 1500 bytes. Suppose the application wants to send 7300 bytes. MTU 1500 bytes - IP header 20 bytes - TCP header 20 bytes = 1460 bytes of payload, so the 7300 bytes must be split into 5 segments.

The operating system can hand the segmentation work over to the network card. Five packets are still transmitted on the wire, but CPU resources are saved and performance improves.

You can use ethtool -k eth0 to view the NIC's current offload settings. In this example, both checksum and TCP segmentation offloading are turned on. To change the NIC's offload switches, use the ethtool -K command (note the uppercase K). For example, the following command turns off TCP segmentation offload: sudo ethtool -K eth0 tso off
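The segmentation arithmetic from this section (1500-byte MTU, 20-byte IP header, 20-byte TCP header, 7300 bytes to send) can be reproduced with a short sketch. It only computes segment sizes and assumes headers without options; the function name is hypothetical.

```python
def segment_sizes(payload_len: int, mtu: int = 1500,
                  ip_header: int = 20, tcp_header: int = 20) -> list:
    """Split a payload into TCP segment payload sizes for the given MTU."""
    mss = mtu - ip_header - tcp_header        # 1460 bytes for standard Ethernet
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(segment_sizes(7300))
```

With segmentation offload enabled, the kernel hands the whole 7300-byte buffer to the NIC and the NIC performs this split; with it disabled, the kernel does the split itself before handing packets to the driver.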

5. NIC multi-queue and NIC Bonding mode

These were already covered in the section on receiving packets.

That completes the summary.