leandromoreira / linux-network-performance-parameters
Learn where some of the network sysctl variables fit into the Linux/Kernel network flow
Sometimes people are looking for sysctl cargo cult values that bring high throughput and low latency with no trade-off and that work on every occasion. That's not realistic, although we can say that newer kernel versions are very well tuned by default. In fact, you might hurt performance if you mess with the defaults.
This brief tutorial shows where some of the most used and quoted sysctl/network parameters are located in the Linux network flow. It was heavily inspired by the illustrated guide to the Linux networking stack and many of Marek Majkowski's posts.
Feel free to send corrections and suggestions! :)
Fitting the sysctl variables into the Linux network flow

Ingress - they're coming
1. Packets arrive at the NIC
2. The NIC verifies the MAC (if not in promiscuous mode) and the FCS and decides to drop or to continue
3. The NIC DMAs the packets into RAM, in a region previously prepared (mapped) by the driver
4. The NIC enqueues references to the packets in the receive ring buffer queue rx until the rx-usecs timeout or rx-frames is reached
5. The NIC raises a hard IRQ
6. The CPU runs the IRQ handler that runs the driver's code
7. The driver schedules a NAPI, clears the hard IRQ and returns
8. The driver raises a soft IRQ (NET_RX_SOFTIRQ)
9. NAPI polls data from the receive ring buffer until the netdev_budget_usecs timeout, or netdev_budget and dev_weight packets
10. Linux allocates memory for the sk_buff
11. Linux passes the skb to the kernel stack (netif_receive_skb)
12. It sets the network header, clones the skb to taps (i.e. tcpdump) and passes it to tc ingress
13. Packets are handled to a qdisc sized netdev_max_backlog with its algorithm defined by default_qdisc
14. It calls ip_rcv and packets are handed to the IP layer
15. It calls netfilter (PREROUTING)
16. It looks at the routing table: forward or deliver locally
17. If local, it calls netfilter (LOCAL_IN)
18. It calls the L4 protocol (for instance tcp_v4_rcv)
19. It finds the right socket and runs the TCP finite state machine
20. The packet is enqueued to the receive buffer, sized by the tcp_rmem rules
    - If tcp_moderate_rcvbuf is enabled, the kernel auto-tunes the receive buffer
21. The kernel signals that data is available to the apps (epoll or any polling system)
22. The application wakes up and reads the data

Egress - they're leaving
1. The application sends a message (sendmsg or other)
2. TCP allocates an skb and enqueues it to the socket write buffer of tcp_wmem size
3. It builds the TCP header and calls the L3 handler (in this case ipv4, on tcp_write_xmit and tcp_transmit_skb)
4. L3 (ip_queue_xmit) does its work: builds the IP header and calls netfilter (LOCAL_OUT)
5. It calls the output route action and then netfilter (POST_ROUTING)
6. It fragments the packet if needed (ip_output)
7. It calls the L2 send function (dev_queue_xmit)
8. It feeds the output (qdisc) queue of txqueuelen length with its algorithm default_qdisc
9. The driver code enqueues the packets in the ring buffer tx
10. The driver raises a soft IRQ (NET_TX_SOFTIRQ) after the tx-usecs timeout or tx-frames
11. The NIC fetches the packets (via DMA) from RAM and transmits them
12. After the transmission the NIC raises a hard IRQ to signal its completion
13. The driver handles this IRQ and schedules (soft IRQ) the NAPI poll system, which frees the transmitted buffers
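Each stage of the two flows above has a counter or tool you can consult when hunting drops. The bash sketch below walks the path with one observation point per stage; eth0 as the device name is an assumption (pass yours as the first argument), and most of these commands are detailed in the sections that follow.

```bash
#!/usr/bin/env bash
# Observation points along the RX/TX path sketched above.
# Assumes the interface name eth0; pass yours as the first argument.
DEV="${1:-eth0}"

echo "== ring buffers (NIC rx/tx queues): sizes and NIC-level drops =="
ethtool -g "$DEV"
ethtool -S "$DEV" | grep -i -e err -e drop -e over -e miss | grep -v ": 0"

echo "== hard IRQs: how interrupts spread across CPUs =="
grep "$DEV" /proc/interrupts

echo "== soft IRQ budget: per-CPU hex counters (decoder sketch further below) =="
cat /proc/net/softnet_stat

echo "== qdisc: queue drops, backlog and requeues =="
tc -s qdisc ls dev "$DEV"

echo "== sockets: per-socket buffer memory (skmem) =="
ss -tm | head -20
```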
What, Why and How - network and sysctl parameters

Ring Buffer - rx,tx
- What: the driver's receive/send queues, a single queue or multiple queues with a fixed size, usually implemented as FIFOs and located in RAM
- Why: they buffer bursts of packets so none are dropped; you may need to increase these queues when you see drops or overruns, i.e. more packets arriving than the kernel is able to consume, at the possible cost of increased latency
- How:
  - Check command: ethtool -g ethX
  - Change command: ethtool -G ethX rx value tx value
  - How to monitor: ethtool -S ethX | grep -e "err" -e "drop" -e "over" -e "miss" -e "timeout" -e "reset" -e "restar" -e "collis" | grep -v "\: 0"

Interrupt Coalescence (IC) - rx-usecs, tx-usecs, rx-frames, tx-frames (hardware IRQ)
- What: the number of microseconds/frames to wait before raising a hard IRQ; from the NIC's perspective, it will DMA data packets until this timeout or this number of frames is reached
- Why: fewer hard IRQs reduce CPU usage and can increase throughput, at the cost of latency
- How:
  - Check command: ethtool -c ethX
  - Change command: ethtool -C ethX rx-usecs value tx-usecs value
  - How to monitor: cat /proc/interrupts

Interrupt Coalescing (soft IRQ) and Ingress QDisc
- What: netdev_budget_usecs is the maximum number of microseconds in one NAPI polling cycle. Polling will exit when either netdev_budget_usecs have elapsed during the poll cycle or the number of packets processed reaches netdev_budget.
- How:
  - Check command: sysctl net.core.netdev_budget_usecs
  - Change command: sysctl -w net.core.netdev_budget_usecs value
  - How to monitor: cat /proc/net/softnet_stat; or a better tool (see the decoder sketch below). The fields to watch are dropped (# of packets that were dropped because netdev_max_backlog was exceeded) and squeezed (# of times ksoftirq ran out of netdev_budget or time slice with work remaining).

- What: netdev_budget is the maximum number of packets taken from all interfaces in one polling cycle (NAPI poll). In one polling cycle interfaces which are registered to polling are probed in a round-robin manner. Also, a polling cycle may not exceed netdev_budget_usecs microseconds, even if netdev_budget has not been exhausted.
- How:
  - Check command: sysctl net.core.netdev_budget
  - Change command: sysctl -w net.core.netdev_budget value
  - How to monitor: cat /proc/net/softnet_stat; or a better tool

- What: dev_weight is the maximum number of packets the kernel can handle on a NAPI interrupt; it is a per-CPU variable. For drivers that support LRO or GRO_HW, a hardware-aggregated packet is counted as one packet in this context.
- How:
  - Check command: sysctl net.core.dev_weight
  - Change command: sysctl -w net.core.dev_weight value
  - How to monitor: cat /proc/net/softnet_stat; or a better tool

- What: netdev_max_backlog is the maximum number of packets queued on the INPUT side (the ingress qdisc) when the interface receives packets faster than the kernel can process them.
- How:
  - Check command: sysctl net.core.netdev_max_backlog
  - Change command: sysctl -w net.core.netdev_max_backlog value
  - How to monitor: cat /proc/net/softnet_stat; or a better tool

Egress QDisc - txqueuelen and default_qdisc
- What: txqueuelen is the maximum number of packets queued on the OUTPUT side.
- How:
  - Check command: ifconfig ethX
  - Change command: ifconfig ethX txqueuelen value
  - How to monitor: ip -s link

- What: default_qdisc is the default queuing discipline to use for network devices.
- How:
  - Check command: sysctl net.core.default_qdisc
  - Change command: sysctl -w net.core.default_qdisc value
  - How to monitor: tc -s qdisc ls dev ethX

TCP Read and Write Buffers/Queues
- What: tcp_rmem - min (size used under memory pressure), default (initial size), max (maximum size) - the size of the receive buffer used by TCP sockets.
- How:
  - Check command: sysctl net.ipv4.tcp_rmem
  - Change command: sysctl -w net.ipv4.tcp_rmem="min default max"; when changing the default value, remember to restart your user-space app (i.e. your web server, nginx, etc)
  - How to monitor: cat /proc/net/sockstat (a buffer-inspection sketch follows after this section)

- What: tcp_wmem - min (size used under memory pressure), default (initial size), max (maximum size) - the size of the send buffer used by TCP sockets.
- How:
  - Check command: sysctl net.ipv4.tcp_wmem
  - Change command: sysctl -w net.ipv4.tcp_wmem="min default max"; when changing the default value, remember to restart your user-space app (i.e. your web server, nginx, etc)
  - How to monitor: cat /proc/net/sockstat

- What: tcp_moderate_rcvbuf - if set, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer.
- How:
  - Check command: sysctl net.ipv4.tcp_moderate_rcvbuf
  - Change command: sysctl -w net.ipv4.tcp_moderate_rcvbuf value
  - How to monitor: cat /proc/net/sockstat
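The soft IRQ subsections above all say "or a better tool" for /proc/net/softnet_stat, whose raw form is one row of hex counters per CPU. Below is a minimal sketch of such a tool in bash; it decodes only the first three fields, since the meaning of later columns varies across kernel versions.

```bash
#!/usr/bin/env bash
# Decode /proc/net/softnet_stat: one row per CPU, hex fields.
# Field 1: packets processed; field 2: dropped (netdev_max_backlog exceeded);
# field 3: time_squeeze (poll ended by netdev_budget/netdev_budget_usecs
# with work remaining). Later fields are kernel-version dependent.
cpu=0
while read -r processed dropped squeezed _; do
  printf 'cpu%-3d processed=%-12d dropped=%-8d squeezed=%d\n' \
         "$cpu" "0x$processed" "0x$dropped" "0x$squeezed"
  cpu=$((cpu + 1))
done < /proc/net/softnet_stat
```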
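For the TCP buffer sections just above, cat /proc/net/sockstat only gives a global view. Here is a hedged sketch for digging one level deeper; note that the skmem output format of ss -tm differs slightly between iproute2 versions.

```bash
#!/usr/bin/env bash
# Compare configured TCP buffer limits with what sockets actually use.
echo "== configured limits: min default max, in bytes =="
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_moderate_rcvbuf

echo "== global TCP memory, in pages =="
grep '^TCP' /proc/net/sockstat

echo "== per-socket memory: skmem r/rb = rcv used/limit, t/tb = snd used/limit =="
ss -tm
```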
Honorable mentions - TCP FSM and congestion algorithm
- sysctl net.core.somaxconn - provides an upper limit on the value of the backlog parameter passed to the listen() function, known in userspace as SOMAXCONN. If you change this value, you should also change your application to a compatible value (i.e. nginx backlog); see the backlog check sketch after this list.
- cat /proc/sys/net/ipv4/tcp_fin_timeout - specifies the number of seconds to wait for a final FIN packet before the socket is forcibly closed. This is strictly a violation of the TCP specification but required to prevent denial-of-service attacks.
- cat /proc/sys/net/ipv4/tcp_available_congestion_control - shows the available congestion control choices that are registered.
- cat /proc/sys/net/ipv4/tcp_congestion_control - sets the congestion control algorithm to be used for new connections.
- cat /proc/sys/net/ipv4/tcp_max_syn_backlog - sets the maximum number of queued connection requests which have still not received an acknowledgment from the connecting client; if this number is exceeded, the kernel will begin dropping requests.
- cat /proc/sys/net/ipv4/tcp_syncookies - enables/disables SYN cookies, useful for protecting against SYN flood attacks.
- cat /proc/sys/net/ipv4/tcp_slow_start_after_idle - enables/disables TCP slow start after idle periods.
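Since somaxconn and tcp_max_syn_backlog only matter once the accept queue fills up, here is a small sketch to check for that pressure. The reading of ss's columns for LISTEN sockets (Recv-Q = current accept-queue length, Send-Q = its limit) is standard, but the exact netstat -st wording varies by distribution.

```bash
#!/usr/bin/env bash
# Is any listen() backlog overflowing?
echo "somaxconn = $(cat /proc/sys/net/core/somaxconn)"
echo "tcp_max_syn_backlog = $(cat /proc/sys/net/ipv4/tcp_max_syn_backlog)"

echo "== LISTEN sockets: Recv-Q = queued now, Send-Q = backlog limit =="
ss -lnt

echo "== cumulative accept-queue overflows and SYN drops =="
netstat -st | grep -i -e 'listen' -e 'SYN'
```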
How to monitor:
- netstat -atn | awk '/tcp/ {print $6}' | sort | uniq -c - summary by state
- ss -neopt state time-wait | wc -l - counters by a specific state: established, syn-sent, syn-recv, fin-wait-1, fin-wait-2, time-wait, closed, close-wait, last-ack, listening, closing
- netstat -st - tcp stats summary
- nstat -a - human-friendly tcp stats summary
- cat /proc/net/sockstat - summarized socket stats
- cat /proc/net/tcp - detailed stats, see the meaning of each field in the kernel docs
- cat /proc/net/netstat - ListenOverflows and ListenDrops are important fields to keep an eye on (a watcher sketch follows below)
- cat /proc/net/netstat | awk '(f==0) { i=1; while ( i<=NF) {n[i] = $i; i++ }; f=1; next} (f==1) { i=2; while ( i<=NF) { printf "%s = %d\n", n[i], $i; i++ }; f=0 }' | grep -v "= 0" - a human-readable /proc/net/netstat
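As noted above, ListenOverflows and ListenDrops are worth watching over time, not just once. Here is a small sketch that samples them every second using the same header/value pairing trick as the awk one-liner above; field names come from the TcpExt section of /proc/net/netstat.

```bash
#!/usr/bin/env bash
# Print ListenOverflows/ListenDrops once a second; a steadily growing number
# during a load test means the accept queue (somaxconn/backlog) is too small
# or the app is accepting too slowly.
while sleep 1; do
  printf '%s ' "$(date +%T)"
  awk '/^TcpExt:/ {
         if (!seen) { for (i = 1; i <= NF; i++) name[i] = $i; seen = 1 }
         else       { for (i = 2; i <= NF; i++)
                        if (name[i] ~ /^Listen/) printf "%s=%s ", name[i], $i }
       } END { print "" }' /proc/net/netstat
done
```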
(TCP finite state machine diagram) Source: https://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg