Last night a customer call us because was having issue on the server and loss of performance on the MySQL server.
When I join the bridge I ask to the customer a quick report of what was his experience and concern so far.
Luckily all the participants were technically skilled and team was compose by, SAs, DBAs, Team leader, so I was able to have a good overview in short time.
Mainly there were two fronts, one was the problem on the server in the network layer, the other in MySQL that was not able to manage efficiently the number of thread opening requests.
The machine has a single NIC, Storage attach by fibre channel, 8CPU hyper threading, 64GB RAM and finally heavy usage of NFS.
The same server was in the past using the volumes on the NFS also for MySQL, but now everything was moved to the attached storage.
As said the issue was that NIC was reporting drop packet wand MySQL was having issue to manage the number of threads, the last ones were between 200 - to 1000 connection requests.
As active threads the server was managing 200-300 threads, which was not enough.
I start reviewing the server and NIC issue, talking with the SAs they report that the NIC Receive buffer, was already set to maximum of 4096k.
So starting the investigation from there I review the back_log net.ipv4.tcp_max_syn_backlog, and all the other parameters related to TCP buffer:
CURRENT TCP buffer setting
net.ipv4.tcp_mtu_probing = 0
net.core.rmem_max = 131071
net.core.wmem_max = 131071
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
The settings were misconfigured given that the tcp value cannot override the core values.
As such the settings for the tcp auto tuning were invalid for the max limit.
Given those values were not correct for a machine supporting high traffic I suggest:
Suggested TCP buffer settings
#TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
#Linux auto-tuning TCP buffer
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
#length of the processor input queue
net.core.netdev_max_backlog = 30000
#default congestion control is htcp
About htcp see the reference to the document explaining in details the algorithm.
From the mysql point, I review few parameters that would have a direct relation with the Threads.
back_log = 1024
thread_cache_size = 512
thread_stack = 256K
wait_timeout = 600
I decide to set value of backlog as the maximum queue we have seen, move the value of the thread_cache_size from 8 to 1/4 of the max number of connection,
then few correction given wait_timout was set as default and thread_stack was set as for 32bit machines.
When we apply the values, I was expecting to see the drop packet issue solve, instead MySQL was managing much better the incoming connection, but the drop packet were still there.
Also the bad network behaviour, was preventing to the data to flow as it was suppose to be.
We then focus on why this was happening reviewing all the changes applied.
After few investigations and researches the customer, realize that the value for the receive window on the NIC, was not really applied,
because also if it is declared as dynamic value by Cisco, the facto it require a reboot of the machines.
We reboot it, and the NIC was now working properly. Data was floating fine with no drop packet.
MySQL was managing the incoming thread efficiently, but I notice, after the server warm-up, that the performance were still not optimal.
So doing some other tuning, I set thread_cache_size to 1024 paring the back_log number,
at this point MySQL was managing the incoming new request very efficiently, we had a jump to of Threads_cached to 850 with obvious flotation between 8 threads up to the maximum,
Threads_created the same and then just a bit larger then the maximum number of created connections, and finally Threads_running jump from 200 - 300 to 600 - 700.
The drop packet is a sign of insufficient buffer, either from the NIC or the TCP, remember to review the parameters related.
"Surprising" also if the thread creation process was alway declare as "easy" and "light" a more aggressive thread settings makes MySQL act more properly.