背景: 项目使用了一个老旧的第三方mysql库,在mysql 5.7正常运行,升级到mysql 8.0后,出现奇怪的报错。
最终定位问题在gen_tcp:recv的处理上,第三方库只判断gen_tcp:recv有没有返回数据,而没有对返回数据进行严谨的判断。
而触发这个问题的原因是mysql的超时机制处理上,版本存在差异。
为了模拟mysql超时断开连接的情况,我们将mysql的wait_timeout设置成30秒,那么如果客户端在30秒内没有与mysql进行通讯,mysql会作超时处理,断开连接。
登录mysql,输入如下命令,设置wait_timeout为30秒
set global wait_timeout = 30;
接下来,我们通过抓包分析其中的差异。
以下是erlang与mysql 5.7的通讯包:
No. Time Source Destination Protocol Length Info
299 7.093504 192.168.91.129 172.18.30.109 TCP 66 40233 → 3306 [SYN] Seq=0 Win=11680 Len=0 MSS=1460 SACK_PERM=1 WS=512
301 7.093642 172.18.30.109 192.168.91.129 TCP 58 3306 → 40233 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
302 7.093747 172.18.30.109 192.168.91.129 MySQL 136 Server Greeting proto=10 version=5.7.44-log
303 7.093936 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [ACK] Seq=1 Ack=1 Win=11680 Len=0
305 7.093972 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [ACK] Seq=1 Ack=83 Win=11598 Len=0
306 7.094075 192.168.91.129 172.18.30.109 MySQL 116 Login Request user=root
307 7.094109 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=83 Ack=63 Win=64240 Len=0
308 7.094148 172.18.30.109 192.168.91.129 MySQL 65 Response OK
309 7.094271 192.168.91.129 172.18.30.109 MySQL 92 Request Query
310 7.094307 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=94 Ack=101 Win=64240 Len=0
311 7.094378 172.18.30.109 192.168.91.129 MySQL 65 Response OK
312 7.094470 192.168.91.129 172.18.30.109 MySQL 75 Request Query
313 7.094507 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=105 Ack=122 Win=64240 Len=0
314 7.094542 172.18.30.109 192.168.91.129 MySQL 65 Response OK
315 7.094659 192.168.91.129 172.18.30.109 MySQL 59 Request Ping
316 7.094689 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=116 Ack=127 Win=64240 Len=0
317 7.094714 172.18.30.109 192.168.91.129 MySQL 65 Response OK
318 7.094873 192.168.91.129 172.18.30.109 MySQL 84 Request Query
319 7.094906 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=127 Ack=157 Win=64240 Len=0
320 7.095005 172.18.30.109 192.168.91.129 MySQL 117 Response
325 7.135808 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [ACK] Seq=157 Ack=190 Win=11491 Len=0
515 27.689432 192.168.91.129 172.18.30.109 MySQL 59 Request Ping
516 27.689550 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=190 Ack=162 Win=64240 Len=0
517 27.689640 172.18.30.109 192.168.91.129 MySQL 65 Response OK
518 27.689767 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [ACK] Seq=162 Ack=201 Win=11480 Len=0
522 27.690005 192.168.91.129 172.18.30.109 MySQL 84 Request Query
523 27.690054 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=201 Ack=192 Win=64240 Len=0
524 27.690176 172.18.30.109 192.168.91.129 MySQL 117 Response
526 27.730803 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [ACK] Seq=192 Ack=264 Win=11417 Len=0
784 57.691504 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [FIN, PSH, ACK] Seq=264 Ack=192 Win=64240 Len=0
785 57.731575 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [ACK] Seq=192 Ack=265 Win=11416 Len=0
790 58.641446 192.168.91.129 172.18.30.109 MySQL 59 Request Ping
791 58.641530 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [FIN, ACK] Seq=197 Ack=265 Win=11416 Len=0
792 58.641572 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=265 Ack=197 Win=64240 Len=0
794 58.641590 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [ACK] Seq=265 Ack=198 Win=64239 Len=0
526和784帧相差30秒,达到mysql设置的wait_timeout
526 27.730803 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [ACK] Seq=192 Ack=264 Win=11417 Len=0
784 57.691504 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [FIN, PSH, ACK] Seq=264 Ack=192 Win=64240 Len=0
在784帧mysql主动断开连接
784 57.691504 172.18.30.109 192.168.91.129 TCP 54 3306 → 40233 [FIN, PSH, ACK] Seq=264 Ack=192 Win=64240 Len=0
这时候erlang调用gen_tcp:send发送ping命令,返回ok。接下来调用gen_tcp:recv的时候,返回了{error, closed}。
790 58.641446 192.168.91.129 172.18.30.109 MySQL 59 Request Ping
在791帧,可以看到真正断开连接。
791 58.641530 192.168.91.129 172.18.30.109 TCP 54 40233 → 3306 [FIN, ACK] Seq=197 Ack=265 Win=11416 Len=0
接下来看看mysql 8.0的通讯包
No. Time Source Destination Protocol Length Info
20652 108.939846 192.168.91.129 172.18.30.109 MySQL 59 Request Ping
20653 108.939892 172.18.30.109 192.168.91.129 TCP 54 3306 → 63886 [ACK] Seq=3575 Ack=1169 Win=64240 Len=0
20654 108.939931 172.18.30.109 192.168.91.129 MySQL 65 Response OK
20655 108.940004 192.168.91.129 172.18.30.109 TCP 54 63886 → 3306 [ACK] Seq=1169 Ack=3586 Win=10777 Len=0
20656 108.940100 192.168.91.129 172.18.30.109 MySQL 148 Request Query
20657 108.940140 172.18.30.109 192.168.91.129 TCP 54 3306 → 63886 [ACK] Seq=3586 Ack=1263 Win=64240 Len=0
20658 108.940316 172.18.30.109 192.168.91.129 MySQL 637 Response
20699 108.980314 192.168.91.129 172.18.30.109 TCP 54 63886 → 3306 [ACK] Seq=1263 Ack=4169 Win=10777 Len=0
20784 138.940408 172.18.30.109 192.168.91.129 MySQL 203 Response Error 4031
20795 138.980294 192.168.91.129 172.18.30.109 TCP 54 63886 → 3306 [ACK] Seq=1263 Ack=4319 Win=10777 Len=0
20908 194.786344 192.168.91.129 172.18.30.109 MySQL 59 Request Ping
20910 194.786432 172.18.30.109 192.168.91.129 TCP 54 3306 → 63886 [ACK] Seq=4319 Ack=1268 Win=64240 Len=0
20911 194.786604 192.168.91.129 172.18.30.109 MySQL 84 Request Query
20912 194.786647 192.168.91.129 172.18.30.109 TCP 54 63886 → 3306 [FIN, ACK] Seq=1298 Ack=4319 Win=10777 Len=0
20913 194.786654 172.18.30.109 192.168.91.129 TCP 54 3306 → 63886 [ACK] Seq=4319 Ack=1298 Win=64210 Len=0
20914 194.786670 172.18.30.109 192.168.91.129 TCP 54 3306 → 63886 [ACK] Seq=4319 Ack=1299 Win=64209 Len=0
20699和20784帧相差30秒,达到mysql设置的wait_timeout
20699 108.980314 192.168.91.129 172.18.30.109 TCP 54 63886 → 3306 [ACK] Seq=1263 Ack=4169 Win=10777 Len=0
20784 138.940408 172.18.30.109 192.168.91.129 MySQL 203 Response Error 4031
在20784帧mysql推送了一个数据包,错误码为4031,告诉erlang,连接已经超时。
20784 138.940408 172.18.30.109 192.168.91.129 MySQL 203 Response Error 4031
到了20908帧,erlang调用gen_tcp:send发送ping包,返回ok。接下来调用gen_tcp:recv读取数据,这时候读取的是4031这个数据包,解出来的数据为:
{error_packet,0,4031,<<"HY000">>,"The client was disconnected by the server because of inactivity. See wait_timeout and interactive_timeout for configuring this behavior."}
接下来erlang调用gent_tcp:send发送查询语句,返回ok。再调用gen_tcp:recv读取数据,返回{error, closed}。
20908 194.786344 192.168.91.129 172.18.30.109 MySQL 59 Request Ping
查看20912帧,真正断开连接。
20912 194.786647 192.168.91.129 172.18.30.109 TCP 54 63886 → 3306 [FIN, ACK] Seq=1298 Ack=4319 Win=10777 Len=0
第三方库在发送ping包的时候,在执行gen_tcp:recv的时候,接收到4031这个错误包,但没有进行严谨判断。认为此连接是存活的,导致接下来的查询报错。