Skip to content

升级mysql引发的第三方库问题

Published:

背景: 项目使用了一个老旧的第三方mysql库,在mysql 5.7正常运行,升级到mysql 8.0后,出现奇怪的报错。

最终定位问题在gen_tcp:recv的处理上,第三方库只判断gen_tcp:recv有没有返回数据,而没有对返回数据进行严谨的判断。

而触发这个问题的原因是mysql的超时机制处理上,版本存在差异。

为了模拟mysql超时断开连接的情况,我们将mysql的wait_timeout设置成30秒,那么如果客户端在30秒内没有与mysql进行通讯,mysql会作超时处理,断开连接。

登录mysql,输入如下命令,设置wait_timeout为30秒

set global wait_timeout = 30;

接下来,我们通过抓包分析其中的差异。

以下是erlang与mysql 5.7的通讯包:

No.	Time		Source		Destination	Protocol Length	Info
299	7.093504	192.168.91.129	172.18.30.109	TCP	66	40233 → 3306 [SYN] Seq=0 Win=11680 Len=0 MSS=1460 SACK_PERM=1 WS=512
301	7.093642	172.18.30.109	192.168.91.129	TCP	58	3306 → 40233 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
302	7.093747	172.18.30.109	192.168.91.129	MySQL	136	Server Greeting proto=10 version=5.7.44-log
303	7.093936	192.168.91.129	172.18.30.109	TCP	54	40233 → 3306 [ACK] Seq=1 Ack=1 Win=11680 Len=0
305	7.093972	192.168.91.129	172.18.30.109	TCP	54	40233 → 3306 [ACK] Seq=1 Ack=83 Win=11598 Len=0
306	7.094075	192.168.91.129	172.18.30.109	MySQL	116	Login Request user=root
307	7.094109	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=83 Ack=63 Win=64240 Len=0
308	7.094148	172.18.30.109	192.168.91.129	MySQL	65	Response OK
309	7.094271	192.168.91.129	172.18.30.109	MySQL	92	Request Query
310	7.094307	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=94 Ack=101 Win=64240 Len=0
311	7.094378	172.18.30.109	192.168.91.129	MySQL	65	Response OK
312	7.094470	192.168.91.129	172.18.30.109	MySQL	75	Request Query
313	7.094507	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=105 Ack=122 Win=64240 Len=0
314	7.094542	172.18.30.109	192.168.91.129	MySQL	65	Response OK
315	7.094659	192.168.91.129	172.18.30.109	MySQL	59	Request Ping
316	7.094689	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=116 Ack=127 Win=64240 Len=0
317	7.094714	172.18.30.109	192.168.91.129	MySQL	65	Response OK
318	7.094873	192.168.91.129	172.18.30.109	MySQL	84	Request Query
319	7.094906	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=127 Ack=157 Win=64240 Len=0
320	7.095005	172.18.30.109	192.168.91.129	MySQL	117	Response
325	7.135808	192.168.91.129	172.18.30.109	TCP	54	40233 → 3306 [ACK] Seq=157 Ack=190 Win=11491 Len=0
515	27.689432	192.168.91.129	172.18.30.109	MySQL	59	Request Ping
516	27.689550	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=190 Ack=162 Win=64240 Len=0
517	27.689640	172.18.30.109	192.168.91.129	MySQL	65	Response OK
518	27.689767	192.168.91.129	172.18.30.109	TCP	54	40233 → 3306 [ACK] Seq=162 Ack=201 Win=11480 Len=0
522	27.690005	192.168.91.129	172.18.30.109	MySQL	84	Request Query
523	27.690054	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=201 Ack=192 Win=64240 Len=0
524	27.690176	172.18.30.109	192.168.91.129	MySQL	117	Response
526	27.730803	192.168.91.129	172.18.30.109	TCP	54	40233 → 3306 [ACK] Seq=192 Ack=264 Win=11417 Len=0
784	57.691504	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [FIN, PSH, ACK] Seq=264 Ack=192 Win=64240 Len=0
785	57.731575	192.168.91.129	172.18.30.109	TCP	54	40233 → 3306 [ACK] Seq=192 Ack=265 Win=11416 Len=0
790	58.641446	192.168.91.129	172.18.30.109	MySQL	59	Request Ping
791	58.641530	192.168.91.129	172.18.30.109	TCP	54	40233 → 3306 [FIN, ACK] Seq=197 Ack=265 Win=11416 Len=0
792	58.641572	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=265 Ack=197 Win=64240 Len=0
794	58.641590	172.18.30.109	192.168.91.129	TCP	54	3306 → 40233 [ACK] Seq=265 Ack=198 Win=64239 Len=0

526和784帧相差30秒,达到mysql设置的wait_timeout

526     27.730803       192.168.91.129  172.18.30.109   TCP     54      40233 → 3306 [ACK] Seq=192 Ack=264 Win=11417 Len=0
784     57.691504       172.18.30.109   192.168.91.129  TCP     54      3306 → 40233 [FIN, PSH, ACK] Seq=264 Ack=192 Win=64240 Len=0

在784帧mysql主动断开连接

784     57.691504       172.18.30.109   192.168.91.129  TCP     54      3306 → 40233 [FIN, PSH, ACK] Seq=264 Ack=192 Win=64240 Len=0

这时候erlang调用gen_tcp:send发送ping命令,返回ok。接下来调用gen_tcp:recv的时候,返回了{error, closed}。

790     58.641446       192.168.91.129  172.18.30.109   MySQL   59      Request Ping

在791帧,可以看到真正断开连接。

791     58.641530       192.168.91.129  172.18.30.109   TCP     54      40233 → 3306 [FIN, ACK] Seq=197 Ack=265 Win=11416 Len=0

接下来看看mysql 8.0的通讯包

No.	Time		Source		Destination	Protocol Length	Info
20652	108.939846	192.168.91.129	172.18.30.109	MySQL	59	Request Ping
20653	108.939892	172.18.30.109	192.168.91.129	TCP	54	3306 → 63886 [ACK] Seq=3575 Ack=1169 Win=64240 Len=0
20654	108.939931	172.18.30.109	192.168.91.129	MySQL	65	Response OK
20655	108.940004	192.168.91.129	172.18.30.109	TCP	54	63886 → 3306 [ACK] Seq=1169 Ack=3586 Win=10777 Len=0
20656	108.940100	192.168.91.129	172.18.30.109	MySQL	148	Request Query
20657	108.940140	172.18.30.109	192.168.91.129	TCP	54	3306 → 63886 [ACK] Seq=3586 Ack=1263 Win=64240 Len=0
20658	108.940316	172.18.30.109	192.168.91.129	MySQL	637	Response
20699	108.980314	192.168.91.129	172.18.30.109	TCP	54	63886 → 3306 [ACK] Seq=1263 Ack=4169 Win=10777 Len=0
20784	138.940408	172.18.30.109	192.168.91.129	MySQL	203	Response Error 4031
20795	138.980294	192.168.91.129	172.18.30.109	TCP	54	63886 → 3306 [ACK] Seq=1263 Ack=4319 Win=10777 Len=0
20908	194.786344	192.168.91.129	172.18.30.109	MySQL	59	Request Ping
20910	194.786432	172.18.30.109	192.168.91.129	TCP	54	3306 → 63886 [ACK] Seq=4319 Ack=1268 Win=64240 Len=0
20911	194.786604	192.168.91.129	172.18.30.109	MySQL	84	Request Query
20912	194.786647	192.168.91.129	172.18.30.109	TCP	54	63886 → 3306 [FIN, ACK] Seq=1298 Ack=4319 Win=10777 Len=0
20913	194.786654	172.18.30.109	192.168.91.129	TCP	54	3306 → 63886 [ACK] Seq=4319 Ack=1298 Win=64210 Len=0
20914	194.786670	172.18.30.109	192.168.91.129	TCP	54	3306 → 63886 [ACK] Seq=4319 Ack=1299 Win=64209 Len=0

20699和20784帧相差30秒,达到mysql设置的wait_timeout

20699   108.980314      192.168.91.129  172.18.30.109   TCP     54      63886 → 3306 [ACK] Seq=1263 Ack=4169 Win=10777 Len=0
20784   138.940408      172.18.30.109   192.168.91.129  MySQL   203     Response Error 4031

在20784帧mysql推送了一个数据包,错误码为4031,告诉erlang,连接已经超时。

20784   138.940408      172.18.30.109   192.168.91.129  MySQL   203     Response Error 4031

到了20908帧,erlang调用gen_tcp:send发送ping包,返回ok。接下来调用gen_tcp:recv读取数据,这时候读取的是4031这个数据包,解出来的数据为:

{error_packet,0,4031,<<"HY000">>,"The client was disconnected by the server because of inactivity. See wait_timeout and interactive_timeout for configuring this behavior."}

接下来erlang调用gent_tcp:send发送查询语句,返回ok。再调用gen_tcp:recv读取数据,返回{error, closed}。

20908   194.786344      192.168.91.129  172.18.30.109   MySQL   59      Request Ping

查看20912帧,真正断开连接。

20912   194.786647      192.168.91.129  172.18.30.109   TCP     54      63886 → 3306 [FIN, ACK] Seq=1298 Ack=4319 Win=10777 Len=0

第三方库在发送ping包的时候,在执行gen_tcp:recv的时候,接收到4031这个错误包,但没有进行严谨判断。认为此连接是存活的,导致接下来的查询报错。


Next Post
如何优雅地调整时间?faketime的使用和原理分析