OpenVPN's mysterious disconnections and how to solve them

1. The problem

There is no way around it: this is an OpenVPN problem. Almost everyone who uses OpenVPN has run into it, and many people have asked about it online, yet nobody has offered a solution. Quite a few posts even say the author gave up on OpenVPN because of it. To be honest, I have lived with this problem for more than two years. It has bothered me ever since I first touched OpenVPN, and I found no satisfying answer on any of the major forums at home or abroad. These past few days I finally had some spare time and decided to dig into it myself, and I am grateful to my company for providing the environment. In the end I made a breakthrough, and as always I am posting the result so that people who face this problem later have one more possible answer.

By the way, this does not mean that nobody online has ever solved the problem. I can only read and understand posts and articles in Chinese or English. I did read some Japanese ones with my wife translating, but a great deal has been written by native speakers of German, Italian, Korean and other languages that I can neither find nor understand. For the sake of reach I should have written this article in English, but my English is so poor that I feared even Chinese readers would not understand it.

The problem is this: when OpenVPN connects across the public Internet, it drops the connection from time to time for no apparent reason, not often and not always. Because most people use the Windows build as the OpenVPN client, I assumed at first that Windows itself was at fault. But the same thing happened when I connected with a Linux client, so Windows had largely been blamed unfairly (not entirely unfairly; Linux at least has no DHCP lease problem). Since I now had the environment, I started tinkering, and so began another nerve-racking 48 hours.

Below is the server log at the moment a client disconnects (frequent disconnections produce frequent log entries; only one excerpt is shown):

    2013-07-24 16:53:15 MULTI: REAP range 208 - 224
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 5 1 2 3 4
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 1
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=5 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 6 5 2 3 4
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 2
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=6 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 7 5 6 3 4
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 3
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=7 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 8 5 6 7 4
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 4
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=8 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 9 5 6 7 8
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 5
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=9 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 10 9 6 7 8
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 6
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=10 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 11 9 10 7 8
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 7
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=11 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 9 10 11 8
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 9
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 10 11 8
    2013-07-24 16:53:17 MULTI: REAP range 240 - 256
    2013-07-24 16:53:17 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:17 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 10
    2013-07-24 16:53:17 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 11 8
    2013-07-24 16:53:17 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:17 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 11
    2013-07-24 16:53:17 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 8
    2013-07-24 16:53:18 MULTI: REAP range 0 - 16
    2013-07-24 16:53:18 Test证书/218.242.253.131:18014 TLS: tls_pre_encrypt: key_id=0
    2013-07-24 16:53:18 Test证书/218.242.253.131:18014 SENT PING
    2013-07-24 16:53:18 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 8
    2013-07-24 16:53:18 Test证书/218.242.253.131:18014 UDPv4 WRITE 53 to 218.242.253.131:18014: P_DATA_V1 kid=0 DATA len=52
    2013-07-24 16:53:18 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 8
    ... (for 60 seconds no ACK for ID 8 arrives, so every line is still "ACK output sequence broken: 12 8")
    2013-07-24 16:54:15 Test证书/218.242.253.131:18014 TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
    2013-07-24 16:54:15 Test证书/218.242.253.131:18014 TLS Error: TLS handshake failed with peer 0.0.0.0

Every so often the connection drops, and the reconnect does not always succeed! So there are two problems here:
a. The connection drops while it is working normally (the ping-restart case, not shown in the log above).
b. The reconnect fails (shown in the log above).

2. Analysis

OpenVPN over UDP is high-maintenance, but to keep retransmissions from stacking on top of each other in a harsh environment, UDP really is the one to use. However, the UDP reliable layer that OpenVPN implements is a highly simplified "in-order, acknowledged connection" layer: it only guarantees that data arrives in order and that there is an acknowledgement mechanism, which is no match for TCP.
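Then again, if you look at TCP's original design, you will find that the essence of TCP is exactly what OpenVPN's reliable layer does; the later complexity is all optimization for particular situations.

As in TCP, ACKs are not themselves acknowledged, which obliges the sender to retransmit the unacknowledged packets in its sliding window, because a pure ACK can be lost (piggybacked ACKs are left aside here). Once an ACK is lost, the sender has to retransmit the packet that was never acknowledged, and the key question is when. A protocol normally has one or more timers and retransmits when a timer expires. In my opinion that timer must not depend on the upper layer and has to live in the protocol itself, since retransmission is invisible to the upper layer.

Yet OpenVPN's reliable layer implements nothing at all to cope with a lost ACK. In the log above, the repeated line:

    Test证书/218.242.253.131:18014 ACK output sequence broken: 12 8

shows that the packet with ID 8 is never retransmitted. And from these lines:

    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 6
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=10 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 11 9 10 7 8
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 7
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 WRITE 114 to 218.242.253.131:18014: P_CONTROL_V1 kid=0 pid=11 DATA len=100
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 9 10 11 8
    2013-07-24 16:53:16 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 9
    2013-07-24 16:53:16 Test证书/218.242.253.131:18014 ACK output sequence broken: 12 10 11 8
    2013-07-24 16:53:17 MULTI: REAP range 240 - 256
    2013-07-24 16:53:17 GET INST BY REAL: 218.242.253.131:18014 succeeded
    2013-07-24 16:53:17 Test证书/218.242.253.131:18014 UDPv4 READ 22 from 218.242.253.131:18014: P_ACK_V1 kid=0 10

it is clear that the ACK for the packet with ID 8 was never received, which means it was lost. The packets sent afterwards keep filling the send window until it is full, while packet 8 has still not been retransmitted and acknowledged by the peer, and that produces "ACK output sequence broken". Reading the code confirms it: 12-8=4, and 4 is exactly the length of the send window. The "ACK output sequence broken" state then lasts a long time without any retransmission, until:
a. the ping-restart timer expires, if the tunnel is already established
b. the TLS handshake fails, if the tunnel is still being established

The correct behaviour would be to retransmit as soon as the window is detected to be full. TCP learns of a lost packet from three duplicate ACKs, while OpenVPN's reliable layer learns of a lost ACK from "ACK output sequence broken". That is a signal, and something should be done once it arrives!

3. The original plan

The plan is simple: in the block that prints "ACK output sequence broken", retransmit all unacknowledged packets. As an optimization, retransmitting only the packet with the lowest ID is enough. The reason is that we reach "ACK output sequence broken" because the window is full, and the window is full because the packet with the lowest ID has not been acknowledged.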
That packet holds its own slot, and the slots behind it, which may in fact already have been acknowledged, stay pinned as well. As soon as the lowest-ID packet is acknowledged the window opens up, so it suffices to retransmit only that packet and hope the peer acknowledges it again.

The plan is simple, but until it is turned into code it is worth nothing. Here are the attempts.

4. First attempt - failure

On July 25, after work, I could not sleep again, so I hid in my daughter's little room and started coding. First I checked that the peer also ACKs out-of-order or replayed packets; if it did not, the change would have to be a big one. I found the code in ssl.c, in tls_pre_decrypt:

    if (op != P_ACK_V1 && reliable_can_get (ks->rec_reliable))
      {
        packet_id_type id;

        /* Extract the packet ID from the packet */
        if (reliable_ack_read_packet_id (buf, &id))
          {
            /* Avoid deadlock by rejecting packet that would de-sequentialize receive buffer */
            if (reliable_wont_break_sequentiality (ks->rec_reliable, id))
              {
                if (reliable_not_replay (ks->rec_reliable, id))
                  {
                    /* Save incoming ciphertext packet to reliable buffer */
                    struct buffer *in = reliable_get_buf (ks->rec_reliable);
                    ASSERT (in);
                    ASSERT (buf_copy (in, buf));
                    reliable_mark_active_incoming (ks->rec_reliable, in, id, op);
                  }

                /* Author's note: look at the comment below. Even a replayed
                   packet is ACKed! My idea for coping with a lost ACK is exactly
                   to replay the packet whose ACK never arrives and wait for the
                   peer to ACK it. With random loss, the ACK for that packet
                   cannot keep getting lost forever. */
                /* Process outgoing acknowledgment for packet just received, even if it's a replay */
                reliable_ack_acknowledge_packet_id (ks->rec_ack, id);
              }
          }
      }

With that as a basis, I at least knew that not much of OpenVPN's reliable layer had to change. Next came finding where to change it, and naturally you fix it where it breaks. The stall sits at "ACK output sequence broken", so I looked for the place that prints it, in the function reliable_get_buf_output_sequenced:

    struct buffer *
    reliable_get_buf_output_sequenced (struct reliable *rel)
    {
      struct gc_arena gc = gc_new ();
      int i;
      packet_id_type min_id = 0;
      bool min_id_defined = false;
      struct buffer *ret = NULL;

      /* find minimum active packet_id */
      for (i = 0; i < rel->size; ++i)
        {
          const struct reliable_entry *e = &rel->array[i];
          if (e->active)
            {
              if (!min_id_defined || e->packet_id < min_id)
                {
                  min_id_defined = true;
                  min_id = e->packet_id;
                }
            }
        }

      /* Author's note: the reason the check below fails was already found in
         the log above:
           ... ACK output sequence broken: 12 8
         12-8=4, and #define TLS_RELIABLE_N_SEND_BUFFERS 4 */
      if (!min_id_defined || (int)(rel->packet_id - min_id) < rel->size)
        {
          ret = reliable_get_buf (rel);
        }
      else
        {
          dmsg (D_REL_LOW, "ACK output sequence broken: %s", reliable_print_ids (rel, &gc));
        }

      gc_free (&gc);
      return ret;
    }

So all that is needed is to retransmit the buf whose packet_id is min_id at the place where "broken" is printed:

    #ifdef RETRY
    struct buffer *
    reliable_get_buf_output_sequenced (struct reliable *rel, int *flag)
    #else
    struct buffer *
    reliable_get_buf_output_sequenced (struct reliable *rel)
    #endif
    {
      struct gc_arena gc = gc_new ();
      int i;
      packet_id_type min_id = 0;
      bool min_id_defined = false;
      struct buffer *ret = NULL;
    #ifdef RETRY
      struct buffer *retry_buff = NULL; /* not named replay_buffer! */
      *flag = 0;
    #endif

      /* find minimum active packet_id */
      for (i = 0; i < rel->size; ++i)
        {
          const struct reliable_entry *e = &rel->array[i];
          if (e->active)
            {
              if (!min_id_defined || e->packet_id < min_id)
                {
                  min_id_defined = true;
                  min_id = e->packet_id;
    #ifdef RETRY
                  /* retry_buff = e->buf; */
                  ret = &e->buf;   /* remember the lowest-ID buffer for the retry */
    #endif
                }
            }
        }

      /* Author's note: the reason the check below fails was already found in
         the log above:
           ... ACK output sequence broken: 12 8
         12-8=4, and #define TLS_RELIABLE_N_SEND_BUFFERS 4 */
      if (!min_id_defined || (int)(rel->packet_id - min_id) < rel->size)
        {
          ret = reliable_get_buf (rel);
        }
      else
        {
    #ifdef RETRY
          *flag = 1;   /* tell the caller that ret is a retry, not a fresh buffer */
    #endif
          dmsg (D_REL_LOW, "ACK output sequence broken: %s", reliable_print_ids (rel, &gc));
        }

      gc_free (&gc);
      return ret;
    }

Accordingly, the logic that calls this function has to change too, namely tls_process in ssl.c (the initial ks->state == S_INITIAL case is not shown here):

    if (ks->state >= S_START)
      {
        int status = -1;   /* declared outside the RETRY guard so both builds compile */
    #ifdef RETRY
        int retry = 0;
        buf = reliable_get_buf_output_sequenced (ks->send_reliable, &retry);
    #else
        buf = reliable_get_buf_output_sequenced (ks->send_reliable);
    #endif
        if (buf)
          {
    #ifdef RETRY
            if (!retry)
              {
    #endif
                status = key_state_read_ciphertext (multi, ks, buf, PAYLOAD_SIZE_DYNAMIC (&multi->opt.frame));
                if (status == -1)
                  {
                    msg (D_TLS_ERRORS, "TLS Error: Ciphertext -> reliable TCP/UDP transport read error");
                    goto error;
                  }
    #ifdef RETRY
              }
            else
              {
                status = 1;   /* retry path: the buffer already holds ciphertext */
              }
    #endif
            if (status == 1)
              {
                reliable_mark_active_outgoing (ks->send_reliable, buf, P_CONTROL_V1);
                INCR_GENERATED;
                state_change = true;
                dmsg (D_TLS_DEBUG, "Outgoing Ciphertext -> Reliable");
              }
          }
      }

A long stretch of tidy code, which is not my usual style. COOL! But when I ran it, an ASSERT failed. I had clearly resent the packet with the lowest ID, yet the program bailed out in write_control_auth at this line:

    ASSERT (session_id_write_prepend (&session->session_id, buf));

It turned out that buf was not the buffer I wanted to retransmit! OpenVPN is single-threaded and single-process, so nothing else could possibly have touched this buf!

5. Second attempt - success

Failure! The night has fallen silent; who is there to confide in?

But the problem was not that complicated, and cracking the case turned out to be simple.
The answer was to read the code. I eventually found the function reliable_schedule_now, and the key is its comment:

    /* schedule all pending packets for immediate retransmit */

Retransmit! Yes, retransmit! Since OpenVPN already has its own retransmission, the one I wrote was superfluous. So do it by the book and simply call this interface. Always use an existing interface, and never re-implement logic that already exists! The patch therefore becomes even simpler and touches only the function reliable_get_buf_output_sequenced:

    struct buffer *
    reliable_get_buf_output_sequenced (struct reliable *rel)
    {
      struct gc_arena gc = gc_new ();
      int i;
      packet_id_type min_id = 0;
      bool min_id_defined = false;
      struct buffer *ret = NULL;

      /* find minimum active packet_id */
      for (i = 0; i < rel->size; ++i)
        {
          const struct reliable_entry *e = &rel->array[i];
          if (e->active)
            {
              if (!min_id_defined || e->packet_id < min_id)
                {
                  min_id_defined = true;
                  min_id = e->packet_id;
                }
            }
        }

      if (!min_id_defined || (int)(rel->packet_id - min_id) < rel->size)
        {
          ret = reliable_get_buf (rel);
        }
      else
        {
    #ifdef RETRY
          /* window is full: ask the reliable layer to resend pending packets now */
          reliable_schedule_now (rel);
          /* the log message is changed while we are at it */
          dmsg (D_REL_LOW, "ACK output sequence broken: %s, retransmit immediately", reliable_print_ids (rel, &gc));
    #else
          dmsg (D_REL_LOW, "ACK output sequence broken: %s", reliable_print_ids (rel, &gc));
    #endif
        }

      gc_free (&gc);
      return ret;
    }
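To make the stall and the fix easier to see in isolation, here is a minimal sketch of a toy send window. It is not OpenVPN code: the `toy_*` names, the struct layout and the "peer ACKs every packet at once, except that the ACK for one chosen ID is lost" behaviour are all assumptions made for illustration. Only the window size of 4 and the `next_id - min_id < size` test are taken from the article.

```c
#include <assert.h>
#include <stdbool.h>

#define N_SEND_BUFFERS 4  /* same value as TLS_RELIABLE_N_SEND_BUFFERS above */

/* Hypothetical toy model of the sender-side window (not OpenVPN's struct). */
struct toy_window {
    unsigned next_id;              /* ID the next new packet will get */
    bool active[N_SEND_BUFFERS];   /* slot holds an unacknowledged packet */
    unsigned id[N_SEND_BUFFERS];   /* packet ID stored in each slot */
};

/* Find the lowest unacknowledged ID; false if nothing is outstanding. */
static bool toy_min_id(const struct toy_window *w, unsigned *min_id)
{
    bool found = false;
    for (int i = 0; i < N_SEND_BUFFERS; ++i) {
        if (w->active[i] && (!found || w->id[i] < *min_id)) {
            *min_id = w->id[i];
            found = true;
        }
    }
    return found;
}

/* Same test as in reliable_get_buf_output_sequenced: a new packet may only
   be sent while next_id - min_id is smaller than the window size. */
static bool toy_can_send(const struct toy_window *w)
{
    unsigned min_id;
    if (!toy_min_id(w, &min_id))
        return true;
    return (int)(w->next_id - min_id) < N_SEND_BUFFERS;
}

/* Put a new packet into a free slot; false means "sequence broken". */
static bool toy_send(struct toy_window *w)
{
    if (!toy_can_send(w))
        return false;
    for (int i = 0; i < N_SEND_BUFFERS; ++i) {
        if (!w->active[i]) {
            w->active[i] = true;
            w->id[i] = w->next_id++;
            return true;
        }
    }
    assert(false && "a sendable window always has a free slot");
    return false;
}

/* Release the slot that holds acked_id, if any. */
static void toy_ack(struct toy_window *w, unsigned acked_id)
{
    for (int i = 0; i < N_SEND_BUFFERS; ++i)
        if (w->active[i] && w->id[i] == acked_id)
            w->active[i] = false;
}

/* Make `rounds` send attempts. The ACK for lost_id is dropped once. With
   retransmit_on_stall set, a stalled sender resends its lowest outstanding
   ID and the simulated peer ACKs that replay. Returns packets sent. */
static unsigned toy_run(unsigned rounds, unsigned lost_id, bool retransmit_on_stall)
{
    struct toy_window w = { .next_id = 0 };
    unsigned sent = 0;
    for (unsigned r = 0; r < rounds; ++r) {
        if (toy_send(&w)) {
            unsigned just_sent = w.next_id - 1;
            ++sent;
            if (just_sent != lost_id)
                toy_ack(&w, just_sent);   /* peer ACKs immediately */
        } else if (retransmit_on_stall) {
            unsigned min_id;
            if (toy_min_id(&w, &min_id))
                toy_ack(&w, min_id);      /* the replayed packet is ACKed again */
        }
    }
    return sent;
}
```

In this model, `toy_run(20, 8, false)` sends IDs 0 to 11 and then stalls with next ID 12 and lowest outstanding ID 8, the same "12 8" pair as in the log, while `toy_run(20, 8, true)` spends one round on the retransmission and then keeps sending. The real patch relies on reliable_schedule_now rather than on anything like `toy_ack`, so the sketch only illustrates why reacting to the full window releases it.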