# K8s Troubleshooting: 63-Second Service Access Delay

## Background
After switching calico's networking to VXLAN mode in a K8s cluster, some services became unreachable from the nodes. More precisely, access was delayed, and the delay was stable at around 1m3s:

```shell
[root@node2 ~]# time curl -s http://10.96.91.255

real    1m3.136s
user    0m0.005s
sys     0m0.005s
```

## Root Cause Analysis
First, narrow down the scope. Testing several services in the environment showed that the delay only appears when the node calling the service and the backing pod are on different nodes; otherwise the request is normal. In other words, only cross-node access is affected. Accessing the pod IP behind the service directly also works fine, so the problem presumably lies in the service-to-pod translation step.
Next, confirm the base environment. The OS, K8s, calico, and the rest of the stack did not change at all; the only change was switching calico's network mode from BGP to VXLAN. That change, however, alters the request path for every service and pod in the cluster: all container traffic now gets encapsulated through the newly added vxlan.calico interface on each node. The problem did not exist before the mode change, reproduces every time after it, and disappears again after switching back to BGP, so it is very likely related to the new vxlan.calico interface.
First, look at the environment and trigger a service request:

```shell
# Test setup:  node2 (10.10.72.11) ——> node1 (10.10.72.10)
# Test method: on node2, curl service 10.96.91.255 ——> pod 166.166.166.168:8082 on node1

[root@node2 ~]# ip addr
10: vxlan.calico: mtu 1410 qdisc noqueue state UNKNOWN group default
    link/ether 66:2d:bf:44:a6:8b brd ff:ff:ff:ff:ff:ff
    inet 166.166.104.10/32 brd 166.166.104.10 scope global vxlan.calico
       valid_lft forever preferred_lft forever

[root@node1 ~]# ip addr
15: vxlan.calico: mtu 1410 qdisc noqueue state UNKNOWN group default
    link/ether 66:f9:37:c3:7e:94 brd ff:ff:ff:ff:ff:ff
    inet 166.166.166.175/32 brd 166.166.166.175 scope global vxlan.calico
       valid_lft forever preferred_lft forever

[root@node2 ~]# time curl http://10.96.91.255
```
Capture on node1's host NIC to check whether the encapsulated request arrives:

```shell
[root@node1 ~]# tcpdump -n -vv -i eth0 host 10.10.72.11 and udp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:19:42.730996 IP (tos 0x0, ttl 64, id 6470, offset 0, flags [none], proto UDP (17), length 110)
    10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39190, offset 0, flags [DF], proto TCP (6), length 60)
    166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xe556 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101892130 ecr 0,nop,wscale 7], length 0
07:19:43.733741 IP (tos 0x0, ttl 64, id 6804, offset 0, flags [none], proto UDP (17), length 110)
    10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39191, offset 0, flags [DF], proto TCP (6), length 60)
    166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xe16b (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101893133 ecr 0,nop,wscale 7], length 0
07:19:45.736729 IP (tos 0x0, ttl 64, id 7403, offset 0, flags [none], proto UDP (17), length 110)
    10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39192, offset 0, flags [DF], proto TCP (6), length 60)
    166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xd998 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101895136 ecr 0,nop,wscale 7], length 0
07:19:49.744801 IP (tos 0x0, ttl 64, id 9648, offset 0, flags [none], proto UDP (17), length 110)
    10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39193, offset 0, flags [DF], proto TCP (6), length 60)
    166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xc9f0 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101899144 ecr 0,nop,wscale 7], length 0
07:19:57.768735 IP (tos 0x0, ttl 64, id 12853, offset 0, flags [none], proto UDP (17), length 110)
    10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39194, offset 0, flags [DF], proto TCP (6), length 60)
    166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xaa98 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101907168 ecr 0,nop,wscale 7], length 0
07:20:05.087057 IP (tos 0x0, ttl 64, id 8479, offset 0, flags [none], proto UDP (17), length 164)
    10.10.72.10.34565 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3425, offset 0, flags [DF], proto UDP (17), length 114)
    166.166.166.168.57850 > 166.166.104.6.domain: [udp sum ok] 10121+ AAAA? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. (86)
07:20:05.087076 IP (tos 0x0, ttl 64, id 54475, offset 0, flags [none], proto UDP (17), length 164)
    10.10.72.10.51841 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3424, offset 0, flags [DF], proto UDP (17), length 114)
    166.166.166.168.57984 > 166.166.104.6.domain: [udp sum ok] 20020+ A? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. (86)
07:20:05.087671 IP (tos 0x0, ttl 64, id 13540, offset 0, flags [none], proto UDP (17), length 257)
    10.10.72.11.60395 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19190, offset 0, flags [DF], proto UDP (17), length 207)
    166.166.104.6.domain > 166.166.166.168.57850: [udp sum ok] 10121 NXDomain*- q: AAAA? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (179)
07:20:05.087702 IP (tos 0x0, ttl 64, id 13541, offset 0, flags [none], proto UDP (17), length 257)
    10.10.72.11.48571 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19191, offset 0, flags [DF], proto UDP (17), length 207)
    166.166.104.6.domain > 166.166.166.168.57984: [udp sum ok] 20020 NXDomain*- q: A? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (179)
07:20:05.088801 IP (tos 0x0, ttl 64, id 8480, offset 0, flags [none], proto UDP (17), length 152)
    10.10.72.10.55780 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3427, offset 0, flags [DF], proto UDP (17), length 102)
    166.166.166.168.56015 > 166.166.104.6.domain: [udp sum ok] 19167+ AAAA? influxdb-nginx-service.kube-system.svc.svc.cluster.local. (74)
07:20:05.089048 IP (tos 0x0, ttl 64, id 13542, offset 0, flags [none], proto UDP (17), length 245)
    10.10.72.11.50151 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19192, offset 0, flags [DF], proto UDP (17), length 195)
    166.166.104.6.domain > 166.166.166.168.56015: [udp sum ok] 19167 NXDomain*- q: AAAA? influxdb-nginx-service.kube-system.svc.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (167)
07:20:05.089212 IP (tos 0x0, ttl 64, id 8481, offset 0, flags [none], proto UDP (17), length 148)
    10.10.72.10.50272 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3430, offset 0, flags [DF], proto UDP (17), length 98)
    166.166.166.168.54926 > 166.166.104.6.domain: [udp sum ok] 40948+ A? influxdb-nginx-service.kube-system.svc.cluster.local. (70)
07:20:05.089403 IP (tos 0x0, ttl 64, id 13543, offset 0, flags [none], proto UDP (17), length 241)
    10.10.72.11.59882 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19193, offset 0, flags [DF], proto UDP (17), length 191)
    166.166.104.6.domain > 166.166.166.168.54926: [udp sum ok] 40948 NXDomain*- q: A? influxdb-nginx-service.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (163)
07:20:05.089524 IP (tos 0x0, ttl 64, id 8482, offset 0, flags [none], proto UDP (17), length 134)
    10.10.72.10.58964 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3431, offset 0, flags [DF], proto UDP (17), length 84)
    166.166.166.168.50263 > 166.166.104.6.domain: [udp sum ok] 18815+ A? influxdb-nginx-service.kube-system.svc. (56)
07:20:05.089681 IP (tos 0x0, ttl 64, id 13544, offset 0, flags [none], proto UDP (17), length 134)
    10.10.72.11.51874 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19194, offset 0, flags [DF], proto UDP (17), length 84)
    166.166.104.6.domain > 166.166.166.168.50263: [udp sum ok] 18815 ServFail- q: A? influxdb-nginx-service.kube-system.svc. 0/0/0 (56)
07:20:05.089706 IP (tos 0x0, ttl 64, id 8483, offset 0, flags [none], proto UDP (17), length 134)
    10.10.72.10.59891 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3433, offset 0, flags [DF], proto UDP (17), length 84)
    166.166.166.168.49202 > 166.166.104.6.domain: [udp sum ok] 58612+ AAAA? influxdb-nginx-service.kube-system.svc. (56)
07:20:05.089859 IP (tos 0x0, ttl 64, id 13545, offset 0, flags [none], proto UDP (17), length 134)
    10.10.72.11.44146 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19195, offset 0, flags [DF], proto UDP (17), length 84)
    166.166.104.6.domain > 166.166.166.168.49202: [udp sum ok] 58612 ServFail- q: AAAA? influxdb-nginx-service.kube-system.svc. 0/0/0 (56)
```
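As an aside, the stable 1m3s delay is itself a clue. The retransmitted SYNs in the capture arrive at +1 s, +2 s, +4 s, +8 s after the first one, which is the kernel's exponential SYN retransmission backoff; with the Linux default `net.ipv4.tcp_syn_retries = 6`, the total wait adds up to exactly 63 s. A quick back-of-the-envelope check:

```python
# SYN retransmission backoff: the initial RTO is 1 s and doubles on
# each retry; Linux defaults to net.ipv4.tcp_syn_retries = 6.
delays = [1 << i for i in range(6)]   # waits between attempts, in seconds
assert delays == [1, 2, 4, 8, 16, 32]
assert sum(delays) == 63              # matches the stable 1m3.136s from curl
```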
The capture shows one suspicious point: the first few packets (the retransmitted SYNs) are flagged `bad udp cksum 0xffff`, while the later packets, where requests do get through, are flagged `no cksum`.
Searching for this error message turns up a known bug; the detailed analyses are in [1]-[3], so I will only summarize here. The kernel has a defect in its VXLAN handling that prevents checksum offloading from completing correctly, and the defect only surfaces in a fairly narrow corner case.
Given that the VXLAN UDP header has been NATed, if:

- the VXLAN device has UDP checksums disabled (as the RFC recommends), and
- the VXLAN device has Tx checksum offloading enabled,

then an incorrect UDP checksum is generated.
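The two flags in the capture can be read against the UDP checksum rules from RFC 768 / RFC 1071. On the wire, 0x0000 means "checksum disabled" (legal for VXLAN over IPv4, hence tcpdump's `no cksum`), while any non-zero value, including 0xffff, must verify against the actual payload; the failing packets carry 0xffff where the correct value would have been 0x3af0, so the receiving kernel discards them. A minimal illustrative sketch (not the kernel's code):

```python
def ones_complement_sum(words):
    """RFC 1071: sum 16-bit words, folding carries back into the low 16 bits."""
    total = sum(words)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def udp_checksum(words):
    """One's complement of the one's-complement sum, per RFC 768."""
    csum = ~ones_complement_sum(words) & 0xFFFF
    # RFC 768: a *computed* checksum of zero is transmitted as 0xFFFF,
    # because 0x0000 on the wire is reserved to mean "no checksum".
    return 0xFFFF if csum == 0 else csum
```

So 0x0000 and 0xffff are deliberately distinct values, and a packet carrying 0xffff gets no "checksum disabled" leniency from the receiver.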
According to [1], Kubernetes fixed this problem in v1.18.5, but I hit it on v1.21.0, so it is unclear whether upgrading K8s alone would solve it, or whether additional configuration is needed after the upgrade.
According to [3] and [4], calico changed this in v3.20.0: on kernels older than v5.7, it now disables offloading on the vxlan.calico interface as well.
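Paraphrasing that gate as code (the function name is mine, and the real detection lives inside felix):

```python
def vxlan_offload_should_be_disabled(kernel_release: str) -> bool:
    """Rough paraphrase of the calico v3.20.0 behaviour: disable checksum
    offload on vxlan.calico when the kernel is older than 5.7.
    Note: the bug was also fixed in some older stable kernels (5.4.41,
    4.19.123, ...), which this simple version check deliberately ignores."""
    major, minor = (int(p) for p in kernel_release.split(".")[:2])
    return (major, minor) < (5, 7)
```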
Temporarily disable offloading locally and re-test:

```shell
[root@node2 ~]# ethtool --offload vxlan.calico rx off tx off
Actual changes:
rx-checksumming: off
tx-checksumming: off
        tx-checksum-ip-generic: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp-ecn-segmentation: off [requested on]
        tx-tcp6-segmentation: off [requested on]
        tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]

[root@node2 ~]# time curl http://10.96.91.255

real    0m0.009s
user    0m0.002s
sys     0m0.007s
```
The request is back to normal.

## Solution

Temporary fix:

```shell
ethtool --offload vxlan.calico rx off tx off
```

Permanent fix: upgrade calico to >= v3.20.0, or upgrade the kernel to 5.6.13, 5.4.41, 4.19.123, or 4.14.181 (or later). Whether upgrading only K8s to >= v1.18.5 is sufficient remains to be confirmed.

## References

1. https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue
2. https://cloudnative.to/blog/kubernetes-1-17-vxlan-63s-delay/
3. https://bowser1704.github.io/posts/vxlan-bug/
4. https://github.com/projectcalico/calico/issues/3145
5. https://github.com/projectcalico/felix/pull/2811/files