Container Network Packet Drop in AKS

Zain Malik

During a recent system outage, our Azure Kubernetes Service (AKS) clusters experienced a peculiar issue. Specifically, some containers suffered packet drops, causing network connectivity problems.

Our AKS clusters run containerized workloads and are managed with Cluster API (CAPZ). Each node pool is a Virtual Machine Scale Set (VMSS), which we manage indirectly through the AKS layer.

During the outage, certain workloads on a specific node were affected. Initially, we resolved the issue by cordoning off the node and migrating the workloads. However, the problem recurred shortly after. The primary symptom was a significant increase in network packet drops.

For container network drops, we rely on the metrics exposed by cAdvisor (the same counters can be cross-checked directly on the node, as sketched after the list):

  • container_network_receive_packets_dropped_total
  • container_network_transmit_packets_dropped_total
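
cAdvisor reads these from the kernel's per-interface counters inside each container's network namespace. As a quick cross-check on the node itself, the host-side interfaces (including the veth ends) expose the same kind of counters:

# Read the kernel's per-interface drop counters in the host network namespace.
for dev in /sys/class/net/*; do
  printf '%s rx_dropped=%s tx_dropped=%s\n' "$(basename "$dev")" \
    "$(cat "$dev/statistics/rx_dropped")" "$(cat "$dev/statistics/tx_dropped")"
done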

Investigation

After identifying the packet drop issue, we started investigating whether network throttling was occurring at the VM level and engaged Azure Support for a thorough examination.

Another symptom common to all problematic nodes was a VM Freeze event, observed in the node status conditions. According to the Azure documentation, a VM Freeze can occur for a variety of reasons and is described as follows:

The Virtual Machine is scheduled to pause for a few seconds. 
CPU and network connectivity may be suspended, but there's no 
impact on memory or open files.
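
For reference, these freeze events showed up as node status conditions on our nodes; they can be inspected with kubectl and jq (the node name is a placeholder):

kubectl get node <node-name> -o json | jq '.status.conditions'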

But we have no further visibility into the internals of an Azure VM Freeze event. The preliminary findings from Azure Support indicated no anomalies with the VM and suggested reviewing any changes in workload behavior. In parallel, we ran iPerf tests and captured tcpdump data on our end to dig deeper into the nature of the packet drops and to better understand the network performance problems we were facing.
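
The exact parameters aren't central to the story; for illustration (the target address and interface are placeholders, and the real invocations differ per test), the tests were along these lines:

iperf3 -c <target-node-ip> -t 30 -P 4        # throughput test against another node
tcpdump -i eth0 -w /tmp/packet-drop.pcap     # capture traffic for offline analysis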

Root Cause Analysis

An intriguing observation on the affected node concerned CPU utilization: one core was pegged at 100%, while the remaining cores showed significantly lower utilization.

This second metric comes from node_exporter (the same per-core skew can also be confirmed directly on the node, as shown below):

  • node_cpu_seconds_total{} by (cpu)
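
On the node itself, the same per-core skew is visible without any Prometheus tooling, for example with mpstat from the sysstat package:

mpstat -P ALL 1 3     # per-CPU utilization, 1-second intervals, 3 samples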

[Figure: packet-drop]

perf results

The next thing I did was run perf on the node to see what was causing that CPU to spike:

perf record -C 5 -a -g -F 99 -- sleep 60
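
The samples are written to perf.data in the current directory and can then be browsed with perf report:

perf report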

The results show that CPU 5 is being consumed by the ksoftirqd/5 process:

...
  Children      Self  Command          Shared Object 
+   99.40%     0.00%  ksoftirqd/5      [unknown]     
+   99.34%     0.00%  ksoftirqd/5      [unknown]     
+   99.18%     0.00%  ksoftirqd/5      [unknown]     
+   99.12%     0.00%  ksoftirqd/5      [unknown]     
...

ksoftirqd

ksoftirqd led me to inspect the softirqs, and the first step was to look at how interrupts were distributed on the node:

cat /proc/interrupts

     CPU0  CPU1  CPU2  CPU3       CPU4       CPU5  CPU6  CPU7
  4:   0     0   538     0          0          0     0     0 
  8:   0     0     0     0          0          0     0     0 
  9:   0     0     0     0          0          0     0     0 
 24:   0     0     0   586          0   82868004     0     0 
 25: 728     0     0     0  869041985          0     0     0 
 26:   0   864     0     0          0  813776462     0     0 
 27:   0     0  1439     0  838852829          0     1     0 
 28:   0     0     0  1545          0  781818909     0     1 
 29:   1     0     0     0 1234309153          0     0     0 
 30:   0     1     0     0          0 1262389002     0     0 
 31:   0     0     1     0  853755079          0  1172     0 
 32:   0     0     0     1          0  812015919     0  1417 

We can clearly see that CPU 4 and CPU 5 are handling way more interrupts than the other CPUs.

smp_affinity

The next thing I did was check the smp_affinity of these interrupts:

for i in {24..32} ; do cat /proc/irq/$i/smp_affinity; done
20
10
20
10
20
10
20
10
20

The values 20 and 10 are hexadecimal CPU bitmasks: 0x20 sets bit 5 (CPU 5) and 0x10 sets bit 4 (CPU 4). In other words, all of these IRQs are pinned to CPU 4 and CPU 5, which explains both the CPU spike on those two cores and the packet drops in the containers running on the node.
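
Since the mask is a plain per-CPU bitmask, it can be decoded with shell arithmetic. A minimal sketch for this 8-core node (assuming the mask is a single hex word; the value is just an example):

mask=20                      # hex value as read from /proc/irq/<n>/smp_affinity
bits=$((16#$mask))           # 0x20 -> binary 00100000
for cpu in {0..7}; do
  (( (bits >> cpu) & 1 )) && echo "IRQ allowed on CPU $cpu"
done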

To understand better, we need to remember how interrupts are handled in Linux.

ksoftirqd

In this scenario, ksoftirqd/5 sitting at 100% CPU means CPU 5 is fully occupied handling softirqs. Once that core is saturated, it cannot keep up with any further interrupt work, and network packets start to be discarded. A single core saturated with interrupt handling is the bottleneck that manifests as network packet drops.
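
This can be confirmed on the node via /proc/softirqs; on an affected node we would expect the NET_RX counters for CPU 4 and CPU 5 to grow much faster than the rest:

grep -E 'CPU|NET_RX|NET_TX' /proc/softirqs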

To double-check whether this is simply how Azure VMs are configured, I checked the smp_affinity of the interrupts on another node in the same VMSS that had not had a VM Freeze event yet.

for i in {24..32} ; do cat /proc/irq/$i/smp_affinity; done
40
80
10
80
04
20
08
01
02

IRQBalance

We can see that the interrupts are balanced across all the CPUs. So what is wrong with our node? Why is it not balanced?

Let's check the irqbalance service status:

service irqbalance status
ā— irqbalance.service - irqbalance daemon
     Loaded: loaded 
     Active: active (running) 
       Docs: man:irqbalance(1)

irqbalance is running. But we are definitely not seeing the interrupts distributed across all the CPUs.

systemctl try-restart irqbalance

And right after restarting irqbalance, the IRQs were balanced across all the CPUs again. The packet drops were gone, and CPU utilization was back to normal.

Automated Mitigation

Now that we knew what was happening around VM Freeze events and packet drops, and had a manual mitigation for the issue, it was time for an automated one.

The available metrics let us dig into the number of interrupts grouped by device or CPU, for example, but there is no metric that exposes the smp_affinity of the interrupts.

We already have a DaemonSet running on all the nodes, so we decided to leverage it to automate the mitigation. We extended this existing compute DaemonSet to do the following (a simplified sketch of the detection logic follows the list):

  • emit metrics about the smp_affinity of the interrupts
  • if the smp_affinity is not balanced, label the node
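
As an illustration only (the metric name, label, and thresholds below are made up for this sketch, not our exact implementation), the detection logic boils down to something like this, running with access to the host's /proc and permission to label the node:

# Count how many distinct smp_affinity masks the IRQs are spread across.
NODE_NAME="${NODE_NAME:?set via the Downward API}"
distinct=$(cat /proc/irq/*/smp_affinity 2>/dev/null | sort -u | wc -l)
total=$(ls -d /proc/irq/[0-9]* | wc -l)
echo "irq_smp_affinity_distinct_masks ${distinct}"   # stand-in for the real metric emission
# Heuristic: many IRQs squeezed onto only one or two masks means they are not balanced.
if [ "$distinct" -le 2 ] && [ "$total" -gt 8 ]; then
  kubectl label node "$NODE_NAME" irq-affinity=unbalanced --overwrite   # hypothetical label
fi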

Now we not only have metrics about the smp_affinity, giving us observability into the issue, but we are also labelling the affected nodes, setting the stage for an automated mitigation.

The mitigation was simple at this point.

  • A new DaemonSet with a nodeSelector configured to select the problematic nodes via that label
  • Run nsenter -t 1 -m -n -i systemctl try-restart irqbalance in the containers of the new DaemonSet (a sketch of the entrypoint follows the list)
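
A minimal sketch of what that container's entrypoint could look like (it assumes a privileged pod sharing the host PID namespace; the final sleep simply keeps the one-shot pod alive):

nsenter -t 1 -m -n -i systemctl try-restart irqbalance   # run in the host's namespaces via PID 1
sleep infinity                                           # keep the DaemonSet pod from restart-looping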

As soon as a VM Freeze event left one of our nodes in this problematic state, we were now able to mitigate the issue automatically.

This is only a temporary mitigation, but it is as far as we can go as users of a managed AKS cluster.

Azure is still investigating this issue, and we are waiting for a permanent fix. For most users, running with reduced interrupt-handling capacity may not be a problem at all: you will only notice this issue if you push a node to substantial network usage.

For this post I used an 8-core VM as the example, but this issue can happen on any VM size; we observed it on 16-, 32-, and 64-core VMs. The bigger the node, the bigger the impact, because a larger share of the node's interrupt-handling capacity goes unused when the IRQs are pinned to just a couple of cores.

Conclusion

This was a very interesting case and a great learning experience, especially as a reminder of where we stand as users of a managed Kubernetes cluster.

  • Running a managed Kubernetes cluster is not a set-and-forget affair. It's imperative to understand the cluster's internals and to be able to debug issues at the cluster level.
  • Managed-service support is great, but it's not a replacement for knowing the internals of the system you are running.

I'm 100% confident that Azure engineering will solve this problem, and we will not have to deal with this issue anymore. But until that happens, we have to be prepared to dig into the internals of the system and mitigate the issue ourselves.

[update] 10/10/2023: Azure engineering has identified the issue and is rolling out a fix in the next few weeks. The root cause was a bug in irqbalance.

The bug has been fixed upstream in irqbalance 1.9.0+.

  • Bug introduction: the issue was introduced by patch e9e2811.
  • Resolution: a subsequent fix was provided in 2a66a66.