
AWS: EKS running out of IPs, increase pod density

Calico and CNI Custom networking with prefix assignment mode

by Kliment Andreev

When I was doing a proof of concept with EKS, I noticed that there is a limit on how many pods a node can run and on how the EKS CNI assigns subnet IPs. A node can run only a fixed maximum number of pods, depending on the instance type, and that limit is very low; e.g. a t3.large instance with 2 CPUs and 8GB RAM can run only 35 pods. On top of that, the pod IPs are taken from the existing pool of subnet IPs, so if you have a /24 subnet for the EKS cluster, you'll run out of IPs in no time. Here is an example of what I did and the options for solving this issue.
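The 35-pod figure isn't arbitrary. With the default AWS CNI, max pods = number of ENIs x (IPv4 addresses per ENI - 1) + 2. Here is a quick sketch that pulls the two inputs from the EC2 API; the formula is the standard AWS one.

# Max pods = ENIs x (IPv4 addresses per ENI - 1) + 2
# (one IP per ENI is its primary; +2 accounts for host-networked pods)
aws ec2 describe-instance-types --instance-types t3.large \
    --query 'InstanceTypes[0].NetworkInfo.{ENIs:MaximumNetworkInterfaces,IPsPerENI:Ipv4AddressesPerInterface}'
# t3.large reports 3 ENIs and 12 IPs per ENI: 3 x (12 - 1) + 2 = 35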

Option 1 – Calico 3rd party CNI

I'll provision an EKS cluster in a /24 VPC with 3 public and 3 private subnets.

eksctl create cluster --name eksECIC --region us-east-2 --instance-types t3.large \
    --managed --vpc-cidr 192.168.100.0/24 --node-private-networking --version 1.24 --without-nodegroup

I am using the Ohio region, which has 3 availability zones. The public and private subnets (6 in total) are each a /27, which is 32 addresses; not all of them are usable, of course, since AWS reserves 5 addresses in every subnet. Let's create a node group with 3 nodes. Because t3.large is a Nitro instance, we can request a max pods per node far above the default (200 here), although with the stock CNI the nodes will run out of IPs long before reaching it.

eksctl create nodegroup --cluster eksECIC --name old-nodegroup --nodes 3 \
    --node-type t3.large --node-private-networking --managed --max-pods-per-node 200

At this point, the console shows 31 available IPs in the private subnets.
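You don't have to click through the console for this; a sketch that lists the free IPs per subnet (the vpcId path comes from aws eks describe-cluster):

# Get the cluster VPC, then list free IPs per subnet
vpc_id=$(aws eks describe-cluster --name eksECIC --query cluster.resourcesVpcConfig.vpcId --output text)
aws ec2 describe-subnets --filters "Name=vpc-id,Values=$vpc_id" \
    --query 'Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' --output table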

Let's provision a small deployment with 10 replicas and a network load balancer.

kubectl apply -f demo.yaml

…and this is the deployment file demo.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 10
  selector:
    matchLabels:
      run: demo
  template:
    metadata:
      labels:
        run: demo
    spec:
      containers:
      - name: demo
        image: klimenta/serverip
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: loadbalancer
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  ports:
    - port: 80
      targetPort: 3000
      protocol: TCP
  type: LoadBalancer
  selector:
    run: demo

If you look at the available addresses now, there are only 7. So we went from 31 to 7 with just 10 pods, because the CNI also pre-allocates a warm pool of IPs on each ENI. Check this guy's blog to see what's going on.
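You can see the warm pool directly; a sketch, with $sub_id standing in for one of the private subnet IDs:

# Count every private IP currently held by ENIs in one private subnet.
# The total is far higher than the number of pods scheduled there,
# because aws-node keeps a warm ENI's worth of IPs attached by default.
aws ec2 describe-network-interfaces --filters "Name=subnet-id,Values=$sub_id" \
    --query 'length(NetworkInterfaces[].PrivateIpAddresses[])'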

Let's scale the deployment up to 100 pods.

kubectl scale --replicas=100 deployment/demo

After a minute or two, check the running pods.

kubectl get pods --field-selector=status.phase=Running | wc -l
69

The console will show 0 available IPs in the subnets.

If you check the pods with kubectl get pods, you'll see that the remaining pods are stuck in ContainerCreating status.
And if you describe one of them, you'll see this message.

kubectl describe pod demo-698f6fc958-z8smb
....
d9b65ccfeccb7e9e7433ec61fef78faf83d1bae": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  2m29s (x17 over 5m59s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "edfa987c3eddf415068691f914d740fa9ca9ca6ffdfb8db44d1340d913ece0b0": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Get the load balancer service.

kubectl get svc
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP                                                                     PORT(S)        AGE
kubernetes     ClusterIP      10.100.0.1      <none>                                                                          443/TCP        37m
loadbalancer   LoadBalancer   10.100.25.241   a105789cd3a264ac596c367fce463640-80d2b88acdc17771.elb.us-east-2.amazonaws.com   80:32396/TCP   17m

If you go to the a105789cd3a264…amazonaws.com URL, you'll see the load balancer hitting different pods. Wait 2-3 minutes if the page can't be opened; it takes time for DNS to propagate.
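A quick way to watch it rotate from the shell, assuming the klimenta/serverip demo image replies with the serving pod's IP (its name suggests so):

# Grab the NLB hostname and hit it a few times
LB=$(kubectl get svc loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
for i in $(seq 1 5); do curl -s "http://$LB"; echo; done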

Let’s delete the deployment and the old node group.

kubectl delete -f demo.yaml
eksctl delete nodegroup --name old-nodegroup --cluster eksECIC

At this point, we still have the EKS cluster, the etcd database, networking, etc.; we just don't have any nodes. Now it's time to remove the EKS CNI, install Calico, and add nodes that will use Calico.
The installation is very simple. Remove the AWS CNI.

kubectl delete daemonset -n kube-system aws-node

Check the pods in all namespaces.

kubectl get pods --all-namespaces
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   coredns-5c5677bc78-2k88b   0/1     Pending   0          4m
kube-system   coredns-5c5677bc78-2zz2n   0/1     Pending   0          4m

You will see coredns in Pending state; with no nodes and no CNI, it can't be scheduled anywhere.
Deploy Calico.

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml

…keep going…

kubectl create -f - <<EOF
kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  kubernetesProvider: EKS
  cni:
    type: Calico
  calicoNetwork:
    bgp: Disabled
EOF

Deploy the new node group.

eksctl create nodegroup --cluster eksECIC --name new-nodegroup --nodes 3 \
    --node-type t3.large --node-private-networking --managed --max-pods-per-node 200
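
Once the node group is up, it's worth confirming that the operator has rolled Calico out before deploying anything; a quick check:

# The tigera operator reports component health via the tigerastatus resource
kubectl get tigerastatus
# All components should eventually show AVAILABLE=True, and the
# calico-system pods should be Running
kubectl get pods -n calico-system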

Let’s deploy the same demo deployment now.

kubectl apply -f demo.yaml

…scale it up to 601 pods.

kubectl scale --replicas=601 deployment/demo

Wait 2-3 mins and check the running ones.

kubectl get pods --field-selector=status.phase=Running | wc -l

There will be 585 pods running, 195 per node. Much, much better.
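To double-check the per-node spread, count the running pods per node; the node name is the 7th column of the wide output:

kubectl get pods -o wide --field-selector=status.phase=Running --no-headers \
    | awk '{print $7}' | sort | uniq -c

This time the remaining pods will be in Pending state. Describe one of them: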

kubectl describe pod demo-698f6fc958-smlbq
....
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  3m27s (x3 over 13m)  default-scheduler  0/3 nodes are available: 3 Too many pods. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

Option 2 – CNI Custom networking and prefix assignment mode

In this case, we'll keep the AWS CNI network plugin, but we'll use what is called CNI custom networking, meaning we'll attach another CIDR to the VPC and run the pods there. NOTE: The nodes will still run in the 192.168.100.0/24 subnets; it's just the pods that will run in a much bigger IP space. In addition, CNI custom networking alone won't improve the pod density on the nodes. For that we need prefix assignment mode, which lets a node run many more pods than usual by assigning whole /28 prefixes to its ENIs instead of individual IPs. See this and this link for more information, and this link for prefix assignment mode specifically. The solution described below is considerably more complicated than just using Calico.
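If you want to estimate the per-node ceiling up front, AWS publishes a max-pods-calculator.sh script in the awslabs/amazon-eks-ami repo; a sketch of using it (script path and flags as of the time of writing, so adjust the CNI version to match your cluster):

# Estimate max pods for t3.large with prefix delegation enabled
curl -sO https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh
./max-pods-calculator.sh --instance-type t3.large --cni-version 1.12.6 --cni-prefix-delegation-enabled
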
Let’s provision a cluster with the same VPC CIDR and no nodegroups.

eksctl create cluster --name eksECIC --region us-east-2 --instance-types t3.large \
    --managed --vpc-cidr 192.168.100.0/24 --node-private-networking --version 1.24 --without-nodegroup

Add a new CIDR and associate it with the VPC. I am using the console, but you can use the aws ec2 associate-vpc-cidr-block command. Go to the VPC where the EKS cluster resides, select Edit CIDRs from the Actions button, and then click the Add new IPv4 CIDR button.

I'll add the 100.64.0.0/16 CIDR and then create three /19 subnets, one per availability zone.

These 3 subnets are where the pods will run. I also want the pods to be able to initiate connections to the Internet, so I'll modify the route table and add the Internet Gateway. The three new 100.64/19 subnets share the same route table.

Just add a route sending 0.0.0.0/0 over the Internet Gateway.
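For reference, a CLI sketch of the same console steps; $vpc_id, $rtb_id, and $igw_id are placeholders for your VPC, the shared route table, and its Internet Gateway:

# Attach the secondary CIDR and carve out one /19 per AZ
aws ec2 associate-vpc-cidr-block --vpc-id $vpc_id --cidr-block 100.64.0.0/16
aws ec2 create-subnet --vpc-id $vpc_id --availability-zone us-east-2a --cidr-block 100.64.0.0/19
aws ec2 create-subnet --vpc-id $vpc_id --availability-zone us-east-2b --cidr-block 100.64.32.0/19
aws ec2 create-subnet --vpc-id $vpc_id --availability-zone us-east-2c --cidr-block 100.64.64.0/19
# Default route over the Internet Gateway
# (associate the new subnets with $rtb_id if it isn't the main route table)
aws ec2 create-route --route-table-id $rtb_id --destination-cidr-block 0.0.0.0/0 --gateway-id $igw_id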

Let’s configure the custom networking on the cluster.

kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

Get the EKS cluster security group.

sec_grp=$(aws eks describe-cluster --name eksECIC --query cluster.resourcesVpcConfig.clusterSecurityGroupId --output text)
echo $sec_grp

For each of the subnets, create an ENIConfig custom resource. First, define the three 100.64 subnet IDs as variables (yours will differ).

export sub1=subnet-0f3ec5d75caf1b981
export sub2=subnet-0fc39edc5a1ce91c0
export sub3=subnet-0bd6bdfd5543aa086
cat >us-east-2a.yaml <<EOF
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata: 
  name: us-east-2a
spec: 
  securityGroups: 
    - $sec_grp
  subnet: $sub1
EOF

2nd subnet.

cat >us-east-2b.yaml <<EOF
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata: 
  name: us-east-2b
spec: 
  securityGroups: 
    - $sec_grp
  subnet: $sub2
EOF

3rd subnet.

cat >us-east-2c.yaml <<EOF
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata: 
  name: us-east-2c
spec: 
  securityGroups: 
    - $sec_grp
  subnet: $sub3
EOF

NOTE: Make sure the metadata name in each ENIConfig matches the availability zone it covers, as I did here for us-east-2.
Deploy the custom resources.

kubectl apply -f us-east-2a.yaml
kubectl apply -f us-east-2b.yaml
kubectl apply -f us-east-2c.yaml

Confirm it looks good.

kubectl get ENIConfigs
NAME         AGE
us-east-2a   16s
us-east-2b   12s
us-east-2c   8s

Update your aws-node DaemonSet to automatically apply the ENIConfig for an Availability Zone to any new Amazon EC2 nodes created in your cluster.

kubectl set env daemonset aws-node -n kube-system ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

Enable prefix assignment mode. This will allow many more pods per node.

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

Another change.

kubectl set env ds aws-node -n kube-system WARM_PREFIX_TARGET=1

With the default setting, WARM_PREFIX_TARGET=1 keeps one additional complete /28 prefix allocated, even if the existing prefix is used by only one pod. If the ENI does not have enough space to assign another prefix, a new ENI is attached.
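Before provisioning the nodes, it's worth sanity-checking that all three settings landed on the DaemonSet:

# The env vars set above should all show up on aws-node
kubectl describe daemonset aws-node -n kube-system \
    | grep -E 'CUSTOM_NETWORK_CFG|ENI_CONFIG_LABEL_DEF|PREFIX'
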
Provision the node group.

eksctl create nodegroup --cluster eksECIC --name my-nodegroup --nodes 3 \
    --node-type t3.large --node-private-networking --managed --max-pods-per-node 200

Check the nodes; they are running in the original 192.168.100.0/24 VPC CIDR, as you can see from the IPs in the node names.

kubectl get nodes
NAME                                            STATUS   ROLES    AGE     VERSION
ip-192-168-100-122.us-east-2.compute.internal   Ready    <none>   7m39s   v1.24.13-eks-0a21954
ip-192-168-100-155.us-east-2.compute.internal   Ready    <none>   7m39s   v1.24.13-eks-0a21954
ip-192-168-100-167.us-east-2.compute.internal   Ready    <none>   7m37s   v1.24.13-eks-0a21954

Check the pods.

kubectl get pods -A -o wide
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE    IP                NODE                                            NOMINATED NODE   READINESS GATES
kube-system   aws-node-txwrd             1/1     Running   0          9m4s   192.168.100.167   ip-192-168-100-167.us-east-2.compute.internal   <none>           <none>
kube-system   aws-node-vdmv7             1/1     Running   0          9m6s   192.168.100.155   ip-192-168-100-155.us-east-2.compute.internal   <none>           <none>
kube-system   aws-node-xss7c             1/1     Running   0          9m6s   192.168.100.122   ip-192-168-100-122.us-east-2.compute.internal   <none>           <none>
kube-system   coredns-5c5677bc78-2lkcx   1/1     Running   0          80m    100.64.76.225     ip-192-168-100-122.us-east-2.compute.internal   <none>           <none>
kube-system   coredns-5c5677bc78-mj6wr   1/1     Running   0          80m    100.64.76.224     ip-192-168-100-122.us-east-2.compute.internal   <none>           <none>
kube-system   kube-proxy-lt7v4           1/1     Running   0          9m6s   192.168.100.122   ip-192-168-100-122.us-east-2.compute.internal   <none>           <none>
kube-system   kube-proxy-m8pf6           1/1     Running   0          9m4s   192.168.100.167   ip-192-168-100-167.us-east-2.compute.internal   <none>           <none>
kube-system   kube-proxy-nlvw7           1/1     Running   0          9m6s   192.168.100.155   ip-192-168-100-155.us-east-2.compute.internal   <none>           <none>

Look at the coredns pods: they are running in the new CIDR.
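You can also confirm that the ENIs in the new subnets carry whole /28 prefixes rather than individual secondary IPs; a sketch, reusing $sub1 from earlier:

# With prefix delegation, each pod ENI holds /28 prefixes from 100.64.0.0/16
aws ec2 describe-network-interfaces --filters "Name=subnet-id,Values=$sub1" \
    --query 'NetworkInterfaces[].Ipv4Prefixes[].Ipv4Prefix'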
Deploy the demo.

kubectl apply -f demo.yaml

Get the load balancer service.

kubectl get svc
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP                                                                     PORT(S)        AGE
kubernetes     ClusterIP      10.100.0.1       <none>                                                                          443/TCP        83m
loadbalancer   LoadBalancer   10.100.206.187   a7ad3ecf1d52e4fc58dbbdc550237009-ac4778e11af2c0ad.elb.us-east-2.amazonaws.com   80:30230/TCP   45s

If you go to the a7ad3ecf1d…amazonaws.com URL, you'll see the load balancer hitting different pods. Wait 2-3 minutes if the page can't be opened; it takes time for DNS to propagate.

The pods are running in the 100.64 subnets.
Scale up to 601 pods.

kubectl scale --replicas=601 deployment/demo

Wait 2-3 mins and check the running ones.

kubectl get pods --field-selector=status.phase=Running | wc -l

I got 593 pods running, slightly better than the 585 with Calico. The ones in Pending state show an event saying the nodes have no capacity left.

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m49s  default-scheduler  0/3 nodes are available: 3 Too many pods. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
