Cilium: Support for EFA drivers in EKS
☸️ Introduction
Elasticity is defined as “the ability to acquire resources as you need them and release resources when you no longer need them” — this is one of the biggest selling points of the cloud. On Amazon EC2, Elastic Fabric Adapters (EFAs) are network devices that accelerate high-performance computing (HPC) and machine learning workloads. An EFA is an Elastic Network Adapter (ENA) with additional OS-bypass capabilities: a specialized network interface that lets customers run applications requiring high levels of inter-instance communication, such as HPC applications, at scale on AWS.
Typical EFA use cases include weather modelling, semiconductor design, live sports streaming, oil and gas simulations, genomics, finance, and engineering, amongst others.
🎯 Goals & Objectives
In this article you will learn how to install Cilium on an EKS cluster with the EFA adapter. The tests below were run on g5.12xlarge instances, with Cilium installed to verify that it supports EFA-enabled nodes. NCCL tests were then run to evaluate network performance using the NVIDIA Collective Communications Library.
What’s the difference between ENI, EFA, and ENA in AWS?
- Elastic Network Interface (ENI): A virtual network interface for basic networking needs, including secondary IP addresses and security group attachment.
- Elastic Fabric Adapter (EFA): Designed for high-performance computing (HPC) and machine learning applications, reducing latency for inter-instance communication.
- Elastic Network Adapter (ENA): AWS’s standard for high-bandwidth networking, supporting enhanced networking features for optimized throughput.
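To see how these capabilities surface in the EC2 API, you can query the network info for an instance type. A minimal sketch (the instance type is just an example; the field names are taken from the `describe-instance-types` output as I understand it):

```bash
# Show ENA/EFA capabilities for an instance type (example: g5.12xlarge).
aws ec2 describe-instance-types --instance-types g5.12xlarge \
  --query "InstanceTypes[0].NetworkInfo.{EnaSupport:EnaSupport,EfaSupported:EfaSupported,MaxNetworkCards:MaximumNetworkCards}" \
  --output table
```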
Why the EFA adapter?
- Due to EFA’s support for libfabric APIs, applications using a supported MPI library can be easily migrated to AWS without having to make any changes to their existing code.
- AWS EFA is often used in conjunction with cluster placement groups, which place instances physically closer together within an AZ to decrease latency even further.
- EFA’s custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling such applications.
- With EFA, high-performance computing (HPC) applications using the message-passing interface (MPI) and machine learning (ML) applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of CPUs or GPUs.
- As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS cloud.
- EFA is an optional EC2 networking feature that can be enabled at no additional cost on any supported EC2 instance.
Supported interfaces and libraries
EFAs support the following interfaces and libraries:
- Open MPI 4.1 and later
- Intel MPI 2019 Update 5 and later
- NVIDIA Collective Communications Library (NCCL) 2.4.2 and later
- AWS Neuron SDK version 2.3 and later
EFA Limitations
EFAs have the following limitations:
- All P4d and P5 instance types support NVIDIA GPUDirect Remote Direct Memory Access (RDMA).
- EFA traffic between P4d/P4de/DL1 instances and other instance types is currently not supported.
- Instance types that support multiple network cards can be configured with one EFA per network card. All other supported instance types support only one EFA per instance.
- For c7g.16xlarge, m7g.16xlarge, and r7g.16xlarge, Dedicated Instances and Dedicated Hosts are not supported when an EFA is attached.
- EFA OS-bypass traffic can’t cross Availability Zones, VPCs, or AWS accounts. In other words, EFA OS-bypass traffic can’t flow from one Availability Zone, VPC (with or without a VPC peering connection), or AWS account to another. This does not apply to normal IP traffic from the EFA.
- EFA requires all nodes to be in a single subnet. eksctl reports this during cluster creation:

```
2024-09-12 22:35:07 [ℹ]  EFA requires all nodes be in a single subnet, arbitrarily choosing one: [subnet-0f4785c61fbdec025]
```

- EFA OS-bypass traffic can’t be sent across subnets in a Local Zone.
- EFA OS-bypass traffic is not routable. Normal IP traffic from the EFA remains routable.
- The EFA must be a member of a security group that allows all inbound and outbound traffic to and from the security group itself (see the sketch after this list).
- EFA is not supported on Windows instances.
- EFA is not supported on AWS Outposts.
- Note: the tests below were done with multi-GPU instances in mind.
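As a sketch of the self-referencing security group requirement above (the group name and VPC ID are placeholders, not values from this walkthrough; when you create the nodegroup with `efaEnabled: true` below, eksctl should set this up for you):

```bash
# Hypothetical VPC ID; replace with your own.
SG_ID=$(aws ec2 create-security-group --group-name efa-sg \
  --description "EFA self-referencing security group" \
  --vpc-id vpc-0123456789abcdef0 --query GroupId --output text)

# Allow all inbound and outbound traffic to/from the security group itself.
aws ec2 authorize-security-group-ingress --group-id ${SG_ID} \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=${SG_ID}}]"
aws ec2 authorize-security-group-egress --group-id ${SG_ID} \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=${SG_ID}}]"
```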
Let’s get going
- Ensure that the requisite instances are available in your region. In this case, g5.12xlarge was chosen.

```bash
aws ec2 describe-instance-types --region us-east-1 --filters Name=network-info.efa-supported,Values=true --query "InstanceTypes[*].[InstanceType]" --output text | sort
```

- Sample cluster config file (eks-config_efa.yaml):
```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: efa-cluster
  region: us-west-2
  version: "1.29"
iam:
  withOIDC: true
availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]
managedNodeGroups:
  - name: my-efa-ng
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    instanceType: g5.12xlarge
    minSize: 1
    desiredCapacity: 2
    maxSize: 2
    availabilityZones: ["us-west-2b"]
    volumeSize: 300
    privateNetworking: true
    efaEnabled: true
    # taint nodes so that application pods are
    # not scheduled/executed until Cilium is deployed.
    # Alternatively, see the note below.
    taints:
      - key: "node.cilium.io/agent-not-ready"
        value: "true"
        effect: "NoExecute"
```
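The taint above keeps application pods off the nodes until Cilium is running. The alternative mentioned in the comment is to restart, after installing Cilium, any pods that were started beforehand and are therefore not managed by it; a one-liner along the lines of the Cilium docs:

```bash
# Restart all non-host-networked pods so they get picked up by Cilium.
kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork --no-headers=true \
  | grep '<none>' | awk '{print "-n "$1" "$2}' \
  | xargs -L 1 -r kubectl delete pod
```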
- Create an EKS cluster using the sample config file:

```bash
eksctl create cluster -f ./eks-config_efa.yaml
```

- Notice that the EFA device plugin is automatically installed:
```
2024-09-09 15:46:43 [ℹ]  created "kube-system:DaemonSet.apps/aws-efa-k8s-device-plugin-daemonset"
2024-09-09 15:46:44 [ℹ]  as you have enabled EFA, the EFA device plugin was automatically installed.
2024-09-09 15:46:44 [▶]  completed task: install EFA device plugin
```

- Patch the aws-node DaemonSet. In ENI mode, Cilium manages ENIs instead of the VPC CNI, so the aws-node DaemonSet has to be patched to prevent conflicting behavior:

```bash
kubectl -n kube-system patch daemonset aws-node --type='strategic' -p='{"spec":{"template":{"spec":{"nodeSelector":{"io.cilium/aws-node-enabled":"true"}}}}}'
```

- Install Cilium in ENI mode:

```bash
helm install cilium cilium/cilium --version 1.15.8 \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=eth0 \
  --set routingMode=native
```
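If you have the Cilium CLI installed on your workstation, a quick health check at this point is:

```bash
# Wait for Cilium to report a healthy status on all nodes.
cilium status --wait
```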
- Check that the nodes and pods are Ready and in a healthy state.

```
kubectl get nodes -o wide
NAME                                          STATUS   ROLES    AGE   VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-192-168-66-72.us-west-2.compute.internal   Ready    <none>   19m   v1.29.6-eks-1552ad0   192.168.66.72   <none>        Amazon Linux 2   5.10.223-212.873.amzn2.x86_64   containerd://1.7.11
ip-192-168-67-38.us-west-2.compute.internal   Ready    <none>   19m   v1.29.6-eks-1552ad0   192.168.67.38   <none>        Amazon Linux 2   5.10.223-212.873.amzn2.x86_64   containerd://1.7.11

kubectl get pods -o wide -A
NAMESPACE     NAME                                        READY   STATUS    RESTARTS   AGE   IP               NODE                                          NOMINATED NODE   READINESS GATES
kube-system   aws-efa-k8s-device-plugin-daemonset-jjlwb   1/1     Running   0          17m   192.168.67.38    ip-192-168-67-38.us-west-2.compute.internal   <none>           <none>
kube-system   aws-efa-k8s-device-plugin-daemonset-vr76k   1/1     Running   0          17m   192.168.66.72    ip-192-168-66-72.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-9kbgf                                1/1     Running   0          15m   192.168.67.38    ip-192-168-67-38.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-kbcxt                                1/1     Running   0          15m   192.168.66.72    ip-192-168-66-72.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-operator-c55d96f4c-6cr97             1/1     Running   0          15m   192.168.66.72    ip-192-168-66-72.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-operator-c55d96f4c-n5hb4             1/1     Running   0          15m   192.168.67.38    ip-192-168-67-38.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-5b8cc885bc-4vmt2                    1/1     Running   0          15m   192.168.89.100   ip-192-168-67-38.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-5b8cc885bc-kf7tx                    1/1     Running   0          15m   192.168.88.68    ip-192-168-66-72.us-west-2.compute.internal   <none>           <none>
kube-system   kube-proxy-lrxp2                            1/1     Running   0          17m   192.168.66.72    ip-192-168-66-72.us-west-2.compute.internal   <none>           <none>
kube-system   kube-proxy-qts58                            1/1     Running   0          18m   192.168.67.38    ip-192-168-67-38.us-west-2.compute.internal   <none>           <none>
```

- The nodegroup template contains a bootstrap script that installs the EFA drivers on each node. Confirm that the EFA driver components were installed successfully:
```
[root@ip-192-168-144-49 /]# fi_info -p efa
provider: efa
    fabric: efa
    domain: rdmap0s26-rdm
    version: 122.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap0s26-dgrm
    version: 122.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
```

The aws-efa-k8s-device-plugin-daemonset pod logs should be similar to:

```
kubectl-stern efa ds/aws-efa-k8s-device-plugin-daemonset -n kube-system
+ efa-aws-efa-k8s-device-plugin-4kc9h › aws-efa-k8s-device-plugin
+ efa-aws-efa-k8s-device-plugin-vp978 › aws-efa-k8s-device-plugin
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Fetching EFA devices.
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 device: rdmap0s26,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s26
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 EFA Device list: [{rdmap0s26 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap0s26}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Starting FS watcher.
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Starting OS watcher.
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 device: rdmap0s26,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s26
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Registered device plugin with Kubelet
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:51:24 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:51:24 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:53:24 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:53:24 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:55:48 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:55:48 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:58:18 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:58:18 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:59:59 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 12:59:59 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:01:04 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:01:04 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:06:19 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:06:19 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:08:58 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:08:58 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:15:50 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-4kc9h aws-efa-k8s-device-plugin 2024/09/16 13:15:50 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Fetching EFA devices.
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 device: rdmap0s26,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s26
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 EFA Device list: [{rdmap0s26 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap0s26}]
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Starting FS watcher.
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Starting OS watcher.
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 device: rdmap0s26,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s26
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:21:44 Registered device plugin with Kubelet
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:51:24 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 12:51:24 Checking if device:`rdmap0s26` exists
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 13:15:50 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s26],}]
efa-aws-efa-k8s-device-plugin-vp978 aws-efa-k8s-device-plugin 2024/09/16 13:15:50 Checking if device:`rdmap0s26` exists
```

- Cilium connectivity tests failed for one test, with the error `failed to obtain eni link list`; this would most likely be fixed by upgrading to Cilium 1.16.
```
cilium connectivity test
ℹ️  Monitor aggregation detected, will skip some flow validation steps
✨ [efa-cluster.us-west-2.eksctl.io] Creating namespace cilium-test for connectivity check...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying echo-same-node service...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying DNS test server configmap...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying same-node deployment...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying client deployment...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying client2 deployment...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying client3 deployment...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying echo-other-node service...
✨ [efa-cluster.us-west-2.eksctl.io] Deploying other-node deployment...
✨ [host-netns] Deploying efa-cluster.us-west-2.eksctl.io daemonset...
✨ [host-netns-non-cilium] Deploying efa-cluster.us-west-2.eksctl.io daemonset...
ℹ️  Skipping tests that require a node Without Cilium
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for deployment cilium-test/client to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for deployment cilium-test/client2 to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for deployment cilium-test/echo-same-node to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for deployment cilium-test/client3 to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for deployment cilium-test/echo-other-node to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client-d48766cfd-wst2v to reach DNS server on cilium-test/echo-same-node-6698bd45b-225ss pod...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client2-6b89df6c77-bzkgh to reach DNS server on cilium-test/echo-same-node-6698bd45b-225ss pod...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client3-7f986c467b-gn6k5 to reach DNS server on cilium-test/echo-same-node-6698bd45b-225ss pod...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client-d48766cfd-wst2v to reach DNS server on cilium-test/echo-other-node-5d67f9786b-fpc6b pod...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client2-6b89df6c77-bzkgh to reach DNS server on cilium-test/echo-other-node-5d67f9786b-fpc6b pod...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client3-7f986c467b-gn6k5 to reach DNS server on cilium-test/echo-other-node-5d67f9786b-fpc6b pod...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client-d48766cfd-wst2v to reach default/kubernetes service...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client2-6b89df6c77-bzkgh to reach default/kubernetes service...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for pod cilium-test/client3-7f986c467b-gn6k5 to reach default/kubernetes service...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for Service cilium-test/echo-other-node to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for Service cilium-test/echo-other-node to be synchronized by Cilium pod kube-system/cilium-9kbgf
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for Service cilium-test/echo-other-node to be synchronized by Cilium pod kube-system/cilium-kbcxt
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for Service cilium-test/echo-same-node to be synchronized by Cilium pod kube-system/cilium-9kbgf
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for Service cilium-test/echo-same-node to be synchronized by Cilium pod kube-system/cilium-kbcxt
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for NodePort 192.168.67.38:32275 (cilium-test/echo-other-node) to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for NodePort 192.168.67.38:30519 (cilium-test/echo-same-node) to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for NodePort 192.168.66.72:32275 (cilium-test/echo-other-node) to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for NodePort 192.168.66.72:30519 (cilium-test/echo-same-node) to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for DaemonSet cilium-test/host-netns-non-cilium to become ready...
⌛ [efa-cluster.us-west-2.eksctl.io] Waiting for DaemonSet cilium-test/host-netns to become ready...
ℹ️  Skipping IPCache check
🔭 Enabling Hubble telescope...
⚠️  Unable to contact Hubble Relay, disabling Hubble telescope and flow validation: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
ℹ️  Expose Relay locally with:
   cilium hubble enable
   cilium hubble port-forward&
ℹ️  Cilium version: 1.15.8
🏃[cilium-test] Running 91 tests ...
[=] [cilium-test] Test [no-unexpected-packet-drops] [1/91] ..
[=] [cilium-test] Test [no-policies] [2/91] .............................................
[=] [cilium-test] Skipping test [no-policies-from-outside] [3/91] (skipped by condition)
[=] [cilium-test] Test [no-policies-extra] [4/91] ............
[=] [cilium-test] Test [allow-all-except-world] [5/91] ........................
[=] [cilium-test] Test [client-ingress] [6/91] ......
[=] [cilium-test] Test [client-ingress-knp] [7/91] ......
[=] [cilium-test] Test [allow-all-with-metrics-check] [8/91] ......
[=] [cilium-test] Test [all-ingress-deny] [9/91] ............
[=] [cilium-test] Skipping test [all-ingress-deny-from-outside] [10/91] (skipped by condition)
[=] [cilium-test] Test [all-ingress-deny-knp] [11/91] ............
[=] [cilium-test] Test [all-egress-deny] [12/91] ........................
[=] [cilium-test] Test [all-egress-deny-knp] [13/91] ........................
[=] [cilium-test] Test [all-entities-deny] [14/91] ............
[=] [cilium-test] Test [cluster-entity] [15/91] ...
[=] [cilium-test] Skipping test [cluster-entity-multi-cluster] [16/91] (skipped by condition)
[=] [cilium-test] Test [host-entity-egress] [17/91] ......
[=] [cilium-test] Test [host-entity-ingress] [18/91] ....
[=] [cilium-test] Test [echo-ingress] [19/91] ......
[=] [cilium-test] Skipping test [echo-ingress-from-outside] [20/91] (skipped by condition)
[=] [cilium-test] Test [echo-ingress-knp] [21/91] ......
[=] [cilium-test] Test [client-ingress-icmp] [22/91] ......
[=] [cilium-test] Test [client-egress] [23/91] ......
[=] [cilium-test] Test [client-egress-knp] [24/91] ......
[=] [cilium-test] Test [client-egress-expression] [25/91] ......
[=] [cilium-test] Test [client-egress-expression-knp] [26/91] ......
[=] [cilium-test] Test [client-with-service-account-egress-to-echo] [27/91] ......
[=] [cilium-test] Test [client-egress-to-echo-service-account] [28/91] ......
[=] [cilium-test] Test [to-entities-world] [29/91] .........
[=] [cilium-test] Test [to-cidr-external] [30/91] ......
[=] [cilium-test] Test [to-cidr-external-knp] [31/91] ......
[=] [cilium-test] Skipping test [from-cidr-host-netns] [32/91] (skipped by condition)
[=] [cilium-test] Test [echo-ingress-from-other-client-deny] [33/91] ..........
[=] [cilium-test] Test [client-ingress-from-other-client-icmp-deny] [34/91] ............
[=] [cilium-test] Test [client-egress-to-echo-deny] [35/91] ............
[=] [cilium-test] Test [client-ingress-to-echo-named-port-deny] [36/91] ....
[=] [cilium-test] Test [client-egress-to-echo-expression-deny] [37/91] ....
[=] [cilium-test] Test [client-with-service-account-egress-to-echo-deny] [38/91] ....
[=] [cilium-test] Test [client-egress-to-echo-service-account-deny] [39/91] ..
[=] [cilium-test] Test [client-egress-to-cidr-deny] [40/91] ......
[=] [cilium-test] Test [client-egress-to-cidr-deny-default] [41/91] ......
[=] [cilium-test] Skipping test [clustermesh-endpointslice-sync] [42/91] (skipped by condition)
[=] [cilium-test] Test [health] [43/91] ..
[=] [cilium-test] Skipping test [north-south-loadbalancing] [44/91] (Feature node-without-cilium is disabled)
[=] [cilium-test] Test [pod-to-pod-encryption] [45/91] .
[=] [cilium-test] Skipping test [pod-to-pod-with-l7-policy-encryption] [46/91] (requires Feature encryption-pod mode wireguard, got disabled)
[=] [cilium-test] Test [node-to-node-encryption] [47/91] ...
[=] [cilium-test] Skipping test [egress-gateway-with-l7-policy] [50/91] (skipped by condition)
[=] [cilium-test] Skipping test [egress-gateway] [48/91] (skipped by condition)
[=] [cilium-test] Skipping test [egress-gateway-excluded-cidrs] [49/91] (Feature enable-ipv4-egress-gateway is disabled)
[=] [cilium-test] Skipping test [pod-to-node-cidrpolicy] [51/91] (Feature cidr-match-nodes is disabled)
[=] [cilium-test] Skipping test [north-south-loadbalancing-with-l7-policy] [52/91] (Feature node-without-cilium is disabled)
[=] [cilium-test] Test [echo-ingress-l7] [53/91] ..................
[=] [cilium-test] Test [echo-ingress-l7-named-port] [54/91] ..................
[=] [cilium-test] Test [client-egress-l7-method] [55/91] ..................
[=] [cilium-test] Test [client-egress-l7] [56/91] ...............
[=] [cilium-test] Test [client-egress-l7-named-port] [57/91] ...............
[=] [cilium-test] Skipping test [client-egress-l7-tls-deny-without-headers] [58/91] (Feature secret-backend-k8s is disabled)
[=] [cilium-test] Skipping test [client-egress-l7-tls-headers] [59/91] (Feature secret-backend-k8s is disabled)
[=] [cilium-test] Skipping test [client-egress-l7-set-header] [60/91] (Feature secret-backend-k8s is disabled)
[=] [cilium-test] Skipping test [echo-ingress-auth-always-fail] [61/91] (Feature mutual-auth-spiffe is disabled)
[=] [cilium-test] Skipping test [echo-ingress-mutual-auth-spiffe] [62/91] (Feature mutual-auth-spiffe is disabled)
[=] [cilium-test] Skipping test [pod-to-ingress-service] [63/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [pod-to-ingress-service-deny-all] [64/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [pod-to-ingress-service-deny-ingress-identity] [65/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [pod-to-ingress-service-deny-backend-service] [66/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [pod-to-ingress-service-allow-ingress-identity] [67/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [outside-to-ingress-service] [68/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [outside-to-ingress-service-deny-world-identity] [69/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [outside-to-ingress-service-deny-cidr] [70/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Skipping test [outside-to-ingress-service-deny-all-ingress] [71/91] (Feature ingress-controller is disabled)
[=] [cilium-test] Test [dns-only] [72/91] ...............
[=] [cilium-test] Test [to-fqdns] [73/91] ............
[=] [cilium-test] Skipping test [pod-to-controlplane-host] [74/91] (skipped by condition)
[=] [cilium-test] Skipping test [pod-to-k8s-on-controlplane] [75/91] (skipped by condition)
[=] [cilium-test] Skipping test [pod-to-controlplane-host-cidr] [76/91] (skipped by condition)
[=] [cilium-test] Skipping test [pod-to-k8s-on-controlplane-cidr] [77/91] (skipped by condition)
[=] [cilium-test] Skipping test [local-redirect-policy] [78/91] (Feature enable-local-redirect-policy is disabled)
[=] [cilium-test] Test [pod-to-pod-no-frag] [79/91] .
[=] [cilium-test] Skipping test [multicast-igmpv2-check] [85/91] (Feature multicast is disabled)
[=] [cilium-test] Skipping test [bgp-control-plane-v2] [81/91] (skipped by condition)
[=] [cilium-test] Skipping test [host-firewall-ingress] [82/91] (skipped by condition)
[=] [cilium-test] Skipping test [host-firewall-egress] [83/91] (skipped by condition)
[=] [cilium-test] Skipping test [cilium-dns-proxy-ha] [84/91] (Feature cilium-dnsproxy-ha is disabled)
[=] [cilium-test] Skipping test [bgp-control-plane-v1] [80/91] (skipped by condition)
[=] [cilium-test] Skipping test [enterprise-bgp-control-plane-v2] [88/91] (Feature enable-enterprise-bgp-control-plane is disabled)
[=] [cilium-test] Skipping test [multicast-igmpv3-check] [86/91] (Feature multicast is disabled)
[=] [cilium-test] Skipping test [multicast-connectivity] [87/91] (Feature multicast is disabled)
[=] [cilium-test] Skipping test [bfd-standalone] [89/91] (Feature enable-bfd is disabled)
[=] [cilium-test] Skipping test [bfd-bgp] [90/91] (Feature enable-bfd is disabled)
[=] [cilium-test] Test [check-log-errors] [91/91] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (config)]
[-] Scenario [check-log-errors/no-errors-in-logs]
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (cilium-agent)]
❌ Found 1 logs in efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (cilium-agent) matching list of errors that must be investigated:
time="2024-09-09T11:55:40Z" level=error msg="failed to obtain eni link list" error="interrupted system call" subsys=ipam (1 occurrences)
.
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (mount-cgroup)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (apply-sysctl-overwrites)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (mount-bpf-fs)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (clean-cilium-state)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (install-cni-binaries)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-kbcxt (cilium-agent)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-kbcxt (config)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-kbcxt (mount-cgroup)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-kbcxt (apply-sysctl-overwrites)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-kbcxt (mount-bpf-fs)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-kbcxt (clean-cilium-state)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-kbcxt (install-cni-binaries)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-operator-c55d96f4c-6cr97 (cilium-operator)] .
[.] Action [check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-operator-c55d96f4c-n5hb4 (cilium-operator)] .
📋 Test Report [cilium-test]
❌ 1/48 tests failed (1/467 actions), 43 tests skipped, 1 scenarios skipped:
Test [check-log-errors]:
  ❌ check-log-errors/no-errors-in-logs/efa-cluster.us-west-2.eksctl.io/kube-system/cilium-9kbgf (cilium-agent)
[cilium-test] 1 tests failed
```
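After investigating (or after upgrading Cilium), you can re-run just the failing test rather than the whole suite; the CLI supports filtering by test name:

```bash
cilium connectivity test --test check-log-errors
```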
Install NVIDIA device plugin
The NVIDIA Collective Communications Library (NCCL) is a library of standard collective communication routines for multiple GPUs across a single node or multiple nodes. NCCL can be used together with EFA, Libfabric, and MPI to support various machine learning workloads.
- Install the NVIDIA device plugin and make sure that all pods are up and running.
```bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
```

```
kubectl get pods -o wide -A
NAMESPACE     NAME                                        READY   STATUS    RESTARTS      AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
kube-system   aws-efa-k8s-device-plugin-daemonset-48nms   1/1     Running   0             25m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system   aws-efa-k8s-device-plugin-daemonset-r2p2g   1/1     Running   0             25m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-cvt6q                                1/1     Running   0             17m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-operator-c55d96f4c-hf7x7             1/1     Running   3 (15m ago)   17m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-operator-c55d96f4c-hjw9j             1/1     Running   3 (16m ago)   17m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system   cilium-pjm9z                                1/1     Running   0             17m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-5b8cc885bc-8ltgd                    1/1     Running   0             17m   192.168.90.131   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-5b8cc885bc-cn425                    1/1     Running   0             17m   192.168.94.57    ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system   kube-proxy-6fz6m                            1/1     Running   0             26m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system   kube-proxy-6ttdr                            1/1     Running   0             27m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system   nvidia-device-plugin-daemonset-26tb6        1/1     Running   0             22s   192.168.92.125   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system   nvidia-device-plugin-daemonset-mtnjv        1/1     Running   0             22s   192.168.71.71    ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
```
Install the EFA device plugin
- Add the helm repo and install the EFA device plugin.
```bash
helm repo add eks https://aws.github.io/eks-charts
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system
```

- Check the status of the daemonsets:
```
kubectl get ds -A
NAMESPACE     NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
kube-system   aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>                            34m
kube-system   aws-node                              0         0         0       0            0           io.cilium/aws-node-enabled=true   41m
kube-system   cilium                                2         2         2       2            2           kubernetes.io/os=linux            26m
kube-system   kube-proxy                            2         2         2       2            2           <none>                            41m
kube-system   nvidia-device-plugin-daemonset        2         2         2       2            2           <none>                            8m52s
```
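At this point both device plugins should be advertising extended resources on the nodes. A quick way to confirm (the custom-columns paths are assumptions derived from the resource names used later in the MPIJob manifest):

```bash
# List the allocatable GPU and EFA resources per node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa'
```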
Install MPI operator
- Create a clusterrole-mpi-operator.yaml file
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: mpi-operator
    app.kubernetes.io/component: mpijob
    app.kubernetes.io/name: mpi-operator
    kustomize.component: mpi-operator
  name: mpi-operator
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - services
  verbs:
  - create
  - list
  - watch
  - update
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - get
  - list
  - watch
  - delete
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - create
  - get
  - update
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - create
  - list
  - update
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - list
  - update
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
  - get
- apiGroups:
  - kubeflow.org
  resources:
  - mpijobs
  - mpijobs/finalizers
  - mpijobs/status
  verbs:
  - '*'
- apiGroups:
  - scheduling.incubator.k8s.io
  - scheduling.sigs.dev
  resources:
  - queues
  - podgroups
  verbs:
  - '*'
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - get
  - create
  - delete
  - update
- apiGroups:
  - scheduling.k8s.io
  resources:
  - priorityclasses
  verbs:
  - list
  - watch
  - get
  - create
  - delete
  - update
```
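Note that the apply output below shows the full operator (namespace, CRD, service account, deployment) being created, so the manifest applied here evidently bundles more than the ClusterRole above. If you are starting from scratch, the operator itself is typically installed from the upstream Kubeflow manifest; as a sketch (the version tag is an assumption, not from this article):

```bash
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
```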
- Deploy the mpi-operator:

```
kubectl apply -f clusterrole-mpi-operator.yaml
namespace/mpi-operator created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
serviceaccount/mpi-operator created
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
clusterrole.rbac.authorization.k8s.io/mpi-operator created
clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
deployment.apps/mpi-operator created
```

- Check the status of the pods:
```
kubectl get pods -A -o wide
NAMESPACE      NAME                                        READY   STATUS    RESTARTS      AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
kube-system    aws-efa-k8s-device-plugin-daemonset-48nms   1/1     Running   0             39m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system    aws-efa-k8s-device-plugin-daemonset-r2p2g   1/1     Running   0             39m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system    cilium-cvt6q                                1/1     Running   0             31m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system    cilium-operator-c55d96f4c-hf7x7             1/1     Running   3 (29m ago)   31m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system    cilium-operator-c55d96f4c-hjw9j             1/1     Running   3 (29m ago)   31m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system    cilium-pjm9z                                1/1     Running   0             31m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system    coredns-5b8cc885bc-8ltgd                    1/1     Running   0             30m   192.168.90.131   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system    coredns-5b8cc885bc-cn425                    1/1     Running   0             30m   192.168.94.57    ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system    kube-proxy-6fz6m                            1/1     Running   0             40m   192.168.74.213   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system    kube-proxy-6ttdr                            1/1     Running   0             41m   192.168.91.236   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
kube-system    nvidia-device-plugin-daemonset-26tb6        1/1     Running   0             13m   192.168.92.125   ip-192-168-74-213.us-west-2.compute.internal   <none>           <none>
kube-system    nvidia-device-plugin-daemonset-mtnjv        1/1     Running   0             13m   192.168.71.71    ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
mpi-operator   mpi-operator-7477b5bdbd-6ccdd               1/1     Running   0             42s   192.168.67.252   ip-192-168-91-236.us-west-2.compute.internal   <none>           <none>
```

- Add lease permissions for the mpi-operator cluster role:
```
kubectl apply -f ./clusterrole-mpi-operator.yaml
clusterrole.rbac.authorization.k8s.io/mpi-operator configured
```
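As a sanity check that the lease permissions took effect (the service account name and namespace are assumed from the apply output above):

```bash
# Verify the operator's service account can create leases.
kubectl auth can-i create leases.coordination.k8s.io \
  --as=system:serviceaccount:mpi-operator:mpi-operator
```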
Build container image to run tests on EKS
- Make sure you have Docker Desktop or Docker installed on the machine from which you are building the image.
- Clone the repo containing the nccl-tests Dockerfile (see the references below).
- To run the NCCL tests on EKS, you must build the container image and then push it to a container registry, such as the private ECR in your AWS account.
```bash
EFA_INSTALLER_VERSION=1.33.0
AWS_OFI_NCCL_VERSION=v1.9.2-aws
NCCL_VERSION=v2.21.5-1
NCCL_TESTS_VERSION=v2.13.9
ECR_REPOSITORY_NAME="nccl-tests"
TAG="${EFA_INSTALLER_VERSION}-${AWS_OFI_NCCL_VERSION}-${NCCL_VERSION}-${NCCL_TESTS_VERSION}"
aws ecr create-repository --repository-name ${ECR_REPOSITORY_NAME}
```

```
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:ap-northeast-2:#################:repository/nccl-tests",
        "registryId": "#################",
        "repositoryName": "nccl-tests",
        "repositoryUri": "#################.dkr.ecr.ap-northeast-2.amazonaws.com/nccl-tests",
        "createdAt": "2024-09-11T20:41:01.443000+05:30",
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        },
        "encryptionConfiguration": {
            "encryptionType": "AES256"
        }
    }
}
```

- Get the ECR repository URI:
```bash
REPO_URI=`aws ecr describe-repositories --query repositories[].[repositoryUri] | grep "/${ECR_REPOSITORY_NAME}" | tr -d '"' | xargs`
ECR_URI=${REPO_URI%"/${ECR_REPOSITORY_NAME}"}
```
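The `${REPO_URI%...}` expansion strips the repository name suffix, leaving just the registry host. A quick sanity check of the derived values:

```bash
echo "REPO_URI=${REPO_URI}"   # <account>.dkr.ecr.<region>.amazonaws.com/nccl-tests
echo "ECR_URI=${ECR_URI}"     # <account>.dkr.ecr.<region>.amazonaws.com
echo "TAG=${TAG}"             # 1.33.0-v1.9.2-aws-v2.21.5-1-v2.13.9
```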
- Build the container image:

```bash
docker image build -t ${REPO_URI}:${TAG} -f ./nccl-tests.Dockerfile .
```

- Once the build completes, you can check that the image is present with `docker images`. The build output should look similar to the one below.

Note: the build takes roughly 45 minutes.

```
docker image build -t ${REPO_URI}:${TAG} -f ./nccl-tests.Dockerfile .
[+] Building 2676.2s (20/20) FINISHED docker:default
=> [internal] load build definition from nccl-tests.Dockerfile 0.0s
=> => transferring dockerfile: 5.09kB 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:12.2.2-devel-ubuntu22.04 2.3s
=> [auth] nvidia/cuda:pull token for registry-1.docker.io 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> CACHED [ 1/15] FROM docker.io/nvidia/cuda:12.2.2-devel-ubuntu22.04@sha256:##################################################### 0.0s
=> [ 2/15] RUN apt-get update -y && apt-get upgrade -y 65.5s
=> [ 3/15] RUN apt-get remove -y --allow-change-held-packages ibverbs-utils libibverbs-dev libibverbs1 libmlx5-1 1.7s
=> [ 4/15] RUN rm -rf /opt/hpcx && rm -rf /usr/local/mpi && rm -f /etc/ld.so.conf.d/hpcx.conf && ldconfig 0.4s
=> [ 5/15] RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated apt-utils autoconf automake 80.9s
=> [ 6/15] RUN mkdir -p /var/run/sshd 0.3s
 => [ 7/15] RUN sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && echo "    UserKnownHostsFile /dev/  0.5s
 => [ 8/15] RUN curl https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py && python3 /tmp/get-pip.py && pip3 install aw  17.4s
 => [ 9/15] RUN git clone -b v2.4.1 https://github.com/NVIDIA/gdrcopy.git /tmp/gdrcopy && cd /tmp/gdrcopy && make prefix=/o  8.9s
 => [10/15] RUN cd $HOME && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.34.0.tar.gz && tar -xf $HOME/  320.3s
 => [11/15] RUN git clone -b v2.22.3-1 https://github.com/NVIDIA/nccl.git /opt/nccl && cd /opt/nccl && make -j $(nproc)  2090.2s
=> [12/15] RUN DEBIAN_FRONTEND=noninteractive apt-get install -y libhwloc-dev 5.4s
 => [13/15] RUN curl -OL https://github.com/aws/aws-ofi-nccl/releases/download/${AWS_OFI_NCCL_VERSION}/aws-ofi-nccl-${AWS_OFI_NCCL  11.3s
 => [14/15] RUN git clone -b v2.13.10 https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests && cd /opt/nccl-tests && ma  60.5s
=> [15/15] RUN rm -rf /var/lib/apt/lists/* 0.5s
=> exporting to image 9.9s
=> => exporting layers 9.9s
=> => writing image sha256:##################################################### 0.0s
=> => naming to #################.dkr.ecr.ap-northeast-2.amazonaws.com/nccl-tests:1.33.0-v1.9.2-aws-v2.21.5-1-v2.13.9 0.0s
```

- Log in to ECR and push the image:

```bash
aws ecr get-login-password | docker login --username AWS --password-stdin ${ECR_URI}
docker image push ${REPO_URI}:${TAG}
```

```
The push refers to repository [#################.dkr.ecr.ap-northeast-2.amazonaws.com/nccl-tests]
241a8f2bba5c: Pushed
f03edb2c7a13: Pushed
7dcd97951db5: Pushed
679e0c0bc4db: Pushed
b45de17a8545: Pushed
62620982ee29: Pushed
a6195e5acb78: Pushed
84c22d0d520d: Pushed
d87b84cc71be: Pushed
80c4373d8c6e: Pushed
7f19e87b6573: Pushed
b8006e828af3: Pushed
0f8e361ccf92: Pushed
401c0edfce5a: Pushed
679cc72df930: Pushed
216b978392cf: Pushed
598c4c0e7d71: Pushed
97da79745aa1: Pushed
4deb6be00699: Pushed
bd4791c0dfbe: Pushed
d3e5eae7a625: Pushed
adf2c0a553e9: Pushed
77b174470c94: Pushed
c735aeb4de50: Pushed
256d88da4185: Pushed
1.33.0-v1.9.2-aws-v2.21.5-1-v2.13.9: digest: sha256:##################################################### size: 5578
```
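You can confirm the image landed in ECR before moving on:

```bash
aws ecr list-images --repository-name ${ECR_REPOSITORY_NAME}
```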
Run NCCL tests
- Verify the number of GPUs on each node:

```bash
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```

```
Node                                          Available(GPUs)  Used(GPUs)
ip-192-168-144-49.us-west-2.compute.internal  4                0
ip-192-168-151-36.us-west-2.compute.internal  4                0
```

- Apply the MPIJob manifest (nccl-tests.yaml) to the cluster:
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: test-nccl
spec:
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 20
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - image: #################.dkr.ecr.ap-northeast-2.amazonaws.com/nccl-tests:1.33.0-v1.9.2-aws-v2.21.5-1-v2.13.9
            imagePullPolicy: IfNotPresent
            name: test-nccl-launcher
            env:
            - name: LD_LIBRARY_PATH
              value: /opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
            - name: PATH
              value: $PATH:/opt/amazon/efa/bin:/usr/bin
            command:
            - /opt/amazon/openmpi/bin/mpirun
            - --allow-run-as-root
            - --tag-output
            - -np
            - "1"
            - -N
            - "1"
            - --bind-to
            - none
            - -x
            - PATH
            - -x
            - LD_LIBRARY_PATH
            - -x
            - FI_PROVIDER=efa
            - -x
            # - FI_EFA_USE_DEVICE_RDMA=1
            # - -x
            - FI_EFA_FORK_SAFE=1
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_BUFFSIZE=8388608
            - -x
            - NCCL_P2P_NET_CHUNKSIZE=524288
            - --mca
            - pml
            - ^cm,ucx
            - --mca
            - btl
            - tcp,self
            - --mca
            - btl_tcp_if_exclude
            - lo,docker0,veth_def_agent
            - /opt/nccl-tests/build/all_reduce_perf
            - -b
            - "1"
            - -e
            - "16G"
            - -f
            - "2"
            - -g
            - "1"
            - -c
            - "1"
            - -n
            - "100"
    Worker:
      replicas: 2
      template:
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: "g5.12xlarge"
          containers:
          - image: #################.dkr.ecr.ap-northeast-2.amazonaws.com/nccl-tests:1.33.0-v1.9.2-aws-v2.21.5-1-v2.13.9
            imagePullPolicy: IfNotPresent
            name: test-nccl-worker
            volumeMounts:
            - name: shmem
              mountPath: /dev/shm
            resources:
              limits:
                nvidia.com/gpu: 1
                hugepages-2Mi: 5120Mi
                vpc.amazonaws.com/efa: 1
                memory: 8000Mi
              requests:
                nvidia.com/gpu: 1
                hugepages-2Mi: 5120Mi
                vpc.amazonaws.com/efa: 1
                memory: 8000Mi
          volumes:
          - name: shmem
            hostPath:
              path: /dev/shm
```

```bash
kubectl apply -f nccl-tests.yaml
```
Note: the nccl-tests.yaml file from the repo didn’t work as-is, so some changes were made to it; the manifest above reflects those changes.
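To follow the job’s progress, you can watch the launcher and worker pods; the `training.kubeflow.org/job-name` label is an assumption based on the mpi-operator v2beta1 API and may differ in your setup:

```bash
# Watch the MPIJob's pods come up (label assumed from mpi-operator v2beta1).
kubectl get pods -l training.kubeflow.org/job-name=test-nccl -w

# Inspect the MPIJob object itself.
kubectl get mpijob test-nccl
```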
- The following is an example excerpt from the logs of an NCCL all_reduce_perf test, executed on a cluster with two g5.12xlarge instances:
```
kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
Warning: Permanently added 'test-nccl-worker-1.test-nccl.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'test-nccl-worker-0.test-nccl.default.svc' (ED25519) to the list of known hosts.
[1,0]<stdout>:# nThread 1 nGpus 1 minBytes 1 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
[1,0]<stdout>:#
[1,0]<stdout>:# Using devices
[1,0]<stdout>:#  Rank  0 Group  0 Pid     21 on test-nccl-worker-0 device  0 [0x00] NVIDIA A10G
[1,0]<stdout>:test-nccl-worker-0:21:21 [0] NCCL INFO Bootstrap : Using eth0:192.168.148.137<0>
[1,0]<stdout>:test-nccl-worker-0:21:21 [0] NCCL INFO cudaDriverVersion 12040
[1,0]<stdout>:test-nccl-worker-0:21:21 [0] NCCL INFO NCCL version 2.22.3+cuda12.2
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.11.0-aws
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Using Libfabric version 1.22
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Using CUDA driver version 12040
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Creating one domain per process
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Selected Provider is efa (found 1 nics)
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Could not disable CUDA API usage for HMEM, disabling GDR
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Setting FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES not supported.
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Global registrations supported
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Using network AWS Libfabric
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO ncclCommInitRank comm 0x55c4c398fed0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 1e0 commId 0x957c4f4bad7eaafa - Init START
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NET/OFI Global registrations supported
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO comm 0x55c4c398fed0 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 00/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 01/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 02/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 03/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 04/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 05/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 06/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 07/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 08/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 09/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 10/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 11/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 12/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 13/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 14/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 15/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 16/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 17/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 18/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 19/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 20/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 21/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 22/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 23/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 24/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 25/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 26/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 27/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 28/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 29/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 30/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Channel 31/32 : 0
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608.
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO P2P Chunksize set to 131072
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO ncclCommInitRank comm 0x55c4c398fed0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 1e0 commId 0x957c4f4bad7eaafa - Init COMPLETE
[1,0]<stdout>:test-nccl-worker-0:21:28 [0] NCCL INFO Init timings: rank 0 nranks 1 total 0.17 (kernels 0.12, bootstrap 0.03, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.02, rest 0.00)
[1,0]<stdout>:#
[1,0]<stdout>:#                        out-of-place                       in-place
[1,0]<stdout>:#       size      count  type  redop  root   time  algbw  busbw  #wrong   time   algbw       busbw  #wrong
[1,0]<stdout>:#        (B) (elements)                      (us) (GB/s) (GB/s)           (us)  (GB/s)      (GB/s)
[1,0]<stdout>:           0          0  float  sum    -1    0.11   0.00   0.00       0   0.10    0.00        0.00       0
[1,0]<stdout>:           0          0  float  sum    -1    0.10   0.00   0.00       0   0.10    0.00        0.00       0
[1,0]<stdout>:           4          1  float  sum    -1    3.54   0.00   0.00       0   0.11    0.03        0.00       0
[1,0]<stdout>:           8          2  float  sum    -1    3.51   0.00   0.00       0   0.11    0.07        0.00       0
[1,0]<stdout>:          16          4  float  sum    -1    3.53   0.00   0.00       0   0.11    0.14        0.00       0
[1,0]<stdout>:          32          8  float  sum    -1    3.48   0.01   0.00       0   0.12    0.28        0.00       0
[1,0]<stdout>:          64         16  float  sum    -1    3.62   0.02   0.00       0   0.12    0.55        0.00       0
[1,0]<stdout>:         128         32  float  sum    -1    3.50   0.04   0.00       0   0.12    1.11        0.00       0
[1,0]<stdout>:         256         64  float  sum    -1    3.50   0.07   0.00       0   0.12    2.22        0.00       0
[1,0]<stdout>:         512        128  float  sum    -1    3.50   0.15   0.00       0   0.12    4.42        0.00       0
[1,0]<stdout>:        1024        256  float  sum    -1    3.49   0.29   0.00       0   0.11    9.01        0.00       0
[1,0]<stdout>:        2048        512  float  sum    -1    3.51   0.58   0.00       0   0.11   17.95        0.00       0
[1,0]<stdout>:        4096       1024  float  sum    -1    3.49   1.18   0.00       0   0.11   35.96        0.00       0
[1,0]<stdout>:        8192       2048  float  sum    -1    3.53   2.32   0.00       0   0.12   70.86        0.00       0
[1,0]<stdout>:       16384       4096  float  sum    -1    3.49   4.70   0.00       0   0.11  143.72        0.00       0
[1,0]<stdout>:       32768       8192  float  sum    -1    3.57   9.18   0.00       0   0.12  282.73        0.00       0
[1,0]<stdout>:       65536      16384  float  sum    -1    3.58  18.30   0.00       0   0.12  565.45        0.00       0
[1,0]<stdout>:      131072      32768  float  sum    -1    3.52  37.19   0.00       0   0.12  1129.93       0.00       0
[1,0]<stdout>:      262144      65536  float  sum    -1    3.58  73.13   0.00       0   0.12  2267.68       0.00       0
[1,0]<stdout>:      524288     131072  float  sum    -1    4.18  125.28  0.00       0   0.12  4534.97       0.00       0
[1,0]<stdout>:     1048576     262144  float  sum    -1    6.48  161.76  0.00       0   0.13  8236.40       0.00       0
[1,0]<stdout>:     2097152     524288  float  sum    -1   11.21  187.15  0.00       0   0.12  18092.93      0.00       0
[1,0]<stdout>:     4194304    1048576  float  sum    -1   21.11  198.71  0.00       0   0.12  36314.32      0.00       0
[1,0]<stdout>:     8388608    2097152  float  sum    -1   38.56  217.57  0.00       0   0.12  72817.78      0.00       0
[1,0]<stdout>:    16777216    4194304  float  sum    -1   72.82  230.40  0.00       0   0.12  145006.19     0.00       0
[1,0]<stdout>:    33554432    8388608  float  sum    -1   141.3  237.54  0.00       0   0.12  289262.34     0.00       0
[1,0]<stdout>:    67108864   16777216  float  sum    -1   278.1  241.31  0.00       0   0.12  580024.75     0.00       0
[1,0]<stdout>:   134217728   33554432  float  sum    -1   552.0  243.15  0.00       0   0.12  1157049.38    0.00       0
[1,0]<stdout>:   268435456   67108864  float  sum    -1  1099.5  244.14  0.00       0   0.12  2296086.36    0.00       0
[1,0]<stdout>:   536870912  134217728  float  sum    -1  2194.7  244.62  0.00       0   0.12  4600436.26    0.00       0
[1,0]<stdout>:  1073741824  268435456  float  sum    -1  4376.7  245.33  0.00       0   0.12  9240463.20    0.00       0
[1,0]<stdout>:  2147483648  536870912  float  sum    -1  8741.9  245.65  0.00       0   0.12  18401745.06   0.00       0
[1,0]<stdout>:test-nccl-worker-0:21:21 [0] NCCL INFO comm 0x55c4c398fed0 rank 0 nranks 1 cudaDev 0 busId 1e0 - Destroy COMPLETE
```
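One thing worth noting in this excerpt: the launcher ran with `-np 1`, so NCCL reports `nranks 1` and the busbw column stays at 0.00 (per the nccl-tests docs, the all_reduce bus bandwidth factor is 2(n−1)/n, which is zero for a single rank, so no inter-node traffic is exercised). A hypothetical tweak, not part of the article's run, to push the all_reduce across the EFA fabric between the two workers would be to launch one rank per worker:

```bash
# Hypothetical: two ranks, one per worker node, so traffic crosses EFA.
/opt/amazon/openmpi/bin/mpirun --allow-run-as-root --tag-output \
  -np 2 -N 1 --bind-to none \
  -x PATH -x LD_LIBRARY_PATH -x FI_PROVIDER=efa -x NCCL_DEBUG=INFO \
  /opt/nccl-tests/build/all_reduce_perf -b 1 -e 16G -f 2 -g 1 -c 1 -n 100
```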
References
- https://aws.amazon.com/hpc/efa/
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types
- https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html
- https://github.com/NVIDIA/nccl-tests
- https://github.com/aws-samples/aws-efa-eks
- https://github.com/aws-samples/aws-efa-eks/blob/main/examples/simple/nccl-efa-tests.yaml
- https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests
- https://aws.amazon.com/ec2/instance-types/g5/
- https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/nvidia-device-plugin/deploy.sh
- https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/efa-device-plugin/deploy.sh
- https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
Try out Cilium
- Try out Cilium and get first-hand experience of how it solves real problems and use cases in your cloud-native or on-prem environments related to networking, security, and observability.
🌟 Conclusion 🌟
Hopefully, this post gave you a good overview of how to install Cilium on an EKS cluster with the EFA adapter and run NCCL tests to evaluate the network’s performance using the NVIDIA Collective Communications Library.
Thank you for reading! 🙌🏻😁📃 See you in the next blog.
🚀 Feel free to connect/follow with me/on:
LinkedIn: linkedin.com/in/agamitgupta