Architecting Amazon EKS for high availability and resilience for running medium-sized workloads
Kubernetes is a powerful container orchestration platform that handles the complexities of deploying, scaling, and distributing microservice workloads in a containerized environment. Its key features provide resiliency, high availability, and efficiency for distributed workloads, and container orchestration through Kubernetes has become one of the most popular ways to run and manage medium and large microservice workloads. Many engineering teams across organizations have adopted Kubernetes (k8s) for running workloads over the last few years. According to the Cloud Native Computing Foundation, the number of individual contributors to the Kubernetes project rose from 2,727 in 2016 to over 76,035 in 2023, a 2,688% increase (Project Journey Report, 2023).
Kubernetes blends well with the microservices paradigm, provides vendor-agnostic hosting capabilities, and, most importantly, offers excellent stability and availability features. With the latest advances and improved documentation, you may face few roadblocks in moving to a Kubernetes platform. However, it will take some time for your cluster to mature and for you to understand the intricate needs of your workload.
Two immediate benefits you will notice are the ease of application deployment rollouts and the resiliency of applications. However, these advantages come with a steep learning curve and require need-specific configuration and tweaking of your cluster. Smaller clusters in particular need changes even after the initial capacity planning, especially when cost optimization is a goal. In this article, we discuss some of the resiliency- and high-availability-specific Amazon EKS (Elastic Kubernetes Service) configuration we arrived at after two years of experience and over the course of cluster maturation.
Distinct node types for improved capacity planning (and resiliency)
“All services are equal, but some are more equal than others”: Kubernetes efficiently manages diverse heavy and lightweight workloads on a single cluster. Services handling a few requests per day can run alongside services handling thousands of requests per second. Some workloads require copious amounts of memory; others do heavy number crunching and need a higher number of CPU cores. For example, some apps perform best when given a large amount of RAM and a higher core count rather than being scaled horizontally (creating more replicas). Within the self-imposed budget limitations of a mid-size cluster, capacity planning and optimized node/pod ratios (tight bin packing) are of the utmost importance.
To avoid over-provisioning the cluster in any dimension while still maintaining higher-spec nodes, consider three different node group types: compute-ng, memory-ng and infrastructure-ng. The nodes in compute-ng are higher spec’d and suitable for running pods that directly benefit from a higher CPU core count. infrastructure-ng runs the tools, add-ons, and Helm charts needed to operate the cluster efficiently, such as ingress-nginx, external-dns, or argo-cd. memory-ng is for apps with higher memory requirements. For example, a batch-loader backend runs on a memory-ng node group because the workload is not time-sensitive and needs enough memory capacity to handle enormous amounts of data.
While this may be an anti-pattern for larger clusters, where it is recommended to use tools like Karpenter to dynamically adjust the node pool across instance types, you should distinguish between higher-spec node groups and lower-spec node groups for less frequently used apps to manage cost while still maintaining performance for the apps that need it. You could use Karpenter or cluster-autoscaler to choose only among higher-spec instance types for a particular node group, but that adds a new layer of complexity and its own learning curve.
Distributed node groups for blue-green deployments
Notice the distribution of each node group type (i.e., compute-ng, memory-ng and infrastructure-ng) into two separate node groups. For example, compute-ng is created as a pair of compute-ng-1 and compute-ng-2 node groups. This allows us to run blue-green deployments of node groups by independently scaling up one member of the pair before changing the other. Using two node groups lets you perform updates manually during business hours with zero downtime. With EKS managed node groups, which use ASGs (Auto Scaling Groups) to create nodes, you get a good mix of nodes across the AWS (Amazon Web Services) Availability Zones (AZs) in the region.
I recommend attaching the same label to both node groups in a pair. For example, all the nodes in the compute-ng-1 and compute-ng-2 node groups carry the xxx-node-group-type label with the value compute-node. This makes it easier to schedule pods onto nodes via the label.
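A minimal eksctl-style sketch of one such pair, assuming eksctl-managed node groups; the cluster name, instance types, sizes, and region are illustrative:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster                 # placeholder cluster name
  region: us-east-1
managedNodeGroups:
  - name: compute-ng-1
    instanceType: c5.2xlarge
    minSize: 2
    maxSize: 6
    labels:
      xxx-node-group-type: compute-node   # same label on both halves of the pair
  - name: compute-ng-2
    instanceType: c5.2xlarge
    minSize: 2
    maxSize: 6
    labels:
      xxx-node-group-type: compute-node
```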
Node affinity and pod anti-affinity
One of the many excellent features of Kubernetes is its ability to control pod placement in the cluster. I recommend using both node affinity and pod anti-affinity to schedule pods onto specific nodes and to maintain high availability: if a node goes down, you do not want every replica of a service to have been running on that single node.
Node Affinity
As previously mentioned, you should distinguish between compute-intensive apps and non-compute-intensive apps. Node-affinity rules give you control to schedule apps on the appropriate node group (compute-ng/memory-ng/infrastructure-ng). So, every Deployment (as well as CronJob / StatefulSet) has a node-affinity rule in the pod template’s .spec.affinity.nodeAffinity field.
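A representative Deployment sketch, assuming the xxx-node-group-type label defined on the node groups earlier; the app name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                  # hypothetical app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      affinity:
        nodeAffinity:
          # "Hard" rule: only schedule onto nodes in the compute node groups.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: xxx-node-group-type
                    operator: In
                    values:
                      - compute-node
      containers:
        - name: orders-api
          image: example.com/orders-api:1.0.0   # placeholder image
```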
Take note of requiredDuringSchedulingIgnoredDuringExecution rather than preferredDuringSchedulingIgnoredDuringExecution. In this example, we run around 10 nodes with a high number of pod replicas, which gives us the confidence to mandate a “hard” scheduling requirement. A better approach for critical deployments, and for clusters that need high availability with a lower node count, is to combine “hard” and “soft” scheduling requirements.
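The modified rule might look like the following sketch, which widens the hard requirement to both node types and adds a weighted preference for compute nodes; the weight of 100 is illustrative:

```yaml
# Pod template affinity excerpt (illustrative): a hard rule that allows both
# node types, plus a soft preference for compute nodes.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: xxx-node-group-type
              operator: In
              values:
                - compute-node
                - memory-node
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: xxx-node-group-type
              operator: In
              values:
                - compute-node
```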
This will ensure Kubernetes prefers compute nodes but would schedule on a memory node if there are no compute nodes available.
TIP: Adjust the weight depending on the “criticality” of the deployment. A more important workload should be assigned a higher weight.
Pod anti-affinity rules
Pod anti-affinity rules allow you to control the spread of pods across nodes. Use this approach to schedule replicas across nodes and failure zones so that a single node refresh or an outage in one AZ does not impact all the replicas of a pod. This is particularly crucial for pods with a lower replica count, as they have a higher probability of having all replicas scheduled on the same node.
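A sample pod anti-affinity rule from one of your deployments might look like the following sketch; the orders-api app label is hypothetical, and the weights mirror the discussion that follows:

```yaml
# Pod template affinity excerpt (illustrative): spread replicas across AZs
# first (higher weight), then across individual nodes (lower weight).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - orders-api
      - weight: 50
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - orders-api
```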
The topologyKey parameter determines the node labels by which pods are spread, while still meeting the criteria specified in the matchExpressions rule. Kubernetes will “prefer” to schedule your pods onto nodes with different values of topology.kubernetes.io/zone. Since this label differs per AZ, Kubernetes will try to ensure that your pods are spread evenly across zones. You can use other well-known labels (the full list can be found here) to distribute your workload pods by different criteria. The topology.kubernetes.io/zone term with the higher weight ensures that pods are evenly distributed across different AZs (Availability Zones), whereas the kubernetes.io/hostname term with a lower weight (and hence lower priority) ensures that pods are distributed across nodes with different hostnames, whether those nodes are in the same AZ or in different AZs.
NOTE: Pod scheduling (especially “soft” scheduling) is a complicated process, and (anti-)affinity rules are just one of the many inputs Kubernetes considers. To schedule a pod, Kubernetes generates a weighted node score using priority functions like SelectorSpread, LeastRequested, NodeAffinity, and PodAffinity, and assigns the pod to the node with the highest score (for reference, the default weight is always 1).
PreStop hooks and deploying in batches of two for larger pod counts
Getting preStop hooks right was crucial for code releases and patches during business hours. The preStop hook is a container lifecycle hook that runs before a container is terminated (i.e., after the pod has been marked for termination and before the TERM signal is sent to the container itself). The pod stops receiving new requests as the preStop hook begins executing. This provides a way to gracefully handle any in-flight requests that the container is still processing before it is terminated.
Example of a 20 second sleep preStop hook:
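A minimal container-level sketch; the container name and image are placeholders:

```yaml
containers:
  - name: orders-api
    image: example.com/orders-api:1.0.0   # placeholder image
    lifecycle:
      preStop:
        exec:
          # Sleep 20 seconds so in-flight requests can drain before SIGTERM.
          command: ["/bin/sh", "-c", "sleep 20"]
```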
The pod stops receiving new requests and has 20 seconds to finish in-flight work before the webserver running in the container is sent SIGTERM.
TIP: If your preStop hook runs longer than 30 seconds, especially for long-running batch jobs, set a terminationGracePeriodSeconds value greater than the sleep time of the preStop hook. Otherwise, Kubernetes will kill the pod after the default 30-second grace period.
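For example, a long-running worker with a 60-second preStop sleep might pair it with a larger grace period; the batch-loader name, image, and values are illustrative:

```yaml
spec:
  # Must exceed the preStop sleep, or the kubelet kills the pod mid-drain.
  terminationGracePeriodSeconds: 90
  containers:
    - name: batch-loader
      image: example.com/batch-loader:1.0.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 60"]
```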
Since the preStop hook delays each pod’s replacement during a release, deployments with a high replica count should roll out in batches of two instead of one. This keeps the overall deployment time short.
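One way to express a batch size of two, assuming the standard Deployment RollingUpdate strategy; the exact surge and unavailability values here are an assumption:

```yaml
# Deployment strategy excerpt (illustrative): replace pods two at a time
# while never dropping below the desired replica count.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
```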
The figure below shows one such example of a patch release during business hours without a single request failure (status code 5xx) during or after the deployment. Note that all 100+ pods were refreshed in the process.
Protect critical workloads with Pod Disruption Budget (PDB) and priorityClass
A Kubernetes PDB (Pod Disruption Budget) defines the level of “voluntary” disruption an app can tolerate while maintaining baseline performance. Having a correct PDB for critical applications is crucial for carrying out node operations such as scale-downs and updates without service interruptions. You can define the budget as a percentage or an absolute number of pods, expressed as either minAvailable or maxUnavailable.
A sample PDB:
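This is a minimal sketch; the 75% threshold and the orders-api label are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api-pdb
spec:
  # Keep at least 75% of replicas running during voluntary disruptions
  # such as node drains and scale-downs.
  minAvailable: 75%
  selector:
    matchLabels:
      app: orders-api
```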
To handle unforeseen, “involuntary” pod evictions, critical apps should also be given a higher priorityClass. By assigning the correct priority class, you let Kubernetes determine which lower-priority pods should be evicted in the event of resource pressure on a node. You can define your own priorityClass if you host multiple applications with different priorities, or use the pre-configured system-cluster-critical/system-node-critical classes for your most important workloads.
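A minimal custom PriorityClass sketch; the name and value are illustrative, and it is attached to a workload via priorityClassName in the pod spec:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical          # hypothetical name
value: 1000000                     # higher value = higher scheduling priority
globalDefault: false
description: "For customer-facing workloads that must survive resource pressure."
```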
CAUTION: While it is tempting to assign a strict PDB and a high priorityClass to every workload, take care not to overdo it. An overly strict PDB can block node updates, leading to degraded or unpatched nodes over time. And if every workload has a high priority class, Kubernetes may, in a pinch, end up evicting helper/auxiliary pods, which can break your cluster.
NodelocalDNS
NodeLocalDNS is a DNS (Domain Name System) caching agent that improves the performance of DNS queries originating from a pod. In Kubernetes, pods in a cluster send DNS resolution queries to kube-dns. NodeLocalDNS sets up a DNS cache on every node, avoiding the extra hops and DNAT rule evaluations. In case of a cache miss, NodeLocalDNS queries kube-dns and returns the result.
Interestingly, the reason to prioritize NodeLocalDNS was not a quest to optimize DNS query performance, but rather a Linux kernel bug that caused random DNS queries to time out after 5 seconds. Even simple DNS resolutions, such as the lookup made when connecting to an Aurora PostgreSQL database, underperformed. And while this conntrack-race kernel bug had been identified and marked as resolved, with the fix rolled out in the kernel version our nodes were running, our team was still experiencing the issue (the original GitHub issue can be seen here).
Setting up NodeLocalDNS was easy; landing on the correct DNS config for our pods, however, was a process before we settled on a final configuration.
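That exact configuration is not reproduced here. Purely as an illustration of what a pod-level dnsConfig for this purpose can look like (the ndots value and the single-request-reopen option are assumptions, not the authors’ published settings):

```yaml
# Pod spec excerpt (illustrative only): trim search-domain expansion and
# avoid parallel A/AAAA lookups over the same socket, two common
# mitigations for the 5-second DNS timeout.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen
```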
NodeLocalDNS fixed the 5-second DNS query timeout bug, and as a bonus it also improved DNS query performance from milliseconds to microseconds. (If interested, we tried several approaches before finalizing the current configuration; you can check them out at 1, 2 and 3.)
The image above shows the 99.99th percentile response time for major networking calls in one of our clusters before and after NodeLocalDNS.
Liveness probes and Readiness probes with deep health checks
Liveness and readiness probes help manage unresponsive workloads in the cluster and ensure traffic is only sent to pods that can serve it. A liveness probe detects situations where a pod has become unresponsive or has entered a broken state: when the configured HTTP endpoint returns a failing response (anything outside the 2xx/3xx success range), the kubelet restarts the container. A readiness probe, on the other hand, indicates whether a pod is ready to receive traffic; for example, a pod that is alive but still loading a large dataset on startup should not receive requests until the processing is complete.
Combining Kubernetes probes with deeper health checks improves the resiliency of the underlying pod. A deep health check monitors not only the health of the service itself but also critical upstream dependencies such as database connectivity. The Kubernetes probe can hit your service’s health endpoint, which performs the deep check, for example verifying available memory and connectivity to the RDS database. (Note that deep health check endpoints must be built by application developers and should only check upstream dependencies that are critical for the pod to handle requests.)
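A container-level sketch of this setup; the /healthz and /readyz paths, port, and timings are assumptions, and the endpoints must be implemented by the application:

```yaml
containers:
  - name: orders-api
    image: example.com/orders-api:1.0.0   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz          # shallow check: process is up and responding
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz           # deep check: memory headroom, RDS connectivity
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 2
```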
Conclusion
High availability and resiliency are not merely a luxury, but a necessity. The orchestration power of Kubernetes allows us to create resilient and highly available applications that gracefully handle traffic fluctuations, load distribution, AZ failures, and business-hour deployments.
However, it is crucial to remember that this high availability and resiliency in Kubernetes is not a one-size-fits-all undertaking. A deep understanding of the requirements and customization of the cluster for your workload type is also extremely important.
This article presents some cluster, node, and application design ideas for running small and medium-sized workloads efficiently and cost-effectively in an EKS cluster. You do not need to compromise on resiliency or shy away from business-hour deployments to maintain high uptime, all while keeping cost in check. These ideas and tips can help you maximize the potential of your Kubernetes cluster.
Author: Rohit Raj (Lead Software Engineer)
Editor: Josh Gaddy (VP, Director, Developer Advocacy)
FactSet delivers data, analytics, and open technology in a digital platform to help the financial community see more, think bigger, and do their best work.