Autoscaling in Kubernetes

31 March 2022

Learn about autoscaling and its types in Kubernetes.

Allocating resources to pods running in a Kubernetes cluster is challenging. It raises questions such as how much CPU and RAM to allocate to each pod for good performance, and how many replicas are needed to handle the incoming load. To address this, Kubernetes provides a powerful feature called autoscaling.

In this hands-on lab, we will look at the different types of autoscaling and enable autoscaling of pods using metrics-server.

Lab Setup

You can start the lab setup by clicking the Lab Setup button on the right side of the screen. Please note that app-specific URLs are exposed specifically for the purposes of this hands-on lab.

Our lab has been set up with all the necessary tools: a base OS (Ubuntu) and developer tools such as Git, Vim, wget, and others.


Autoscaling is an important feature of a Kubernetes cluster: it increases or decreases the number of pods or nodes according to the demand on a service.

This improves the overall resource utilization of the cluster by automatically adjusting application resources and pod counts to the current load, which removes many manual tasks.

Autoscaling uses two kinds of mechanisms:

  • Pod-based scaling - Automates the scaling of pods through the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA).
  • Node-based scaling - Automates the scaling of cluster nodes through the Cluster Autoscaler (CA).

Horizontal Pod Autoscaler (HPA)

Horizontal Pod Autoscaler (HPA) automatically scales the number of pod replicas of workload resources such as Deployments and StatefulSets to match demand. Horizontal scaling means that when the load increases, HPA instructs the workload resource to deploy more pods (i.e. to scale up). Similarly, when the load decreases and the number of pods is above the configured minimum, HPA instructs the workload resource to scale down.

Figure 1: Horizontal Pod Autoscaling

Vertical Pod Autoscaler (VPA)

Vertical Pod Autoscaler (VPA) dynamically provisions compute resources (CPU and memory) for workload resources such as Deployments and StatefulSets. It analyzes metrics collected from these workloads and automatically updates their resource settings so that cluster resources are used efficiently.

Figure 2: Vertical Pod Autoscaling

Cluster Autoscaler (CA)

Cluster Autoscaler (CA) manages the size of the Kubernetes cluster by dynamically adding or removing nodes, based on node utilization metrics and on the number of pending pods that could not be scheduled due to a shortage of resources.

Figure 3: Cluster Autoscaling

So far, we have discussed the types of autoscaling in a Kubernetes cluster. In the next section, we will see how the Horizontal Pod Autoscaler works and on what basis it scales pods.

Working of Horizontal Pod Autoscaler

Horizontal Pod Autoscaling works with both stateful and stateless applications, but it does not apply to DaemonSets, which cannot be scaled. HPA is implemented as a Kubernetes API resource and a controller.

The resource determines the behavior of the controller, which runs inside the Kubernetes control plane and periodically adjusts the desired scale of the workload resource to match observed metrics such as average CPU utilization, average memory utilization, or any other custom metric.

In earlier Kubernetes versions, Heapster was used as the metrics collector, but it has since been deprecated and retired. It was replaced by the metrics APIs and metrics-server: metrics-server collects resource metrics (CPU and memory) from Kubernetes objects, while the custom metrics API can expose other metrics, such as the number of HTTP requests.

By default, HPA fetches metrics through the metrics APIs. The most commonly used is the resource metrics API, which is provided by metrics-server.

Figure 4: HPA working with Metrics Server

Kubernetes implements HPA as a control loop in the kube-controller-manager. The loop runs at an interval set by the --horizontal-pod-autoscaler-sync-period parameter, which is 15 seconds by default.

Once per period, the kube-controller-manager queries resource utilization against the metrics specified in the HPA definition. It finds the target resource through the scaleTargetRef field and obtains the metrics from either the resource metrics API or the custom metrics API.

The HPA definition sets a target value for the CPU or memory utilization of the workload. Metrics are collected from metrics-server: if utilization is above the target, HPA scales the pods up, and if it is below the target, HPA scales them down.

HPA calculates the desired number of replicas with the following formula:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

For example, if the current metric value is 400m and the desired value is 100m, the number of replicas will be quadrupled, since 400.0 / 100.0 == 4.0.
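
The replica calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the controller's actual code; the function name desired_replicas and the starting replica counts are assumptions made for the example.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, desired_metric: float) -> int:
    """desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    if desired_metric <= 0:
        raise ValueError("desired_metric must be positive")
    return math.ceil(current_replicas * current_metric / desired_metric)


# Values from the example above: 400m observed against a 100m target.
# The starting replica counts (1, 3 and 4) are hypothetical.
print(desired_replicas(1, 400, 100))  # 4: one replica becomes four
print(desired_replicas(3, 400, 100))  # 12: three replicas become twelve
print(desired_replicas(4, 50, 100))   # 2: load at half the target halves the count
```

Because the result is rounded up, a ratio only slightly above 1.0 can already add a pod when the replica count is large.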

When a targetAverageValue or targetAverageUtilization is specified, the currentMetricValue is calculated by taking the average of the given metric across all the pods in the HorizontalPodAutoscaler's scale target.
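
As a rough sketch of that averaging step, the snippet below takes per-pod readings, averages them, and feeds the result into the replica calculation. The readings, the target value, and the helper name average_metric are all hypothetical and only serve the illustration.

```python
import math
from statistics import mean


def average_metric(per_pod_values):
    """Average one metric across every pod in the scale target."""
    if not per_pod_values:
        raise ValueError("need at least one pod reading")
    return mean(per_pod_values)


# Hypothetical CPU readings in millicores for three pods of one deployment.
readings_millicores = [350, 450, 400]
current_metric_value = average_metric(readings_millicores)

# Hypothetical target average value of 100 millicores per pod.
target_average_value = 100
desired = math.ceil(len(readings_millicores) * current_metric_value / target_average_value)

print(f"average={current_metric_value}m desired replicas={desired}")
```

A single busy pod therefore does not trigger scaling on its own; what matters is the average over all pods of the target.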

In the next section, we will deploy a Python web app, expose it through a NodePort service, increase the load on it with Locust, and scale the pods through HPA.

Horizontal Pod Autoscaler LAB

Once the lab is triggered through the LAB SETUP button, a terminal and an IDE open with a Kubernetes cluster already running. You can verify this by running the kubectl get nodes command.

  • Deploy the front end of the application by creating a deployment and exposing it through a service.
kubectl apply -f frontend.yaml
Note: Remember to set the resources attributes (line 19) on the containers whose metrics you want to collect with metrics-server and scale with HPA.
  • Create the backend of the app by creating its deployment and exposing it through a service.
kubectl apply -f backend.yaml
kubectl get pods,svc
  • To access the application through a browser, deploy the ingress for it:
kubectl apply -f ingress.yaml
kubectl get ingress

Now, access the app through the app-port-80 URL under the LAB-URL section. You will see an rsvp app like the one shown in the image below.

Figure 5: rsvp app
  • To collect metrics, deploy metrics-server with the following command:
kubectl apply -f components.yaml

Check that the metrics-server pod is running in the kube-system namespace:

kubectl get pods -n kube-system
  • Check the resource utilization of the pods by running the following command:
kubectl top pod --namespace default
  • To increase the load on the app, install Locust through pip, along with Flask as a prerequisite for Locust:
sudo apt update
sudo apt install python3-pip -y
pip install flask
pip install locust
  • Create a locustfile for load testing, then start Locust with it:
locust -f <LOCUSTFILE> --host <APP_URL> --users 500 --spawn-rate 20 --web-port=8089

Here, replace <LOCUSTFILE> with the path to your locustfile and <APP_URL> with the rsvp app URL. Then open the Locust UI from the app-port-8089 URL under the lab URL section. You will see a Locust UI like the one shown in the image below.

Figure 6: Locust UI

Click the Start swarming button in the Locust UI to start generating load on the rsvp app. You will see output like the following:

Figure 7: Locust UI after starting load on rsvp app
  • To enable scaling of pods, create a Horizontal Pod Autoscaler for the rsvp deployment (frontend.yaml):

Here, scaleTargetRef (line 9) must specify the kind and name of the resource to which the HPA applies.
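
To make the shape of such an object concrete, here is a minimal Python sketch that builds an HPA manifest as a dictionary and prints it as JSON. It is not the lab's hpa.yaml: the object name rsvp-hpa, the deployment name rsvp, and the replica bounds of 1 and 5 are assumptions, and the autoscaling/v2 API version is assumed to be available on the cluster (Kubernetes 1.23 or later). Only the 20% CPU target comes from this lab.

```python
import json


def build_hpa(name, target_kind, target_name, min_replicas, max_replicas, cpu_percent):
    """Build a HorizontalPodAutoscaler manifest (autoscaling/v2) as a plain dict."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # scaleTargetRef names the workload resource the HPA acts on.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": target_kind,
                "name": target_name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_percent,
                        },
                    },
                }
            ],
        },
    }


# Hypothetical names and replica bounds; the 20% CPU target matches this lab.
hpa = build_hpa("rsvp-hpa", "Deployment", "rsvp", 1, 5, 20)
print(json.dumps(hpa, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be saved to a file and applied with kubectl apply -f, after adjusting the assumed names to match the actual deployment.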

NOTE: In the rsvp deployment, we specified CPU limits and requests, and this HPA object sets the target CPU utilization to 20%. As soon as CPU utilization rises above 20%, scaling will take place.
kubectl apply -f hpa.yaml
kubectl get hpa
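
CPU utilization for HPA is measured as a percentage of the CPU request set on the pod, which is why the deployment needs resource requests. The sketch below shows that arithmetic; the 100m request and the usage figures are hypothetical, and the function names are made up for the illustration.

```python
def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """Utilization is current usage expressed as a percentage of the CPU request."""
    if request_millicores <= 0:
        raise ValueError("a positive CPU request is needed to compute utilization")
    return 100 * usage_millicores / request_millicores


def above_target(usage_millicores: float, request_millicores: float, target_percent: float) -> bool:
    """True when utilization is over the HPA target, i.e. a scale-up is expected."""
    return cpu_utilization_percent(usage_millicores, request_millicores) > target_percent


# Hypothetical pod with a 100m CPU request, checked against the 20% target.
print(cpu_utilization_percent(35, 100), above_target(35, 100, 20))
print(cpu_utilization_percent(10, 100), above_target(10, 100, 20))
```

With a low target such as 20%, even a modest load from Locust is enough to push utilization over the target, which makes the scale-up easy to observe in the lab.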

As the load on the app increases, HPA will start scaling up the pods. You can watch this with the following commands:

kubectl top pod --namespace default
kubectl get hpa
kubectl get pods

What Next?

We have now seen how to scale an application on the basis of CPU metrics through HPA. In the next hands-on lab, we will scale the same application with KEDA (Kubernetes Event-driven Autoscaling).


In this hands-on lab, we learned about autoscaling in Kubernetes, how HPA works, and how to implement it.

About the Author

Oshi Gupta


DevOps Engineer & Technical Writer, CloudYuga

Oshi Gupta is a final-year undergraduate student currently working as an intern at CloudYuga. She works on Kubernetes and various cloud-native technologies. She has also been a student mentor for the Google Cloud Career Readiness program.