We all know that Cloud Native is the most effective way to manage web applications at large scale. With the help of both public and private clouds, we have largely addressed the issues that arise with these web applications, such as hardware availability, scalability, networking, storage, and multi-tenancy.
AI workloads have similar concerns, and if we align our efforts to make them first-class citizens of the Cloud Native ecosystem, they can be adopted at scale with minimal effort. It is not that we cannot deploy AI/ML workloads on a Cloud Native tech stack with Kubernetes today, but there is a lot of room for improvement.
To lay out the current state of things, the challenges, and the path forward, the Cloud Native Computing Foundation (CNCF) published a whitepaper, Cloud Native Artificial Intelligence, at KubeCon EU '24 a few months back. During our Kubernetes Club sessions in March and April, we went over the paper and discussed all of its aspects in great detail. Following are the recordings of those sessions.
Unfortunately, the last session was not recorded completely.
If I had to highlight the biggest challenge of running AI/ML workloads on Kubernetes right now, it would be scheduling: for example, how to schedule a set of related jobs that must run together to completion (gang scheduling), and how to get optimal utilisation of GPUs and other resources. Support for vGPUs, MIG, MPS, and Dynamic Resource Allocation (DRA) on Kubernetes is already making things better.
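To make the GPU utilisation point concrete, here is a minimal sketch of two Pods: one claiming a whole GPU, and one claiming only a MIG slice of it. This assumes a cluster with the NVIDIA device plugin (or GPU Operator) installed; the image name is a placeholder, and the exact MIG resource name (here nvidia.com/mig-1g.5gb) depends on how your GPUs are partitioned.

```yaml
# A Pod that claims one whole GPU via the NVIDIA device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: training-whole-gpu
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: your-registry/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                 # one full GPU, not shared
---
# The same workload on a MIG slice: a fraction of a GPU exposed as
# its own schedulable resource when MIG mode is enabled.
apiVersion: v1
kind: Pod
metadata:
  name: training-mig-slice
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: your-registry/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1          # name depends on MIG partitioning
```

With MIG, a single physical GPU is partitioned into isolated slices, so several small jobs can share hardware that would otherwise sit half idle, which is exactly the kind of utilisation problem the whitepaper calls out.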
NVIDIA's acquisition of Run:ai, which built a Kubernetes-based GPU orchestrator, highlights the importance of scheduling and the bright future of Cloud Native and AI.
If you are just starting out with AI/ML and Kubernetes, our ongoing book club sessions on Machine Learning on Kubernetes are a great place to start.