How to spin up Kubeflow v1.x and later on AWS EKS


Skip ahead to the middle of the article if you just need the commands to create an AWS EKS cluster and install Kubeflow v1.x or later.

MLOps has been trending upward for years, and companies are confronting the challenge of establishing pipelines and automation for the ML lifecycle in an efficient way. In this article, I’ll give an overview of MLOps and introduce Kubeflow, a useful tool that covers part of MLOps.

At first glance, the focal points seem to be model generation, inference deployment, and orchestration: how ML engineers and data engineers automate those elements of MLOps and create valuable services for users and for the applications that invoke the generated models.

MLOps applies to the entire lifecycle — from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.

MLOps – Wikipedia

If we focus only on the technical components of MLOps, it mainly consists of:

  1. Deployment and automation
  2. Reproducibility of models and predictions
  3. Diagnostics
  4. Scalability

OK, this might look vast. Some people define MLOps as everything engineers need to do apart from model and algorithm development and business-related work, so it still sounds vague. However, Kubeflow was developed to tackle most of the engineering tasks of MLOps in a Kubernetes environment.

It’s worth taking a look at the Kubeflow Overview to see why this toolkit fits MLOps development and operations. There are also cloud-based solutions that do similar work, such as Amazon SageMaker, an AWS-based machine learning platform that enables developers to build machine learning models, train them on data, and deploy inference endpoints in the AWS cloud.

Amazon SageMaker provides some unique services that neither other cloud-based solutions nor Kubeflow offer, such as Ground Truth and SageMaker Studio (an IDE), so you should compare the tools and run a PoC to see which one fits your organization’s environment.

Kubeflow components

Kubeflow is an open source project (Kubernetes itself is also open source) dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The basic capabilities of workflow and MLOps tools are similar, but if you want to stay on Kubernetes with an open source tool, Kubeflow is a natural choice.

Kubeflow supports ML workflows, including managing data, running notebooks, training and developing models, and serving models, with composability, portability, and scalability. Here are some major components Kubeflow provides.

  • Central dashboard – a web UI providing access to Kubeflow resources such as notebooks, pipelines, and jobs.
  • Jupyter notebooks – spawns notebook servers in Kubeflow and creates notebooks on those servers within a specific namespace.
  • Pipelines – build and deploy portable, scalable end-to-end ML workflows based on containers.
  • KFServing and Seldon Core – open source model serving systems that allow multi-framework model serving.
  • Fairing – streamlines the process of building, training, and deploying machine learning (ML) training jobs.

Kubeflow versioning follows semantic versioning. Since version 1.x, the Kubeflow community attributes stable status to applications and components that meet a defined level of stability, supportability, and upgradability.

Each application is versioned independently of Kubeflow itself, so you need to confirm the application versions and statuses for the Kubeflow version you will install in your environment. There are three application statuses: stable, beta, and alpha. You can check application statuses here.

Create a Kubernetes cluster and install Kubeflow

Kubeflow runs on a Kubernetes cluster, and once your cluster is set up you can move on to installing Kubeflow. You must have three tools installed on your console: eksctl, kubectl, and kfctl. Here are the steps.

  1. Set up a Kubernetes cluster in AWS EKS
  2. (Apply the Nvidia Kubernetes device plugin if needed)
  3. Set up environment variables and download a configuration
  4. Install Kubeflow with the configuration file
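
Before starting, it’s worth a quick sanity check that all three CLIs are on your PATH (your version output will differ from mine):

$ eksctl version
$ kubectl version --client
$ kfctl version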

It’s recommended to use GPU-backed instances to improve performance when training deep neural networks and processing large data sets, although it’s possible to run machine learning workloads on CPU instances.

The AWS tutorial uses 2 x p3.8xlarge instances with the nvidia-device-plugin for GPU support. In this article, I used just 2 x t2.micro instances without GPUs to spin up the Kubernetes cluster, because I don’t test resource-intensive workloads on it.

Set up a Kubernetes cluster in AWS EKS

I used a simple launch template to set up a Kubernetes cluster with 2 x t2.micro EC2 instances.
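
The template file itself isn’t reproduced here, but a minimal eksctl ClusterConfig along these lines would match the setup described (the name and region follow the values used later in this article; adjust both for your environment):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: kubeflow-sample
  region: ap-northeast-1
nodeGroups:
  - name: ng-1
    instanceType: t2.micro   # swap for p3.8xlarge if you need GPUs
    desiredCapacity: 2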

$ eksctl create cluster -f kubeflow-eks-sample.yaml

Confirm the nodes and the storage class have been created successfully.

$ kubectl get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-x-x-x-x.ap-northeast-1.compute.internal   Ready    <none>   18m   v1.14.7-eks-1861c5
ip-x-x-x-x.ap-northeast-1.compute.internal   Ready    <none>   18m   v1.14.7-eks-1861c5

$ kubectl get storageclass
NAME            PROVISIONER             AGE
gp2 (default)   kubernetes.io/aws-ebs   24m

(Apply the Nvidia Kubernetes device plugin if needed)

If you want to apply the nvidia-device-plugin so your worker nodes can serve GPU workloads, here’s the result I got when testing with 2 x p3.8xlarge hosts by following the tutorial. The YAML file creates a DaemonSet, as shown below.

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml
daemonset.extensions/nvidia-device-plugin-daemonset-1.12 created

$ kubectl get daemonset -n kube-system
NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
aws-node                              2         2         2       2            2           <none>          15m
kube-proxy                            2         2         2       2            2           <none>          15m
nvidia-device-plugin-daemonset-1.12   2         2         2       2            2           <none>          2m30s

$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,EC2:.metadata.labels.beta\.kubernetes\.io/instance-type,AZ:.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone"
NAME                                                GPU   EC2          AZ
ip-192-168-27-229.ap-northeast-1.compute.internal   4     p3.8xlarge   ap-northeast-1c
ip-192-168-73-148.ap-northeast-1.compute.internal   4     p3.8xlarge   ap-northeast-1a

Set up environment variables and download a configuration

Once you have kfctl, you need to set up the environment variables used for the Kubeflow YAML file: ${CONFIG_URI} for the path of the configuration YAML to download, and ${AWS_CLUSTER_NAME} for the name of the deployed Kubernetes cluster.

This shell script sets those environment variables and downloads the configuration YAML file into a directory named after the EKS cluster. I used the name “kubeflow-sample” for my EKS cluster, as below.
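
The script isn’t shown in this article, so here is a sketch of what kubeflow-test.sh might contain, assuming the v1.0.1 AWS manifest (pick the CONFIG_URI that matches the Kubeflow version you intend to install):

#!/bin/bash
# Sketch of kubeflow-test.sh: set the variables and download the config.
export AWS_CLUSTER_NAME=kubeflow-sample
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.1.yaml"

mkdir -p ${AWS_CLUSTER_NAME} && cd ${AWS_CLUSTER_NAME}
wget -O kfctl_aws.yaml ${CONFIG_URI}

One caveat: variables exported in a child process don’t survive it, so run the script as source ./kubeflow-test.sh if you want ${AWS_CLUSTER_NAME} available in your current shell for the sed command used later.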

$ chmod +x kubeflow-test.sh
$ ./kubeflow-test.sh

$ ls -al kubeflow-sample/
-rw-rw-r-- 1 user user 3096 Oct 1 10:00 kfctl_aws.yaml

Since v1.0.1, Kubeflow supports AWS IAM Roles for Service Accounts (IRSA) for fine-grained control of AWS service access. When you install Kubeflow, two IAM roles and two corresponding service accounts are created; the IAM roles map directly to the service accounts kf-admin and kf-user under the kubeflow namespace by default. Note that Kubeflow also layers its own notion of namespaces on top of the cluster for multi-user separation, distinct from the EKS cluster’s plain Kubernetes namespaces.
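
After installation you can verify the IRSA mapping on the service accounts yourself; the output below is illustrative, and the annotation value will include your own account ID and a role name derived from your cluster name:

$ kubectl describe serviceaccount kf-admin -n kubeflow
Name:         kf-admin
Namespace:    kubeflow
Annotations:  eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/kf-admin-<cluster-name>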

You can confirm this setting at the bottom of the YAML file: enablePodIamPolicy: true is the attribute set for option 1 (use IAM Roles for Service Accounts). Alternatively, you can use the AWS EKS worker node’s IAM role for option 2 (use the node group role). The documentation explains how to proceed with option 2 using the worker node’s IAM role, and how to look up the role’s ARN from the command line.
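
For reference, the plugin section at the bottom of kfctl_aws.yaml looks roughly like the following; the exact fields vary between Kubeflow releases, so treat this as a sketch rather than a copy-paste target:

plugins:
  - kind: KfAwsPlugin
    metadata:
      name: aws
    spec:
      region: ap-northeast-1
      enablePodIamPolicy: true    # option 1: IAM Roles for Service Accounts
      # Option 2: remove enablePodIamPolicy and list the node group's
      # IAM role instead, e.g.:
      # roles:
      #   - eksctl-kubeflow-sample-nodegroup-ng-1-NodeInstanceRole-XXXX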

Configure Kubeflow

Finally, you need to change the region value in the configuration YAML file to your cluster’s region before deploying the Kubeflow application.
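
The downloaded v1.0.x manifest defaults to us-west-2 (verify this in your copy), so a one-liner in the same style as the cluster-name replacement shown below does the job:

$ sed -i'.bak' -e 's/us-west-2/ap-northeast-1/' kfctl_aws.yaml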

Install Kubeflow with the configuration file

It’s time to deploy Kubeflow with the kfctl tool. If you haven’t installed it yet, here’s the link for further information. Before running the kfctl command, two tricks were needed to complete the Kubeflow setup in my environment. These are not clearly mentioned in the installation documentation, but some information was available in a GitHub issue.

  1. Replace the AWS EKS cluster name in the YAML file
  2. Export the AWS_REGION environment variable

$ sed -i'.bak' -e 's/kubeflow-sample/'"$AWS_CLUSTER_NAME"'/' kfctl_aws.yaml
$ export AWS_REGION=ap-northeast-1

then,

$ kfctl apply -V -f kfctl_aws.yaml
...

INFO[0100] Applied the configuration Successfully!       filename="cmd/apply.go:72"

The deployment may take 3-5 minutes to become ready. The internet-facing FQDN will be exposed through an ingress service; you can open the FQDN shown under ADDRESS in your browser, where you’ll be prompted to set up the initial configuration.

$ kubectl get ingress -n istio-system
NAME            HOSTS   ADDRESS
istio-ingress   *       *-istiosystem-istio-*.ap-northeast.*.com

The Kubeflow UI is now available. I created a notebook server and opened a new notebook as a test. You can divide users into tenants to separate access to Kubeflow resources, but there are some limitations around multi-user isolation depending on the individual applications in Kubeflow.
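
Multi-user separation is driven by Profile resources. Here is a minimal sketch, assuming the v1 Profile API (check kubectl get crd profiles.kubeflow.org for the version your install serves; the profile and user names below are hypothetical):

apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: team-a              # Kubeflow creates a namespace with this name
spec:
  owner:
    kind: User
    name: user@example.com  # this user becomes the namespace owner

$ kubectl apply -f profile-team-a.yaml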

Current integration and limitations

To wrap up, Kubeflow is a nice choice if you’re looking for an open source tool that supports the MLOps components of the ML/AI lifecycle (not full-fledged, but technically solid). The three pillars of Kubeflow are composability, portability, and scalability, which benefit organizations and companies alike.

Leveraging cloud managed services such as AWS EKS is essential to spin up a Kubernetes cluster easily and avoid managing and maintaining the underlying Kubernetes infrastructure. Once it’s up, people can focus on managing data, running notebooks, training and developing models, and serving models in the Kubeflow application.
