Build a Machine Learning System Using Kubernetes

By Che Yang, nicknamed Biran At Alibaba.

This is part of a series of articles that uses Alibaba Cloud Container Service as an example to help you get started with Kubeflow Pipelines.

The engineering associated with machine learning is complex due to common software development problems and the data-driven features of machine learning. As a result, the workflow becomes longer, data versions are out of control, experiments cannot be easily traced, results cannot be conveniently reproduced, and it is costly to iterate the model. To resolve these inherent issues in machine learning, many enterprises have built internal machine learning platforms to manage the machine learning lifecycle, such as Google's TensorFlow Extended platform, Facebook's FBLearner Flow platform, and Uber's Michelangelo platform. However, these platforms depend on the internal infrastructure of these enterprises. This means that they cannot be completely open-source. These platforms use the machine learning workflow as the framework. This framework enables data scientists to flexibly define their own machine learning pipelines and reuse existing data processing and model training capabilities to better manage the machine learning lifecycle.

Google has extensive experience in building machine learning workflow platforms. Its TensorFlow Extended platform supports Google's core businesses such as search, translation, and video playback. More importantly, Google has a profound understanding of engineering efficiency in the machine learning field. Google's Kubeflow team made Kubeflow Pipelines open-source at the end of 2018. Kubeflow Pipelines is designed in the same way as Google's internal TensorFlow Extended machine learning platform. The only difference is that Kubeflow Pipelines runs on the Kubernetes platform while TensorFlow Extended runs on Borg.

What is Kubeflow Pipelines?

The Kubeflow Pipelines platform consists of the following components:

A console for running and tracing experiments.
The Argo workflow engine for performing multiple machine learning steps.
A software development kit (SDK) for defining workflows. Currently, only the Python SDK is supported.

You can use Kubeflow Pipelines to achieve the following goals:

End-to-end task orchestration: You can orchestrate and organize a complex machine learning workflow. This workflow can be triggered directly at a scheduled time, or be triggered by events or even by data changes.
Easy experiment management: Scientists can try numerous ideas and frameworks and manage various experiments. Kubeflow Pipelines also facilitates the transition from experiments to production.
Easy reuse: You can quickly create end-to-end solutions by reusing pipelines and components, without the need to rebuild experiments from scratch each time.

Run Kubeflow Pipelines on Alibaba Cloud

In view of the capabilities of Kubeflow Pipelines, are you eager to take a sneak peek? However, to use Kubeflow Pipelines in China, you must overcome the following challenges:

Pipelines must be deployed by using Kubeflow. However, Kubeflow includes many built-in components and it is complicated to use Ksonnet to deploy Kubeflow.
In addition, with its heavy dependence on the Google cloud platform, Pipelines cannot run on other cloud platforms or bare metal servers.

To allow users to install Kubeflow Pipelines in China, the Alibaba Cloud Container Service team provides a Kustomize-based deployment solution. Unlike basic Kubeflow services, Kubeflow Pipelines depends on stateful services such as MySQL and Minio. Therefore, data persistence and backup are required. In this example, we use standard Alibaba Cloud solid-state drives (SSDs) in the data persistence solution to automatically store MySQL and Minio data separately. You can deploy the latest version of Kubeflow Pipelines on Alibaba Cloud offline.

Prerequisites

You must install Kustomize.

In Linux or Mac OS, run the following commands:

opsys=linux  # or darwin, or windows
curl -s https://api.github.com/repos/kubernetes-sigs/kustomize/releases/latest |\
  grep browser_download |\
  grep $opsys |\
  cut -d '"' -f 4 |\
  xargs curl -O -L
mv kustomize_*_${opsys}_amd64 /usr/bin/kustomize
chmod u+x /usr/bin/kustomize

In Windows, you can download and install kustomize_2.0.3_windows_amd64.exe.

For more information about creating a Kubernetes cluster in Alibaba Cloud Container Service, click Here.

Deployment Procedure

Log on to the Kubernetes cluster through secure shell (SSH). For more information about this step, click Here.

Download the source code.

yum install -y git
git clone --recursive https://github.com/aliyunContainerService/kubeflow-aliyun

Specify security configurations.

Configure a transport layer security (TLS) certificate. If you do not have any TLS certificates, run the following commands to generate one:

yum install -y openssl
domain="pipelines.kubeflow.org"
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.key -out kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.crt -subj "/CN=$domain/O=$domain"

If you have a TLS certificate, upload the private key and certificate to kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.key and kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.crt, respectively.

Set a password for the admin account.

yum install -y httpd-tools
htpasswd -c kubeflow-aliyun/overlays/ack-auto-clouddisk/auth admin
New password:
Re-type new password:
Adding password for user admin

Use Kustomize to generate a .yaml deployment file.

cd kubeflow-aliyun/
kustomize build overlays/ack-auto-clouddisk > /tmp/ack-auto-clouddisk.yaml

Check the region and zone of the Kubernetes cluster, and replace the zone as required. For example, if your cluster is in the cn-hangzhou-g zone, you can run the following commands:

sed -i.bak 's/regionid: cn-beijing/regionid: cn-hangzhou/g' \
    /tmp/ack-auto-clouddisk.yaml
sed -i.bak 's/zoneid: cn-beijing-e/zoneid: cn-hangzhou-g/g' \
    /tmp/ack-auto-clouddisk.yaml

We recommend that you check whether the /tmp/ack-auto-clouddisk.yaml file is updated.

Change the container image address from gcr.io to registry.aliyuncs.com.

sed -i.bak 's/gcr.io/registry.aliyuncs.com/g' \
    /tmp/ack-auto-clouddisk.yaml

We recommend that you check whether the /tmp/ack-auto-clouddisk.yaml file is updated.

Adjust the disk space, for example, to 200 GB.

sed -i.bak 's/storage: 100Gi/storage: 200Gi/g' \
    /tmp/ack-auto-clouddisk.yaml

Verify the .yaml file of the Kubeflow Pipelines.

kubectl create --validate=true --dry-run=true -f /tmp/ack-auto-clouddisk.yaml

Deploy the Kubeflow Pipelines service through kubectl.

kubectl create -f /tmp/ack-auto-clouddisk.yaml

Let's look at how to access the Kubeflow Pipelines service. Here, we use Ingress to expose the Kubeflow Pipelines service. In this example, the IP address of the Kubeflow Pipelines service is 112.124.193.271. The URL of the Kubeflow Pipelines console is [https://112.124.193.271/pipeline/]().

kubectl get ing -n kubeflow
NAME             HOSTS   ADDRESS           PORTS     AGE
ml-pipeline-ui   *       112.124.193.271   80, 443   11m

Gain access to the Kubeflow Pipelines console.

If you are using a self-signed certificate, the system will notify you that the connection is not private. You can click Advanced to view details and then click Visit to visit the website.

Enter the username admin and the password you specified before.

Now, you can manage and run training tasks in the Kubeflow Pipelines console.

Some Considerations.

1. Why do we use standard Alibaba Cloud SSDs in this example?

Standard Alibaba Cloud SSDs have several advantages. For example, you can schedule standard Alibaba Cloud SSDs to periodically back up the metadata of Kubeflow Pipelines to prevent data loss.

2. How can I back up the data of a disk?

If you want to back up the data stored in a disk, you can manually create snapshots of the disk or apply an automatic snapshot creation policy to the disk to automatically create snapshots on a schedule.

3. How can I undeploy Kubeflow Pipelines?

To undeploy Kubeflow Pipelines, follow these steps:

Delete the components of the Kubeflow Pipelines.

kubectl delete -f /tmp/ack-auto-clouddisk.yaml

Click Release disks to release the two disks that store MySQL and Minio data.

4. How can I use an existing disk as a database storage instead of automatically creating a disk?

For a detailed answer to this question, click here.

Summary

This document has gone over what Kubeflow Pipelines are, covered some of the issues resolved by Kubeflow Pipelines, and the procedure for using Kustomize on Alibaba Cloud to quickly build Kubeflow Pipelines for machine learning. In the future, we will share the procedure for developing a complete machine learning workflow based on Kubeflow Pipelines.

Community

Build a Machine Learning System Using Kubernetes

What is Kubeflow Pipelines?

Run Kubeflow Pipelines on Alibaba Cloud

Prerequisites

Deployment Procedure

Some Considerations.

Summary

Read previous post:

Read next post:

Alibaba Container Service

You may also like

Comments

Alibaba Container Service

Related Products

ACK One

Container Registry

Container Service for Kubernetes

Elastic Container Instance