Create a Data Science cluster on EMR on ACK

Create an E-MapReduce (EMR) Data Science cluster on an existing Container Service for Kubernetes (ACK) cluster to run big data ETL workloads and GPU-accelerated AI training jobs.

Prerequisites

Before you begin, ensure that you have:

The AliyunOSSFullAccess and AliyunDLFFullAccess policies attached to your Alibaba Cloud account. For details, see Attach policies to a RAM role.
An ACK cluster (dedicated or managed) that meets the requirements in the following table. To create an ACK cluster, see Create an ACK dedicated cluster or Create an ACK managed cluster.
Requirement	Value
Kubernetes version	1.22–1.24
vCPU	16 or more
Memory	64 GiB or more
Instance type	General-purpose, compute-optimized, or memory-optimized (ecs.g5, ecs.g6, ecs.g7, or higher)
A node pool created in the ACK cluster. For details, see Create and manage a node pool.
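
The numeric requirements above can be checked before you open the EMR console. The following is a minimal sketch, not an official tool: `AckClusterSpec` and `unmet_requirements` are hypothetical names, and the sketch assumes you have already read the Kubernetes version, vCPU count, and memory size from the ACK console. The version string format shown is only an example. The instance type requirement is not checked.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Thresholds copied from the requirements table in this topic.
MIN_K8S = (1, 22)
MAX_K8S = (1, 24)
MIN_VCPU = 16
MIN_MEMORY_GIB = 64


@dataclass
class AckClusterSpec:
    """Hypothetical container for values read from the ACK console."""
    kubernetes_version: str  # example format only: "1.24.6-aliyun.1"
    vcpu: int
    memory_gib: int


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.24.6-aliyun.1'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def unmet_requirements(spec: AckClusterSpec) -> List[str]:
    """Return a readable list of the requirements that the spec fails."""
    problems = []
    if not (MIN_K8S <= parse_major_minor(spec.kubernetes_version) <= MAX_K8S):
        problems.append("Kubernetes version must be 1.22-1.24")
    if spec.vcpu < MIN_VCPU:
        problems.append("vCPU must be 16 or more")
    if spec.memory_gib < MIN_MEMORY_GIB:
        problems.append("Memory must be 64 GiB or more")
    return problems
```

An empty list means the three numeric requirements are met. For example, `unmet_requirements(AckClusterSpec("1.21.0", 32, 128))` reports only the Kubernetes version.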

Important

Each ACK cluster can be associated with only one Data Science cluster.

Warning

Creating a Data Science cluster overwrites the following namespaces in the associated ACK cluster: anonymous, cert-manager, fluid-system, ingress-nginx, istio-system, knative-serving, kubeflow, kubernetes-dashboard, and monitoring.
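
Before you create the cluster, it may help to confirm whether any of these namespaces already exist in the ACK cluster. The sketch below assumes you have saved the output of `kubectl get namespaces -o name` (one `namespace/<name>` entry per line) and reports which entries appear in the list above. The function name `namespaces_at_risk` is hypothetical.

```python
from typing import List

# Namespaces named in the warning in this topic.
OVERWRITTEN_NAMESPACES = frozenset({
    "anonymous", "cert-manager", "fluid-system", "ingress-nginx", "istio-system",
    "knative-serving", "kubeflow", "kubernetes-dashboard", "monitoring",
})


def namespaces_at_risk(kubectl_output: str) -> List[str]:
    """Return the existing namespaces that cluster creation would overwrite.

    Expects the text printed by `kubectl get namespaces -o name`.
    Bare namespace names (without the "namespace/" prefix) are also accepted.
    """
    found = set()
    for line in kubectl_output.splitlines():
        name = line.strip()
        if name.startswith("namespace/"):
            name = name[len("namespace/"):]
        if name in OVERWRITTEN_NAMESPACES:
            found.add(name)
    return sorted(found)
```

If the returned list is not empty, back up or migrate the workloads in those namespaces first, or use a different ACK cluster.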

Create a Data Science cluster

Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.
On the EMR on ACK page, click Create Cluster.
On the E-MapReduce on ACK page, configure the cluster parameters. See Parameter reference for details on each field.
Click Create.

The cluster is ready when its status changes to Running.
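
If you script around cluster creation, waiting for the Running status can be expressed as a simple polling loop. This is only a sketch: `get_status` is a hypothetical callable that you would supply to return the status text shown in the console, and the timeout and interval defaults are assumptions rather than documented values.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],  # hypothetical: returns the cluster's current status text
    timeout_s: float = 1800.0,      # assumed limit, not a documented value
    interval_s: float = 30.0,       # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is 'Running'; return False if the timeout elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "Running":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` parameter is injectable so that the loop can be exercised in tests without real delays.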

Parameter reference

Parameter	Description
Region	The region where the cluster is created. The region cannot be changed after the cluster is created.
Cluster type	Select Data Science. Data Science clusters support offline ETL with Hive and Spark, and TensorFlow model training on a CPU+GPU heterogeneous computing framework that uses NVIDIA GPUs for deep learning. This cluster type is suited to combined big data and AI workloads.
Product version	The EMR version to deploy. The latest version is selected by default.
Component version	Read-only. Displays the components and their versions included in the selected cluster type.
ACK Cluster	Select an existing ACK cluster, or go to the ACK console to create one.
Configure Dedicated Nodes	(Optional) Add taints and labels to a node or node pool to reserve it exclusively for EMR workloads.
Cluster name	A name for the cluster. Must be 1–64 characters and can contain letters, digits, hyphens (-), and underscores (_).
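
The cluster name rule in the table can be checked with a short regular expression before you submit the form. This sketch assumes that "letters" means ASCII letters only, which the table does not confirm. The names `CLUSTER_NAME_PATTERN` and `is_valid_cluster_name` are illustrative.

```python
import re

# 1-64 characters drawn from letters, digits, hyphens, and underscores.
# Assumption: "letters" means ASCII letters (A-Z, a-z) only.
CLUSTER_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the whole name satisfies the length and character rules."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_cluster_name("ds-cluster_01")` returns `True`, while an empty string, a name with spaces, or a name longer than 64 characters returns `False`.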