
Platform for AI: Connect to EMR to process big data

Last Updated: Jan 19, 2026

Connect Data Science Workshop (DSW) to an E-MapReduce (EMR) cluster to submit and run Spark jobs. This integration lets you use the processing power of EMR to prepare data at scale and then move directly into model development in DSW, combining big data with AI.

Background information

Data preprocessing is a crucial step in machine learning and large language model (LLM) development. The process is often time-consuming and complex, and includes key steps such as data cleaning, transformation, and feature engineering. To address this challenge, DSW integrates with the open-source-based big data platform EMR to offer an all-in-one solution that combines big data processing with AI development.

EMR is a fully managed big data processing service on Alibaba Cloud that integrates Apache Spark. It lets you easily build, manage, and use Spark clusters in the cloud for large-scale data processing, real-time computing, machine learning tasks, and graph processing.

Scope

  • Only the following types of DSW instances can connect to an EMR cluster:

    Pay-as-you-go DSW instances created using public resource groups.

  • Only the following types of EMR clusters are supported:

    • DataLake clusters

    • Custom clusters with Spark 3 and Hadoop installed

Note

Each DSW instance can connect to only one EMR cluster. After the connection is established, you cannot switch to another cluster.

Prerequisites

  • You have activated EMR and created a cluster. For more information, see Create a cluster.

  • You have created a DSW instance. The operating system of the image must be Ubuntu 20.04 or earlier. For the network configuration, you must specify the same VPC and security group as the EMR cluster. For more information, see Create a DSW instance.

    Important

    The VPC and security group of the DSW instance must be the same as those of the EMR cluster. Otherwise, the configuration fails.

Procedure

Open the tutorial file in DSW

  1. Go to the DSW development environment.

    1. Log on to the PAI console.

    2. In the upper-left corner of the page, select the region where the DSW instance is located.

    3. In the navigation pane on the left, click Workspaces. On the Workspace List page, click the name of the default workspace to open it.

    4. In the navigation pane on the left, choose Model Training > Data Science Workshop (DSW).

    5. In the Actions column for the instance that you want to open, click Open to launch the DSW development environment.

  2. On the Launcher page of the Notebook tab, in the Quick Start area, click DSW Gallery under Tool to open the DSW Gallery page.

  3. On the DSW Gallery page, search for Big Data And AI Integration: Submit Spark Jobs To An EMR Cluster. Click Open in DSW. The resources and the tutorial file for this tutorial are automatically downloaded to your DSW instance. The tutorial file opens automatically after the download is complete.

Run the tutorial file

The emr_connect.ipynb file lets you view the tutorial content and run the code directly.

In the tutorial file, click the Run icon to run the command for each step. After a command runs successfully, run the command for the next step.


This tutorial includes the following four steps:

  1. Select an EMR cluster.

  2. Connect to the specified cluster.

  3. Submit a job using spark-submit.

  4. Run a PySpark interactive application.
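For reference, a job submission like the one in step 3 typically looks like the following sketch. The script name and resource options are placeholders, not taken from the tutorial, and the command assumes the cluster connection from steps 1 and 2 is already initialized in the current session.

```shell
# Minimal spark-submit sketch (placeholders, not the tutorial's exact command).
# --master yarn targets the EMR cluster's YARN resource manager.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-memory 2g \
  your_job.py
```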

FAQ

Q: What do I do if a Spark job fails to run after I restart my DSW instance?

After you restart a DSW instance, you must repeat the steps in Select an EMR cluster and Connect to the specified cluster to re-initialize the EMR cluster connection information.

Q: What do I do if I get the "spark-submit: command not found" error?

After you connect to the EMR cluster, open a new Terminal to load the Spark environment parameters. Run the following command to check whether the Spark configuration is active in the current session. If the command returns an empty result, the configuration is not active.

env | grep -i SPARK_HOME

Q: What do I do if I get the "Python in worker has different version * than that in driver *" error?

This error indicates a mismatch between the Python version on the driver client and the version on the cluster's worker nodes.
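One common fix is to pin the driver and the executors to the same interpreter using Spark's standard environment variables, as sketched below. The interpreter name here is an assumption; use a path that resolves to the same Python version on both the DSW instance and the EMR worker nodes.

```shell
# Point both sides of the job at the same interpreter (standard Spark
# environment variables; "python3" is an example, not a verified path).
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
```

Set these variables before you launch the PySpark application so that the submitted job picks them up.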

Q: What do I do if I get the "ModuleNotFoundError: No module named ***" error?

This error occurs because the Python environment on the cluster's executors does not include the packages that the PySpark application requires. You can use the `spark.archives` configuration to ship your local Python environment to the remote cluster. Alternatively, you can manually install the dependencies on each worker node in the cluster.
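The `spark.archives` approach can be sketched as follows, assuming a local conda environment. The environment name `pyspark_env` and the script `my_app.py` are placeholders, and `conda-pack` must be installed locally (for example, with `pip install conda-pack`); `spark.archives` is a standard Spark 3 configuration.

```shell
# Package the local conda environment into an archive that Spark can
# distribute to the executors.
conda pack -n pyspark_env -o pyspark_env.tar.gz

# Ship the archive with the job and point the executors' Python at the
# unpacked environment ("environment" is the alias after the # sign).
spark-submit \
  --conf spark.archives=pyspark_env.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  my_app.py
```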

Q: Why is the PySpark option missing when I create a new Notebook after initializing the PySpark kernel?

In the toolbar of the Notebook tab, choose Kernel > Restart Kernel.