All Products
Search
Document Center

E-MapReduce:Use third-party Python libraries in a notebook

Last Updated:Mar 26, 2026

EMR Serverless Spark notebooks support three ways to install third-party Python libraries for interactive PySpark jobs. Choose a method based on how the libraries are used and whether they need to persist across sessions.

MethodBest for
Method 1: install with pipProcessing non-Spark variables in a notebook, such as values returned by Spark or custom variables
Method 2: create a runtime environmentPySpark jobs that need libraries preinstalled at every session start
Method 3: Add Spark configurations with a conda environmentPySpark distributed computing that requires libraries available across all Spark executors

Prerequisites

Before you begin, ensure that you have:

Method 1: install with pip

Use this method to install libraries directly into the current session for processing non-Spark variables.

Important

Libraries installed with pip are not persisted. Reinstall them each time you restart a notebook session.

  1. In the E-MapReduce (EMR) console, go to EMR Serverless > Spark, click your workspace name, then click Development in the left navigation pane.

  2. Double-click your notebook to open it.

  3. In a Python cell, run the following command to install scikit-learn:

    pip install scikit-learn

Example: classify data with scikit-learn

The following example trains a support vector machine (SVM) classifier on the Iris dataset to demonstrate scikit-learn in a notebook.

Add a Python cell, enter the following code, and click the image icon to run it.

# Import datasets from the scikit-learn library.
from sklearn import datasets

# Load the built-in Iris dataset.
iris = datasets.load_iris()
X = iris.data    # Feature data
y = iris.target  # Labels

# Split into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a linear kernel.
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

# Evaluate the model.
from sklearn.metrics import classification_report, accuracy_score
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

The output looks similar to the following:

image

Method 2: create a runtime environment

Use this method to define a persistent custom Python environment. The libraries you specify are preinstalled each time the notebook session starts.

Step 1: create a runtime environment

  1. In the EMR console, go to EMR Serverless > Spark, click your workspace name, then click Environments in the left navigation pane.

  2. Click Create Environment.

  3. On the Create Environment page, configure the Name parameter, then click Add Library in the Libraries section. For more information, see Manage runtime environments.

  4. In the New Library dialog, set Type to PyPI, enter the library name (and optionally a version) in the PyPI Package field, then click OK. If you omit the version, the latest version is installed. Example: scikit-learn.

  5. Click OK to create the environment. The system begins initializing the runtime environment.

Step 2: attach the runtime environment to a session

Stop the session before modifying it.
  1. In the left navigation pane, go to O&M Center > Sessions, then click the Notebook Session tab.

  2. Find your session and click Edit in the Actions column.

  3. Select the runtime environment you created from the Environment drop-down list, then click Save Changes.

  4. Click Start in the upper-right corner.

Step 3: classify data with scikit-learn

  1. In the left navigation pane, click Development, then double-click your notebook.

  2. Add a Python cell, enter the following code, and click the image icon to run it.

# Import datasets from the scikit-learn library.
from sklearn import datasets

# Load the built-in Iris dataset.
iris = datasets.load_iris()
X = iris.data    # Feature data
y = iris.target  # Labels

# Split into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a linear kernel.
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

# Evaluate the model.
from sklearn.metrics import classification_report, accuracy_score
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

The output looks similar to the following:

image

Method 3: add Spark configurations with a conda environment

Use this method when libraries must be available to Spark executors for distributed computing. You package a conda environment locally, upload it to Object Storage Service (OSS), and configure the session to use it.

Requirements: ipykernel 6.29 or later, jupyter_client 8.6 or later, Python 3.8 or later. The environment must be packaged on Linux (x86 architecture).

Step 1: create and package a conda environment

  1. Install Miniconda:

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    chmod +x Miniconda3-latest-Linux-x86_64.sh
    ./Miniconda3-latest-Linux-x86_64.sh -b
    source miniconda3/bin/activate
  2. Create a conda environment with Python 3.8, install the required libraries, and package it:

    # Create and activate the environment.
    conda create -y -n pyspark_conda_env python=3.8
    conda activate pyspark_conda_env
    
    # Install third-party libraries.
    pip install numpy \
      ipykernel~=6.29 \
      jupyter_client~=8.6 \
      jieba \
      conda-pack
    
    # Package the environment.
    conda pack -f -o pyspark_conda_env.tar.gz

Step 2: upload the package to OSS

Upload pyspark_conda_env.tar.gz to an OSS bucket and copy the full OSS path. See Simple upload.

Step 3: configure and start the notebook session

Stop the session before modifying it.
  1. In the left navigation pane, go to O&M Center > Sessions, then click the Notebook Session tab.

  2. Find your session and click Edit in the Actions column.

  3. On the Modify Notebook Session page, add the following to the Spark Configuration field and click Save Changes:

    spark.archives  oss://<yourBucket>/path/to/pyspark_conda_env.tar.gz#env
    spark.pyspark.python ./env/bin/python

    Replace <yourBucket>/path/to with the OSS path you copied in the previous step.

  4. Click Start in the upper-right corner.

Step 4: process text data with jieba

jieba is a third-party Python library for Chinese text segmentation. For license information, see LICENSE.
  1. In the left navigation pane, click Development, then double-click your notebook.

  2. Add a Python cell, enter the following code, and click the image icon to run it.

    import jieba
    
    strs = [
        "EMR Serverless Spark is a fully-managed serverless service for large-scale data processing and analysis.",
        "EMR Serverless Spark supports efficient end-to-end services, such as task development, debugging, scheduling, and O&M.",
        "EMR Serverless Spark supports resource scheduling and dynamic scaling based on job loads."
    ]
    
    sc.parallelize(strs).flatMap(lambda s: jieba.cut(s, use_paddle=True)).collect()

    The output looks similar to the following:

    image