Data Lake Formation: Access DLF from EMR on ECS Spark

Last Updated: Mar 26, 2026

Connect Apache Spark on E-MapReduce (EMR) on ECS to a Data Lake Formation (DLF) catalog using Apache Paimon's REST metastore interface.

Prerequisites

Before you begin, ensure that you have:

  • An EMR on ECS cluster running version 5.12.0 or later, with Spark 3 and Paimon selected as components. For other version requirements, contact the DLF developers in DingTalk group 106575000021.

  • Completed the Quick Start for DLF

  • EMR and DLF deployed in the same region, with the Virtual Private Cloud (VPC) of your EMR cluster added to the DLF whitelist

Create a catalog

See Set up DLF.

Grant DLF permissions to a role

Step 1: Attach the RAM policy to AliyunECSInstanceForEMRRole

Note

This step will no longer be required once EMR is natively integrated with DLF.

  1. Log on to the Resource Access Management (RAM) console using your Alibaba Cloud account or as a RAM administrator.

  2. In the navigation pane, choose Identity Management > Roles, then search for AliyunECSInstanceForEMRRole.

  3. In the Actions column, click Add Permissions.

  4. Under Permission Policies, search for and select AliyunDLFFullAccess, then click Confirm.


Step 2: Grant DLF permissions to AliyunECSInstanceForEMRRole

  1. Log on to the Data Lake Formation console.

  2. On the Catalogs page, click the name of the catalog to open its details page.

  3. Click the Permissions tab to grant permissions at the catalog level. To grant permissions at a lower scope, navigate to the target database or table and click its Permissions tab instead.

  4. On the authorization page, configure the following settings and click OK.

    Note

    If AliyunECSInstanceForEMRRole does not appear in the dropdown list, go to the user management page and click Sync.

    Field | Value
    User/Role | RAM User/RAM Role
    Select Authorization Object | AliyunECSInstanceForEMRRole
    Preset Permission Type | Select read permissions manually, or choose a predefined role: Data Reader or Data Editor

Upgrade Paimon dependencies in your EMR cluster

Download Paimon version 1.1 or later from the Maven repository. You need two JAR files that match the Spark version of your EMR cluster:

JAR file | Description
paimon-jindo-*.jar | Paimon integration for Alibaba Cloud storage (Jindo)
paimon-spark-3.x-*.jar | Paimon connector for Spark 3. Match the 3.x suffix to your Spark minor version.
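To pick the right 3.x suffix, check the Spark version of your cluster (for example, with `spark-submit --version`) and keep only the major.minor part. A minimal bash sketch, where the `sample` string stands in for the real command output:

```shell
#!/bin/bash
# Extract the Spark major.minor version to pick the matching paimon-spark JAR.
# "sample" stands in for the output of `spark-submit --version` on the cluster.
sample="Welcome to Spark version 3.3.1"
minor=$(echo "$sample" | grep -oE '3\.[0-9]+' | head -n 1)
echo "Download paimon-spark-${minor}-*.jar"
```

For a cluster running Spark 3.3.1, this selects the paimon-spark-3.3 connector.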

Step 1: Upload JAR files and the upgrade script to OSS

  1. Upload both JAR files to Object Storage Service (OSS) and set their permissions to public-read. For upload instructions, see Simple upload.

  2. Modify the following script by replacing the two placeholder URLs with the actual OSS download URLs of your JAR files, then upload the script to OSS.

    Important

    EMR on ECS clusters cannot access the public network by default. Use private network URLs when your OSS bucket and EMR cluster are in the same region.

    Placeholder | JAR file | URL format
    <paimon-jindo-1.1.0.jar-url> | paimon-jindo-*.jar | Private network: https://<bucket>.oss-cn-hangzhou-internal.aliyuncs.com/jars/paimon-jindo-1.1.0.jar. Public network: https://<bucket>.oss-cn-hangzhou.aliyuncs.com/jars/paimon-jindo-1.1.0.jar
    <paimon-spark-3.x-1.1.0.jar-url> | paimon-spark-3.x-*.jar | Private network: https://<bucket>.oss-cn-hangzhou-internal.aliyuncs.com/jars/paimon-spark-3.x-1.1.0.jar. Public network: https://<bucket>.oss-cn-hangzhou.aliyuncs.com/jars/paimon-spark-3.x-1.1.0.jar
    #!/bin/bash
    set -e
    
    echo 'clean up existing paimon-dlf-2.5 files'
    rm -rf /opt/apps/PAIMON/paimon-dlf-2.5
    rm -rf /opt/apps/PAIMON/paimon-dlf-2.5.tar.gz.*
    
    # Copy the current Paimon installation so that files other than the two
    # upgraded JARs are preserved after the symlink is switched.
    cp -a "$(readlink -f /opt/apps/PAIMON/paimon-current)" /opt/apps/PAIMON/paimon-dlf-2.5
    
    # Replace the old Paimon JARs with the downloaded ones.
    cd /opt/apps/PAIMON/paimon-dlf-2.5/lib/spark3
    rm -f paimon-jindo-*.jar paimon-spark-*.jar
    wget <paimon-jindo-1.1.0.jar-url>
    wget <paimon-spark-3.x-1.1.0.jar-url>
    
    echo 'link paimon-current to paimon-dlf-2.5'
    rm -f /opt/apps/PAIMON/paimon-current
    ln -sf /opt/apps/PAIMON/paimon-dlf-2.5 /opt/apps/PAIMON/paimon-current

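When filling in the placeholders, a public OSS URL can be converted to its VPC-internal form by inserting the `-internal` suffix before `.aliyuncs.com`. A minimal bash sketch, where the bucket name `my-bucket` and region `cn-hangzhou` are example values:

```shell
#!/bin/bash
# Rewrite a public OSS URL to its VPC-internal form by inserting "-internal"
# before ".aliyuncs.com". Bucket name and region are example values.
public_url="https://my-bucket.oss-cn-hangzhou.aliyuncs.com/jars/paimon-jindo-1.1.0.jar"
internal_url="${public_url/.aliyuncs.com/-internal.aliyuncs.com}"
echo "$internal_url"
```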

Step 2: Run the script on your EMR cluster

Run the script as a Script Action across all nodes. For details, see Manually run a script.

  1. In your EMR cluster, go to Script Action > Manual Execution and click Create And Execute.

  2. In the dialog box, configure the following settings and click OK.

    Field | Value
    Name | A custom name for the script
    Script Location | The OSS path of the upgrade script. Format: oss://**/*.sh
    Execution Scope | Cluster
  3. After the script completes, restart the Spark service for the changes to take effect.

Read and write data with Spark

Connect to the Paimon catalog

Run the following spark-sql command. Replace <regionID> with your region ID (for example, cn-hangzhou), and replace <catalog> with the name of the DLF catalog you created.

spark-sql --master yarn \
  --conf spark.driver.memory=5g \
  --conf spark.sql.defaultCatalog=paimon \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.metastore=rest \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.paimon.uri=http://<regionID>-vpc.dlf.aliyuncs.com \
  --conf spark.sql.catalog.paimon.warehouse=<catalog> \
  --conf spark.sql.catalog.paimon.token.provider=dlf \
  --conf spark.sql.catalog.paimon.dlf.token-loader=ecs

Key parameters:

Parameter | Description
spark.sql.catalog.paimon.metastore | Set to rest to use the Paimon REST metastore protocol.
spark.sql.catalog.paimon.uri | The DLF REST catalog endpoint for your region. Uses the VPC internal address.
spark.sql.catalog.paimon.warehouse | The DLF catalog name. This is a catalog instance name, not a file path.
spark.sql.catalog.paimon.token.provider | Set to dlf to authenticate with DLF.
spark.sql.catalog.paimon.dlf.token-loader | Set to ecs to load credentials from the ECS instance RAM role automatically, without configuring an access key.
Note

This guide uses ECS instance RAM role authentication. For other authentication methods (access key, STS token), see the Apache Paimon DLF token documentation.
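For reference, if the cluster cannot use an ECS instance RAM role, the Paimon DLF token provider also accepts an explicit AccessKey. The sketch below is an assumption based on the Apache Paimon DLF token documentation; verify the `dlf.access-key-id` and `dlf.access-key-secret` option names against the Paimon version you deploy:

```shell
# Hedged sketch: AccessKey authentication instead of the ECS instance RAM role.
# The dlf.access-key-* option names are assumptions to verify against the
# Apache Paimon documentation for your Paimon version.
spark-sql --master yarn \
  --conf spark.sql.defaultCatalog=paimon \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.metastore=rest \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.paimon.uri=http://<regionID>-vpc.dlf.aliyuncs.com \
  --conf spark.sql.catalog.paimon.warehouse=<catalog> \
  --conf spark.sql.catalog.paimon.token.provider=dlf \
  --conf spark.sql.catalog.paimon.dlf.access-key-id=<access-key-id> \
  --conf spark.sql.catalog.paimon.dlf.access-key-secret=<access-key-secret>
```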

Create tables

Run the following SQL to create a managed table and a foreign table.

CREATE TABLE user_samples
(
    user_id INT,
    age INT,
    gender_code STRING,
    clk BOOLEAN
);

CREATE TABLE user_samples_di (
    user_id INT,
    age INT,
    gender_code STRING,
    clk BOOLEAN
)
USING CSV
OPTIONS(
'path'='oss://<bucket>/user/user_samples_di'
);

Table behavior:

Table | Type | Metadata | Data files | Drop behavior
user_samples | Managed | DLF | OSS (under the catalog's default path) | Metadata and data files are both deleted
user_samples_di | Foreign | DLF | OSS (at the path you specify) | Only metadata is deleted; data files in OSS are retained
Note

If you do not specify a database, tables are created in the default database of the catalog. Create the /user/user_samples folder in OSS before running the managed table statement.
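One way to pre-create that folder is the ossutil CLI, which provides a `mkdir` command for creating a directory object. A sketch assuming ossutil is installed and configured; `<bucket>` is a placeholder:

```shell
# Sketch: pre-create the OSS folder expected by the managed table statement.
# Assumes ossutil is installed and configured; <bucket> is a placeholder.
ossutil mkdir oss://<bucket>/user/user_samples/
```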

Insert data

INSERT INTO user_samples VALUES
(1, 25, 'M', true),
(2, 18, 'F', false);

INSERT INTO user_samples_di VALUES
(1, 25, 'M', true),
(2, 18, 'F', true),
(3, 35, 'M', true);

Query data

SELECT * FROM user_samples;
SELECT * FROM user_samples_di;

Merge data

The following statement merges rows from user_samples_di into user_samples, matching on user_id. Matched rows are updated; unmatched rows are inserted.

MERGE INTO user_samples
USING user_samples_di
ON user_samples.user_id = user_samples_di.user_id
WHEN MATCHED THEN
UPDATE SET
  age = user_samples_di.age,
  gender_code = user_samples_di.gender_code,
  clk = user_samples_di.clk
WHEN NOT MATCHED THEN
  INSERT (user_id, age, gender_code, clk)
  VALUES (user_samples_di.user_id, user_samples_di.age, user_samples_di.gender_code, user_samples_di.clk);

Considerations

  • Foreign table vs. managed table: Dropping a foreign table removes only its metadata—data files stored in OSS are not deleted.

  • OSS directory pre-creation: You must create the /user/user_samples directory in OSS before creating a managed table. If the directory does not exist, the table creation statement fails.

  • Network access: EMR on ECS clusters cannot access the public network by default. When uploading Paimon JARs to OSS, use private network URLs to ensure the upgrade script can download them from within the cluster.

  • VPC whitelist: The VPC of your EMR cluster must be added to the DLF whitelist before Spark can connect to the DLF catalog endpoint.