
Data Lake Formation: EMR+DLF data lake solution

Last Updated: Mar 26, 2026

The E-MapReduce (EMR) and Data Lake Formation (DLF) combination gives you a centralized, fully managed metadata and permission layer for your data lake on Alibaba Cloud. With this solution, you can ingest data from multiple sources and query it across compute engines without managing a separate metadata store.

After completing this guide, you will have:

  • A running EMR DataLake cluster backed by DLF Unified Metadata

  • Initialized metadata and data in your data lake

  • The ability to query data using Spark SQL or Presto

  • (Optional) Fine-grained permission management and lifecycle rules in place

Before you begin: Steps 2 and 3 each have multiple paths depending on whether you have an existing EMR cluster or are starting fresh. Identify your starting point before proceeding.

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud account with EMR and DLF activated

  • Object Storage Service (OSS) activated in your target region

  • Sufficient permissions to create EMR clusters and DLF catalogs

For supported regions, see Supported regions and endpoints. For billing details, see Billing.

How it works

DLF provides a cross-engine, fully managed metadata service that replaces the per-cluster Hive metastore used in traditional EMR deployments. Key capabilities include:

  • Metadata management: Visualized management with multi-version history and rollback

  • Metadata migration: Migrate metadata from existing EMR clusters

  • Full-text search: Search across all metadata

  • Data profiling: File sizes, row counts, access frequency, small-file counts, file popularity, number of valid files, and more

  • Cross-engine support: Works with MaxCompute, Flink, and Hologres in addition to the open-source EMR stack

  • Permission management: Fine-grained controls across catalogs, databases, columns, and functions; integrations for Spark, Hive, Presto, and Impala

  • Lifecycle management: Automatically archives data based on file popularity and update time, reducing OSS storage costs

  • Storage optimization: Automatic optimization for the Delta Lake format to reduce storage costs

Step 1: Create an EMR DataLake cluster

When creating the cluster, select DLF Unified Metadata for the Metadata parameter — this connects the cluster to DLF.

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

  2. On the EMR on ECS page, click Create Cluster. On the E-MapReduce on ECS page, configure the following parameters:

    • Business scenario: Data Lake

    • Optional services (select at least one): Hive (required). Add other services as needed.

    • Metadata: DLF Unified Metadata

    • DLF catalog: Use the default catalog or create one. If DLF is not yet activated, you are prompted to activate it first.
  3. Complete the remaining steps as prompted. For details, see Create a cluster.

Step 2: Initialize metadata

Choose the path that matches your starting point:

  • Existing EMR cluster with metadata in built-in MySQL or ApsaraDB RDS — Migrate metadata to DLF before continuing. See Migrate EMR metadata to DLF.

  • New EMR cluster with no historical metadata — Create metadata using one of the following methods:

    • DLF console (recommended). Alternatively, you can create databases and tables using Hive or Spark SQL. To use the console:

      1. Log on to the DLF console. In the top navigation bar, select the region where OSS is activated, such as China (Hangzhou).

      2. In the left-side navigation pane, choose Metadata > Metadata.

      3. On the Database tab, click Create Database.

      4. Configure the parameters and click OK.

    • Metadata discovery (if your data is already in OSS) — Use the metadata discovery feature to scan OSS and automatically register metadata in DLF. For a step-by-step example, see DLF data exploration - Taobao user behavior analysis.
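If you create metadata with Hive or Spark SQL instead of the console, the DDL looks like the following sketch. The bucket, database, and table names here are placeholders; pointing LOCATION at an OSS path keeps the data in your lake rather than on cluster-local storage:

```sql
-- Run in spark-sql or hive on the EMR master node.
-- <your-bucket>, demo_db, and user_events are placeholders.
CREATE DATABASE IF NOT EXISTS demo_db
LOCATION 'oss://<your-bucket>/warehouse/demo_db';

CREATE TABLE IF NOT EXISTS demo_db.user_events (
  user_id BIGINT,
  event   STRING
)
PARTITIONED BY (dt STRING);
```

Because the cluster was created with DLF Unified Metadata, objects created this way are registered in DLF and appear in the DLF console as well.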

Step 3: Initialize data

Choose the path that matches your data source:

  • Existing EMR cluster (HDFS data): Use Jindo DistCp to migrate data from the cluster to OSS.

  • Service systems (RDS, MySQL, or Apache Kafka): Use Realtime Compute for Apache Flink to stream data into DLF. See Manage DLF catalogs.
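For the HDFS path, a Jindo DistCp run looks roughly like the following. Treat this as a sketch only: the jar name, its location, and the supported options differ across EMR versions, so confirm them against the Jindo DistCp documentation before running:

```shell
# Illustrative only: the jar name/path and flags vary by EMR version.
hadoop jar jindo-distcp-tool-*.jar \
  --src /user/hive/warehouse/demo_db \
  --dest oss://<your-bucket>/warehouse/demo_db \
  --parallelism 10
```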

Step 4: Query data using Spark SQL or Presto

Connect to the master node of your EMR cluster over SSH. See Log on to a cluster for instructions.

Query with Spark SQL

  1. Start Spark SQL:

    spark-sql
  2. Run a query:

    SELECT * FROM <database>.<table>;

Query with Presto

DLF uses a three-level namespace: <catalog>.<database>.<table>. The catalog identifies the data source. To view available catalogs, run show catalogs; in Presto, or check the Configure tab of the Presto service page in the EMR console.

  1. Start Presto CLI, replacing master-1-1 with the hostname of your master node:

    presto --server master-1-1:8889
  2. Run a query:

    SELECT * FROM <catalog>.<database>.<table>;

    For example, to query the test table in the default database of Hive:

    SELECT * FROM hive.default.test;

(Optional) Step 5: Enable permission management

For data lakes with strict access control requirements, enable DLF permission management to enforce fine-grained permissions across all data in your EMR cluster. After enabling it, users must be granted explicit permissions before they can access any data.

  1. Enable DLF permission management for your EMR cluster. See DLF-Auth.

  2. Configure permissions for your data catalogs in DLF. See Configure permissions.

To grant permissions to users, see Data authorization. For an end-to-end walkthrough, see Use DLF and EMR to manage permissions.

(Optional) Step 6: Configure lifecycle management

Lifecycle management lets you define data retention rules for databases and tables in your data lake. DLF converts the OSS storage class of qualifying data based on three rule types:

  • Creation time: Based on partition and table creation time

  • Last modification time: Based on the last modification time of partitions and tables

  • Partition value: Based on the value of the partition key

This reduces long-term storage costs without manual intervention. For setup instructions, see Lifecycle management.
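The rule types above can be illustrated with a small, self-contained sketch. This is not DLF's implementation or API, just the selection logic of a hypothetical "last modification time" rule: partitions whose data has not been modified for longer than a threshold become candidates for a colder OSS storage class.

```python
from datetime import datetime, timedelta

def partitions_to_archive(partitions, threshold_days, now):
    """Hypothetical 'last modification time' rule: return the names of
    partitions whose last modification is older than the threshold."""
    cutoff = now - timedelta(days=threshold_days)
    return [name for name, mtime in partitions if mtime < cutoff]

partitions = [
    ("dt=2025-01-01", datetime(2025, 1, 1)),   # untouched for months
    ("dt=2026-03-20", datetime(2026, 3, 20)),  # recently written
]
print(partitions_to_archive(partitions, threshold_days=90,
                            now=datetime(2026, 3, 26)))
# → ['dt=2025-01-01']
```

In DLF itself you only declare the rule type and threshold; the scan and the OSS storage-class conversion then happen automatically.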

What's next