
Data Lake Formation: EMR+DLF data lake solution

Last Updated: Mar 26, 2026

The E-MapReduce (EMR) and Data Lake Formation (DLF) combination gives you a centralized, fully managed metadata and permission layer for your data lake on Alibaba Cloud. With this solution, you can ingest data from multiple sources and query it across compute engines without managing a separate metadata store.

After completing this guide, you will have:

  • A running EMR DataLake cluster backed by DLF Unified Metadata

  • Initialized metadata and data in your data lake

  • The ability to query data using Spark SQL or Presto

  • (Optional) Fine-grained permission management and lifecycle rules in place

Before you begin: Steps 2 and 3 each have multiple paths depending on whether you have an existing EMR cluster or are starting fresh. Identify your starting point before proceeding.

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud account with EMR and DLF activated

  • Object Storage Service (OSS) activated in your target region

  • Sufficient permissions to create EMR clusters and DLF catalogs

For supported regions, see Supported regions and endpoints. For billing details, see Billing.

How it works

DLF provides a cross-engine, fully managed metadata service that replaces the per-cluster Hive metastore used in traditional EMR deployments. Key capabilities include:

  • Metadata management: Visualized management with multi-version history and rollback

  • Metadata migration: Migrate metadata from existing EMR clusters

  • Full-text search: Search across all metadata

  • Data profiling: File sizes, row counts, access frequency, small-file counts, file popularity, number of valid files, and more

  • Cross-engine support: Works with MaxCompute, Flink, and Hologres in addition to the open-source EMR stack

  • Permission management: Fine-grained controls across catalogs, databases, columns, and functions; integrations for Spark, Hive, Presto, and Impala

  • Lifecycle management: Automatically archives data based on file popularity and update time, reducing OSS storage costs

  • Storage optimization: Automatic optimization for the Delta Lake format to reduce storage costs

Step 1: Create an EMR DataLake cluster

When creating the cluster, select DLF Unified Metadata for the Metadata parameter — this connects the cluster to DLF.

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

  2. On the EMR on ECS page, click Create Cluster. On the E-MapReduce on ECS page, configure the following parameters:

    • Business scenario: Data Lake

    • Optional services (select at least one): Hive (required). Add other services as needed.

    • Metadata: DLF Unified Metadata

    • DLF catalog: Use the default catalog or create one. If DLF is not yet activated, you are prompted to activate it first.
  3. Complete the remaining steps as prompted. For details, see Create a cluster.

Step 2: Initialize metadata

Choose the path that matches your starting point:

  • Existing EMR cluster with metadata in built-in MySQL or ApsaraDB RDS — Migrate metadata to DLF before continuing. See Migrate EMR metadata to DLF.

  • New EMR cluster with no historical metadata — Create metadata using one of the following methods:

    • DLF console (recommended). Alternatively, you can create databases and tables using Hive or Spark SQL. To use the console:

      1. Log on to the DLF console. In the top navigation bar, select the region where OSS is activated, such as China (Hangzhou).

      2. In the left-side navigation pane, choose Metadata > Metadata.

      3. On the Database tab, click Create Database.

      4. Configure the parameters and click OK.

    • Metadata discovery (if your data is already in OSS) — Use the metadata discovery feature to scan OSS and automatically register metadata in DLF. For a step-by-step example, see DLF data exploration - Taobao user behavior analysis.
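If you create metadata with Hive or Spark SQL instead of the console, the DDL looks like the following sketch. The bucket, database, and table names here are placeholders; pointing LOCATION at an OSS path keeps the data in your lake rather than on cluster-local storage:

```sql
-- Run in spark-sql or hive on the EMR master node.
-- <your-bucket>, demo_db, and user_events are placeholders.
CREATE DATABASE IF NOT EXISTS demo_db
LOCATION 'oss://<your-bucket>/warehouse/demo_db';

CREATE TABLE IF NOT EXISTS demo_db.user_events (
  user_id BIGINT,
  event   STRING
)
PARTITIONED BY (dt STRING);
```

Because the cluster was created with DLF Unified Metadata, objects created this way are registered in DLF and appear in the DLF console as well.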

Step 3: Initialize data

Choose the path that matches your data source:

  • Existing EMR cluster (HDFS data): Use Jindo DistCp to migrate data from the cluster to OSS.

  • Service systems (RDS, MySQL, or Apache Kafka): Use Realtime Compute for Apache Flink to stream data into DLF. See Manage DLF catalogs.
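For the HDFS path, a Jindo DistCp run looks roughly like the following. Treat this as a sketch only: the jar name, its location, and the supported options differ across EMR versions, so confirm them against the Jindo DistCp documentation before running:

```shell
# Illustrative only: the jar name/path and flags vary by EMR version.
hadoop jar jindo-distcp-tool-*.jar \
  --src /user/hive/warehouse/demo_db \
  --dest oss://<your-bucket>/warehouse/demo_db \
  --parallelism 10
```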

Step 4: Query data using Spark SQL or Presto

Connect to the master node of your EMR cluster over SSH. See Log on to a cluster for instructions.

Query with Spark SQL

  1. Start Spark SQL:

    spark-sql
  2. Run a query:

    SELECT * FROM <database>.<table>;

Query with Presto

DLF uses a three-level namespace: <catalog>.<database>.<table>. The catalog identifies the data source. To view available catalogs, run show catalogs; in Presto, or check the Configure tab of the Presto service page in the EMR console.

  1. Start Presto CLI, replacing master-1-1 with the hostname of your master node:

    presto --server master-1-1:8889
  2. Run a query:

    SELECT * FROM <catalog>.<database>.<table>;

    For example, to query the test table in the default database of Hive:

    SELECT * FROM hive.default.test;

(Optional) Step 5: Enable permission management

For data lakes with strict access control requirements, enable DLF permission management to enforce fine-grained permissions across all data in your EMR cluster. After enabling it, users must be granted explicit permissions before they can access any data.

  1. Enable DLF permission management for your EMR cluster. See DLF-Auth.

  2. Configure permissions for your data catalogs in DLF. See Configure permissions.

To grant permissions to users, see Data authorization. For an end-to-end walkthrough, see Use DLF and EMR to manage permissions.

(Optional) Step 6: Configure lifecycle management

Lifecycle management lets you define data retention rules for databases and tables in your data lake. DLF converts the OSS storage class of qualifying data based on three rule types:

  • Creation time: Based on partition and table creation time

  • Last modification time: Based on the last modification time of partitions and tables

  • Partition value: Based on the value of the partition key

This reduces long-term storage costs without manual intervention. For setup instructions, see Lifecycle management.
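The rule types above can be illustrated with a small, self-contained sketch. This is not DLF's implementation or API, just the selection logic of a hypothetical "last modification time" rule: partitions whose data has not been modified for longer than a threshold become candidates for a colder OSS storage class.

```python
from datetime import datetime, timedelta

def partitions_to_archive(partitions, threshold_days, now):
    """Hypothetical 'last modification time' rule: return the names of
    partitions whose last modification is older than the threshold."""
    cutoff = now - timedelta(days=threshold_days)
    return [name for name, mtime in partitions if mtime < cutoff]

partitions = [
    ("dt=2025-01-01", datetime(2025, 1, 1)),   # untouched for months
    ("dt=2026-03-20", datetime(2026, 3, 20)),  # recently written
]
print(partitions_to_archive(partitions, threshold_days=90,
                            now=datetime(2026, 3, 26)))
# → ['dt=2025-01-01']
```

In DLF itself you only declare the rule type and threshold; the scan and the OSS storage-class conversion then happen automatically.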

What's next