The E-MapReduce (EMR) and Data Lake Formation (DLF) combination gives you a centralized, fully managed metadata and permission layer for your data lake on Alibaba Cloud. With this solution, you can ingest data from multiple sources and query it across compute engines without managing a separate metadata store.
After completing this guide, you will have:
- A running EMR DataLake cluster backed by DLF Unified Metadata
- Initialized metadata and data in your data lake
- The ability to query data using Spark SQL or Presto
- (Optional) Fine-grained permission management and lifecycle rules in place
Before you begin: Steps 2 and 3 each have multiple paths depending on whether you have an existing EMR cluster or are starting fresh. Identify your starting point before proceeding.
Prerequisites
Before you begin, ensure that you have:
- An Alibaba Cloud account with EMR and DLF activated
- Object Storage Service (OSS) activated in your target region
- Sufficient permissions to create EMR clusters and DLF catalogs
For supported regions, see Supported regions and endpoints. For billing details, see Billing.
How it works
DLF provides a cross-engine, fully managed metadata service that replaces the per-cluster Hive metastore used in traditional EMR deployments. Key capabilities include:
| Capability | Description |
|---|---|
| Metadata management | Visualized management with multi-version history and rollback |
| Metadata migration | Migrate metadata from existing EMR clusters |
| Full-text search | Search across all metadata |
| Data profiling | File sizes, row counts, access frequency, small-file counts, file popularity, number of valid files, and more |
| Cross-engine support | Works with MaxCompute, Flink, and Hologres in addition to the open-source EMR stack |
| Permission management | Fine-grained controls across catalogs, databases, columns, and functions; integrations for Spark, Hive, Presto, and Impala |
| Lifecycle management | Automatically archives data based on file popularity and update time, reducing OSS storage costs |
| Storage optimization | Automatic optimization for the Delta Lake format to reduce storage costs |
Step 1: Create an EMR DataLake cluster
When creating the cluster, select DLF Unified Metadata for the Metadata parameter — this connects the cluster to DLF.
- Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
- On the EMR on ECS page, click Create Cluster. On the E-MapReduce on ECS page, configure the following parameters:

  | Parameter | Value |
  |---|---|
  | Business scenario | Data Lake |
  | Optional services (select at least one) | Hive (required). Add other services as needed. |
  | Metadata | DLF Unified Metadata |
  | DLF catalog | Use the default catalog or create one. If DLF is not yet activated, you are prompted to activate it first. |

- Complete the remaining steps as prompted. For details, see Create a cluster.
Step 2: Initialize metadata
Choose the path that matches your starting point:
- Existing EMR cluster with metadata in built-in MySQL or ApsaraDB RDS — Migrate metadata to DLF before continuing. See Migrate EMR metadata to DLF.
- New EMR cluster with no historical metadata — Create metadata using one of the following methods:
  - DLF console (recommended). Alternatively, create databases and tables using Hive or Spark SQL.
    1. Log on to the DLF console. In the top navigation bar, select the region where OSS is activated, such as China (Hangzhou).
    2. In the left-side navigation pane, choose Metadata > Metadata.
    3. On the Database tab, click Create Database.
    4. Configure the parameters and click OK.
  - Metadata discovery (if your data is already in OSS) — Use the metadata discovery feature to scan OSS and automatically register metadata in DLF. For a step-by-step example, see DLF data exploration - Taobao user behavior analysis.
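If you prefer SQL over the console, you can create the same metadata from the cluster with Hive or Spark SQL. A minimal sketch — the database name, table schema, and OSS bucket path below are illustrative assumptions, not values from this guide:

```sql
-- Create a database whose data is stored in OSS (bucket and path are hypothetical)
CREATE DATABASE IF NOT EXISTS sales_db
LOCATION 'oss://my-dlf-bucket/warehouse/sales_db';

-- Create a partitioned table in that database; with DLF Unified Metadata
-- enabled, the table definition is registered in DLF rather than a local metastore
CREATE TABLE IF NOT EXISTS sales_db.orders (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;
```

Because the cluster's Metadata parameter points at DLF, any database or table created this way also appears on the Metadata page of the DLF console.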
Step 3: Initialize data
Choose the path that matches your data source:
| Data source | Method |
|---|---|
| Existing EMR cluster (HDFS data) | Use Jindo DistCp to migrate data from the cluster to OSS. |
| Service systems (RDS, MySQL, or Apache Kafka) | Use Realtime Compute for Apache Flink to stream data into DLF. See Manage DLF catalogs. |
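As a sketch of the HDFS-to-OSS path, a Jindo DistCp run is typically submitted from the cluster as a Hadoop job. The jar file name and version, the source directory, and the OSS destination below are assumptions — check the JindoSDK version installed on your cluster for the exact tool name and options:

```shell
# Copy an HDFS directory to OSS (jar version, paths, and bucket are hypothetical)
hadoop jar jindo-distcp-tool-6.2.0.jar \
  --src /user/hive/warehouse/sales_db \
  --dest oss://my-dlf-bucket/warehouse/sales_db \
  --parallelism 10
```

Increasing `--parallelism` speeds up the copy at the cost of more concurrent load on the source cluster.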
Step 4: Query data using Spark SQL or Presto
Connect to the master node of your EMR cluster over SSH. See Log on to a cluster for instructions.
Query with Spark SQL
- Start Spark SQL:

  ```
  spark-sql
  ```

- Run a query:

  ```sql
  SELECT * FROM <database>.<table>;
  ```
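For example, to query a `test` table in the `default` database (table name is illustrative) — note that Spark SQL, unlike Presto, does not take a catalog prefix here:

```sql
SELECT * FROM default.test;
```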
Query with Presto
DLF uses a three-level namespace: `<catalog>.<database>.<table>`. The catalog identifies the data source. To view available catalogs, run `show catalogs;` in Presto, or check the Configure tab of the Presto service page in the EMR console.
- Start the Presto CLI, replacing `master-1-1` with the hostname of your master node:

  ```
  presto --server master-1-1:8889
  ```

- Run a query:

  ```sql
  SELECT * FROM <catalog>.<database>.<table>;
  ```

  For example, to query the `test` table in the `default` database of Hive:

  ```sql
  SELECT * FROM hive.default.test;
  ```
(Optional) Step 5: Enable permission management
For data lakes with strict access control requirements, enable DLF permission management to enforce fine-grained permissions across all data in your EMR cluster. After enabling it, users must be granted explicit permissions before they can access any data.
- Enable DLF permission management for your EMR cluster. See DLF-Auth.
- Configure permissions for your data catalogs in DLF. See Configure permissions.
To grant permissions to users, see Data authorization. For an end-to-end walkthrough, see Use DLF and EMR to manage permissions.
(Optional) Step 6: Configure lifecycle management
Lifecycle management lets you define data retention rules for databases and tables in your data lake. DLF converts the OSS storage class of qualifying data based on three rule types:
| Rule type | Description |
|---|---|
| Creation time | Based on partition and table creation time |
| Last modification time | Based on last modification time of partitions and tables |
| Partition value | Based on the value of the partition key |
This reduces long-term storage costs without manual intervention. For setup instructions, see Lifecycle management.
What's next
- Explore the metadata discovery feature to automatically register OSS data in DLF: DLF data exploration - Taobao user behavior analysis
- Learn how to manage permissions with DLF and EMR: Use DLF and EMR to manage permissions
- Review billing details for DLF: Billing