This topic describes how to configure Data Lake Formation (DLF) to extract metadata from data stored in Object Storage Service (OSS). You can then use a MaxCompute external schema to perform federated queries on the data in the data lake. This solution simplifies data analytics and processing and ensures data reliability and security.
Usage notes
Activated services
The OSS service is activated.
The DLF service is activated.
A MaxCompute project is created and schema support for project-level metadata is enabled.
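Schema support takes effect at the session level. As the SQL examples later in this topic show, set the following flag in each session before you reference schemas:
SET odps.namespace.schema=true;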
Supported regions
China (Hangzhou): cn-hangzhou
China (Shanghai): cn-shanghai
China (Beijing): cn-beijing
China (Zhangjiakou): cn-zhangjiakou
China (Shenzhen): cn-shenzhen
China (Hong Kong): cn-hongkong
Singapore: ap-southeast-1
Germany (Frankfurt): eu-central-1
MaxCompute, OSS, and DLF must be deployed in the same region.
Procedure
Step 1: Grant MaxCompute permissions to access DLF and OSS
The account associated with your MaxCompute project requires permissions to access the DLF and OSS services. You can use one of the following authorization methods:
One-click authorization: If you use the same account for both the MaxCompute project and the DLF service, you can click Authorize DLF.
Custom authorization: If the MaxCompute project and the DLF service belong to different accounts, use the custom authorization method.
Step 2: Prepare the OSS data
Log on to the OSS console and create a bucket. In this example, the bucket is named mc-lakehouse-dlf-oss. For more information, see Create a bucket.
This example uses a partitioned table. Create a root folder named mc_dlf_oss_pt/ in the bucket. Then, create four subdirectories in the root folder and upload one file to each:
direction=N/: vehicle1.csv
direction=NE/: vehicle2.csv
direction=S/: vehicle3.csv
direction=SW/: vehicle4.csv
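This topic does not list the raw file contents. As a reference point, the query output in Step 6 implies that vehicle2.csv in the direction=NE/ folder contains comma-separated rows like the following (reconstructed from that output; the partition value comes from the folder name, not from a column in the file):
1,2,13,1,46.81006,-92.08174,9/14/2014 0:00
1,3,48,1,46.81006,-92.08174,9/14/2014 0:00
1,9,4,1,46.81006,-92.08174,9/15/2014 0:00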
Step 3: Extract OSS metadata using DLF 1.0
Log on to the Data Lake Formation (DLF) console and select a region in the upper-left corner.
In the navigation pane on the left, choose Metadata.
On the Metadata page, click the Database tab.
In the default data catalog, click Create Database and configure the following parameters:
Catalog (Required): In this example, the data catalog is default.
Database Name (Required): Enter a custom database name. The name must start with a letter, be 1 to 128 characters in length, and can contain only letters, digits, and underscores (_). Example: dlf_oss_csv.
Database Description (Optional): Enter a custom description.
Select Path (Required): The storage location of the database. Example: oss://mc-lakehouse-dlf-oss/metadb_save.
In the navigation pane on the left, choose Metadata Discovery.
On the Metadata Discovery page, click Create Extraction Task. On the Create Extraction Task page, configure the following parameters:
Set Extraction Source:
Extraction Task Name (Required): Enter a custom name for the extraction task.
Select OSS Path (Required): DLF automatically creates tables and partitions based on the OSS directory that you specify. The directory structure is oss://<selected path>/<table name (optional)>/<partition name (optional)>/<file name>. For example, assume that metadata is extracted from the data.csv file. If the file is saved to the oss://my-bucket/my-path/my-table/dt=1/ directory and you select oss://my-bucket/my-path/ as the path, DLF uses my_table as the table name, treats dt as a partition column, and infers the table schema from data.csv (see the directory sketch after this list). Directory names that are mapped to table names can contain only letters, digits, and underscores (_), and must start with a letter.
Exclusion Mode (Optional): Enter a regular expression that matches the OSS directories to exclude.
Parse Format (Required): The service supports multiple formats, such as Automatic Recognition, json, csv, parquet, orc, hudi, delta, and avro.
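To make the mapping concrete, the following sketch shows how the example path from the Select OSS Path description is interpreted. my-bucket, my-path, my-table, and data.csv are the illustrative names used above:
oss://my-bucket/my-path/        <- the selected OSS path
└── my-table/                   <- mapped to the table name my_table
    └── dt=1/                   <- mapped to the partition dt=1
        └── data.csv            <- the table schema is inferred from this file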
Set Extraction Target:
Catalog (Required): Select the destination data catalog.
Destination Database (Required): Select an existing database. Example: dlf_oss_csv.
Destination Table Prefix (Optional): Enter a prefix for the destination tables. The prefix must be 1 to 128 characters in length, can contain only letters, digits, and underscores (_), and must start with a letter.
Method to Handle Table Field Update (Required): Select one of the following options:
- Add Columns and Retain Existing Columns
- Update Table Schema and Generate Table Results Based on the Last Detected Table Schema: If you select this option, the extraction result overwrites existing tables that have the same name. This may cause the original tables to be lost or become unavailable.
- Ignore Updates and Not Modify Table
Method to Process Deleted OSS Objects (Required): Select one of the following options:
- Delete Metadata
- Ignore Updates and Not Delete Tables
Set Extraction Task:
RAM Role (Required): Data Lake Formation assumes this role to access your resources. You can select the default system role AliyunDLFWorkFlowDefaultRole.
Execution Policy (Optional): The execution policy of the task. Valid values: Manual and Scheduling.
Extraction Policy (Required): Select one of the following options:
- Partial Data Extraction: Scans only part of each file to extract metadata. This shortens the task duration but may produce less accurate results than a full extraction. You can edit the metadata afterward to adjust it.
- Extract All: Scans all data files to extract metadata. This takes longer for large datasets but produces more accurate results.
Click Save and Execute Immediately.
View the extracted table
In the navigation pane on the left, choose Metadata.
On the Metadata page, click the Database tab.
Click the name of the target database to go to its details page. Click the Table List tab to view the extracted tables.
In this example, the extracted table is mc_dlf_oss_pt.
If you add data to a new partition, you must run the extraction task again.
Step 4: Create a DLF 1.0 and OSS external data source
Log on to the MaxCompute console and select a region in the upper-left corner.
In the navigation pane on the left, choose Foreign Server.
On the Foreign Server page, click Create Foreign Server.
In the Add Foreign Server dialog box, configure the parameters. The following tables describe the parameters.
Foreign Server Type (Required): Select DLF+OSS.
Foreign Server Name (Required): Enter a custom name. The name must start with a letter, can contain only lowercase letters, digits, and underscores (_), and cannot exceed 128 characters. Example: dlf_oss_csv_new.
Foreign Server Description (Optional): Enter a description as needed.
Region (Required): The current region is selected by default.
DLF Endpoint (Required): The DLF endpoint of the current region is used by default.
OSS Endpoint (Required): The OSS endpoint of the current region is used by default.
RoleARN (Required): The Alibaba Cloud Resource Name (ARN) of the RAM role. The role must have permissions to access both the DLF and OSS services. To find the ARN, log on to the Resource Access Management (RAM) console, choose Identities > Roles in the navigation pane on the left, click the name of the role, and view the ARN in the Basic Information section. Example: acs:ram::124****:role/aliyunodpsdefaultrole.
Foreign Server Supplemental Properties (Optional): Specify additional attributes for the external data source. Tasks that use this data source access the source system based on the specified attributes.
Note: More parameters will be supported as the product evolves. For details, see future updates in the official documentation.
Click OK to create the external data source.
On the Foreign Server page, find the target data source and click Details in the Actions column.
Step 5: Create an external schema
Connect to MaxCompute and run the following command:
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS <external_schema>
WITH <external_data_source>
ON '<dlf_data_catalogue>.<dlf_database>';
The parameters are described as follows:
external_schema: A custom name for the external schema. Example: es_dlf_oss_csv.
external_data_source: The name of the external data source that you created. The project that contains the external schema must be in the same region as the external data source. Example: dlf_oss_csv_new.
dlf_data_catalogue: The ID of the DLF data catalog. For more information, see Create a data catalog. Example: 122****.
dlf_database: The name of the database in the specified DLF data catalog. For more information, see Databases, tables, and functions. Example: dlf_oss_csv.
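For example, with the sample values used in this topic, the statement looks like the following sketch. Replace 122**** with your actual DLF catalog ID.
SET odps.namespace.schema=true;
-- Map the DLF database dlf_oss_csv in catalog 122**** to the external schema es_dlf_oss_csv
-- through the external data source dlf_oss_csv_new created in Step 4.
CREATE EXTERNAL SCHEMA IF NOT EXISTS es_dlf_oss_csv
WITH dlf_oss_csv_new
ON '122****.dlf_oss_csv';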
Step 6: Use SQL to access OSS data
Log on to the MaxCompute client and query the tables in the external schema.
SET odps.namespace.schema=true;
USE SCHEMA es_dlf_oss_csv;
SHOW TABLES IN es_dlf_oss_csv;
-- The following result is returned:
ALIYUN$xxx:mc_dlf_oss_pt
OK
Query the details of the mc_dlf_oss_pt table in the external schema.
SET odps.namespace.schema=true;
SELECT * FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt WHERE direction='NE';
-- The following result is returned:
+------------+------------+------------+------------+------------+------------+----------------+------------+
| _c0 | _c1 | _c2 | _c3 | _c4 | _c5 | _c6 | direction |
+------------+------------+------------+------------+------------+------------+----------------+------------+
| 1 | 2 | 13 | 1 | 46.81006 | -92.08174 | 9/14/2014 0:00 | NE |
| 1 | 3 | 48 | 1 | 46.81006 | -92.08174 | 9/14/2014 0:00 | NE |
| 1 | 9 | 4 | 1 | 46.81006 | -92.08174 | 9/15/2014 0:00 | NE |
+------------+------------+------------+------------+------------+------------+----------------+------------+
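Because direction is a partition column, a filter on it should prune the scan to a single OSS subdirectory. For example, the following query (a sketch; its output is not shown in this topic) reads only the direction=S/ folder that holds vehicle3.csv:
SET odps.namespace.schema=true;
-- Only the direction=S partition, that is, the direction=S/ OSS folder, is scanned.
SELECT * FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt WHERE direction='S';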
Step 7: Copy data from the federated external table to a data warehouse
Copy the data from the federated external table to the data warehouse.
CREATE TABLE vehicle_copy AS SELECT * FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt WHERE direction='NE';
Query the data from the table in the data warehouse.
SELECT * FROM vehicle_copy;
-- The following result is returned:
+------------+------------+------------+------------+------------+------------+----------------+------------+
| _c0        | _c1        | _c2        | _c3        | _c4        | _c5        | _c6            | direction  |
+------------+------------+------------+------------+------------+------------+----------------+------------+
| 1          | 2          | 13         | 1          | 46.81006   | -92.08174  | 9/14/2014 0:00 | NE         |
| 1          | 3          | 48         | 1          | 46.81006   | -92.08174  | 9/14/2014 0:00 | NE         |
| 1          | 9          | 4          | 1          | 46.81006   | -92.08174  | 9/15/2014 0:00 | NE         |
+------------+------------+------------+------------+------------+------------+----------------+------------+
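As a quick sanity check, you can compare the row counts of the internal copy and the source partition. Based on the sample output above, both of the following queries should return 3:
-- Count the rows copied into the warehouse table.
SELECT COUNT(*) FROM vehicle_copy;
-- Count the rows in the NE partition of the federated external table.
SELECT COUNT(*) FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt WHERE direction='NE';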