Query CSV files stored in Object Storage Service (OSS) directly from MaxCompute, without loading the data first. Data Lake Formation (DLF) extracts metadata from your OSS directories and maps it to a MaxCompute external schema, letting you run standard SQL queries against the data lake.
Prerequisites
Before you begin, ensure that you have:
An activated OSS service
An activated DLF service
A MaxCompute project with schema support for project-level metadata enabled
MaxCompute, OSS, and DLF deployed in the same region
Supported regions:
| Region | Region ID |
|---|---|
| China (Hangzhou) | cn-hangzhou |
| China (Shanghai) | cn-shanghai |
| China (Beijing) | cn-beijing |
| China (Zhangjiakou) | cn-zhangjiakou |
| China (Shenzhen) | cn-shenzhen |
| China (Hong Kong) | cn-hongkong |
| Singapore | ap-southeast-1 |
| Germany (Frankfurt) | eu-central-1 |
Overview
The setup follows four stages:
Authorize — grant MaxCompute permissions to access DLF and OSS.
Prepare data — upload CSV files to OSS in a partitioned directory structure.
Extract metadata — use DLF to scan OSS directories and create table definitions automatically.
Connect and query — create an external data source and external schema in MaxCompute, then run SQL against the OSS data.
Step 1: Grant MaxCompute access to DLF and OSS
The account associated with your MaxCompute project needs permissions to access both DLF and OSS. Two authorization methods are available:
One-click authorization — if you use the same Alibaba Cloud account for MaxCompute and DLF, click Authorize DLF to complete authorization in one step.
Custom authorization — if you use different accounts, follow the custom authorization guide.
Step 2: Prepare the OSS data
This example uses a partitioned table where the partition key is direction. Each partition contains a CSV file with vehicle tracking data.
Log on to the OSS console and create a bucket. This example uses a bucket named mc-lakehouse-dlf-oss.
Create a root folder named mc_dlf_oss_pt/ in the bucket, then create four subdirectories and upload the corresponding files:

| Subdirectory | File |
|---|---|
| direction=N/ | vehicle1.csv |
| direction=NE/ | vehicle2.csv |
| direction=S/ | vehicle3.csv |
| direction=SW/ | vehicle4.csv |

DLF uses the root folder name (mc_dlf_oss_pt) as the table name and each direction=<value> subdirectory as a partition. Directory and table names can contain only letters, digits, and underscores, and must start with a letter.
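Put together, the bucket layout that DLF will scan looks like this (the table name comes from the root folder, and each direction=<value> subdirectory becomes one partition):

```
oss://mc-lakehouse-dlf-oss/
└── mc_dlf_oss_pt/
    ├── direction=N/vehicle1.csv
    ├── direction=NE/vehicle2.csv
    ├── direction=S/vehicle3.csv
    └── direction=SW/vehicle4.csv
```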
Step 3: Extract OSS metadata using DLF 1.0
Create a DLF database
Log on to the Data Lake Formation (DLF) console and select your region.
In the navigation pane, choose Metadata > Metadata.
On the Metadata page, click the Database tab. In the default catalog, click Create Database and configure the following parameters:
| Parameter | Required | Description |
|---|---|---|
| Catalog | Yes | Select default. |
| Database Name | Yes | Enter a name that starts with a letter, is 1–128 characters long, and contains only letters, digits, and underscores. Example: dlf_oss_csv. |
| Database Description | No | Enter an optional description. |
| Select Path | Yes | The OSS path where database metadata is stored. Example: oss://mc-lakehouse-dlf-oss/metadb_save. |
Create a metadata extraction task
In the navigation pane, choose Metadata > Metadata Discovery.
Click Create Extraction Task and configure the task in three sections:
Set Extraction Source:
| Parameter | Required | Description |
|---|---|---|
| Extraction Task Name | Yes | Enter a custom name for the task. |
| Select OSS Path | Yes | The OSS path to scan. DLF creates tables and partitions based on the directory structure oss://<path>/<table-name>/<partition-name>/<file-name>. For this example, select oss://mc-lakehouse-dlf-oss/mc_dlf_oss_pt/. |
| Exclusion Mode | No | A regular expression matching OSS directories to exclude from scanning. |
| Parse Format | Yes | The file format. Options: Automatic Recognition, json, csv, parquet, orc, hudi, delta, avro. Select csv for this example. |

Set Extraction Target:

| Parameter | Required | Description |
|---|---|---|
| Catalog | Yes | Select the target data catalog. |
| Destination Database | Yes | Select the database created in the previous step. Example: dlf_oss_csv. |
| Destination Table Prefix | No | A prefix for extracted table names. Must be 1–128 characters and start with a letter. |
| Method to Handle Table Field Update | Yes | How to handle schema changes when the task is re-run: Add Columns and Retain Existing Columns (safe default); Update Table Schema and Generate Table Results Based on the Last Detected Table Schema (overwrites existing tables with the same name, which may cause the original tables to be lost or become unavailable); or Ignore Updates and Not Modify Table. |
| Method to Process Deleted OSS Objects | Yes | Delete Metadata removes table definitions when OSS objects are deleted. Ignore Updates and Not Delete Tables keeps existing metadata. |

Set Extraction Task:

| Parameter | Required | Description |
|---|---|---|
| RAM Role | Yes | The RAM role DLF assumes to access your OSS data. Select AliyunDLFWorkFlowDefaultRole to use the default system role. |
| Execution Policy | No | Manual (run on demand) or Scheduling (run on a schedule). |
| Extraction Policy | Yes | Partial Data Extraction scans part of each file: faster, but it may produce less accurate schemas. Extract All scans every file for maximum accuracy. |

Click Save and Execute Immediately.
View extracted tables
In the navigation pane, choose Metadata > Metadata.
Click the Database tab, then click the name of your database (dlf_oss_csv).
On the Table List tab, confirm that the extracted table mc_dlf_oss_pt appears.
If you add data to a new partition later, re-run the extraction task to update the metadata.
Step 4: Create a DLF 1.0 and OSS external data source
Log on to the MaxCompute console and select your region.
In the navigation pane, choose Manage Configurations > External Data Source.
Click Create External Data Source and configure the following parameters:
| Parameter | Required | Description |
|---|---|---|
| External Data Source Type | Yes | Select DLF+OSS. |
| External Data Source Name | Yes | A name that starts with a letter, contains only lowercase letters, digits, and underscores, and is no more than 128 characters. Example: dlf_oss_csv_new. |
| Description | No | An optional description. |
| Region | Yes | Defaults to the current region. |
| DLF Endpoint | Yes | Defaults to the DLF endpoint for the current region. |
| OSS Endpoint | Yes | Defaults to the OSS endpoint for the current region. |
| RoleARN | Yes | The Alibaba Cloud Resource Name (ARN) of the RAM role that has permissions to access both DLF and OSS. To find the ARN, go to the RAM console, choose Identities > Roles, and copy the ARN from the Basic Information section. Example: acs:ram::124****:role/aliyunodpsdefaultrole. |
| Foreign Server Supplemental Properties | No | Additional attributes for the external data source. Tasks that use this data source access the source system based on the defined behavior. More supported parameters will be documented as the product evolves. |

Click OK.
On the External Data Source page, find the new data source and click Details in the Actions column to verify the configuration.
Step 5: Create an external schema
Connect to MaxCompute and run the following SQL:
```sql
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS <external_schema>
WITH <external_data_source>
ON '<dlf_data_catalogue>.<dlf_database>';
```

Replace the placeholders with your values:
| Placeholder | Description | Example |
|---|---|---|
| <external_schema> | A name for the external schema. | es_dlf_oss_csv |
| <external_data_source> | The external data source created in Step 4. Must be in the same region as the MaxCompute project. | dlf_oss_csv_new |
| <dlf_data_catalogue> | The DLF data catalog ID. See Create a data catalog. | 122**** |
| <dlf_database> | The database name in the DLF data catalog. See Databases, tables, and functions. | dlf_oss_csv |
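With the example values from the table above filled in, the statement looks like this (122**** is the masked sample catalog ID from the table; substitute your actual catalog ID):

```sql
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS es_dlf_oss_csv
WITH dlf_oss_csv_new
ON '122****.dlf_oss_csv';
```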
Step 6: Query OSS data with SQL
List the tables available in the external schema:
```sql
SET odps.namespace.schema=true;
USE SCHEMA es_dlf_oss_csv;
SHOW TABLES IN es_dlf_oss_csv;
```

Expected output:

```
ALIYUN$xxx:mc_dlf_oss_pt
OK
```

Query data from a specific partition:

```sql
SET odps.namespace.schema=true;
SELECT * FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt WHERE direction='NE';
```

Expected output:

```
+------------+------------+------------+------------+------------+------------+----------------+------------+
| _c0        | _c1        | _c2        | _c3        | _c4        | _c5        | _c6            | direction  |
+------------+------------+------------+------------+------------+------------+----------------+------------+
| 1          | 2          | 13         | 1          | 46.81006   | -92.08174  | 9/14/2014 0:00 | NE         |
| 1          | 3          | 48         | 1          | 46.81006   | -92.08174  | 9/14/2014 0:00 | NE         |
| 1          | 9          | 4          | 1          | 46.81006   | -92.08174  | 9/15/2014 0:00 | NE         |
+------------+------------+------------+------------+------------+------------+----------------+------------+
```
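The external table also supports ordinary aggregate SQL. As a sketch, the following query counts the rows in each partition of the sample table (query shape only; the actual counts depend on your files):

```sql
SET odps.namespace.schema=true;
SELECT direction, COUNT(*) AS record_count
FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt
GROUP BY direction;
```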
Step 7: Copy data to a MaxCompute table
For better query performance on frequently accessed data, copy it to a MaxCompute internal table using CREATE TABLE AS SELECT (CTAS). Internal tables do not scan OSS files at query time.
Copy a partition to an internal table:
```sql
CREATE TABLE vehicle_copy AS
SELECT * FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt WHERE direction='NE';
```

Query the internal table to verify the data:

```sql
SELECT * FROM vehicle_copy;
```

Expected output:

```
+------------+------------+------------+------------+------------+------------+----------------+------------+
| _c0        | _c1        | _c2        | _c3        | _c4        | _c5        | _c6            | direction  |
+------------+------------+------------+------------+------------+------------+----------------+------------+
| 1          | 2          | 13         | 1          | 46.81006   | -92.08174  | 9/14/2014 0:00 | NE         |
| 1          | 3          | 48         | 1          | 46.81006   | -92.08174  | 9/14/2014 0:00 | NE         |
| 1          | 9          | 4          | 1          | 46.81006   | -92.08174  | 9/15/2014 0:00 | NE         |
+------------+------------+------------+------------+------------+------------+----------------+------------+
```
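Note that a table created with CREATE TABLE ... AS SELECT is not partitioned. If you want the internal copy to keep the direction partitioning, one approach is to create a partitioned table explicitly and load it with a dynamic-partition insert. This is a sketch only: the column names c0–c6 and their types are assumptions based on the sample output above, so adjust them to your actual schema.

```sql
SET odps.namespace.schema=true;

-- Partitioned internal table; column names and types are illustrative.
CREATE TABLE IF NOT EXISTS vehicle_copy_pt (
    c0 BIGINT, c1 BIGINT, c2 BIGINT, c3 BIGINT,
    c4 DOUBLE, c5 DOUBLE, c6 STRING
)
PARTITIONED BY (direction STRING);

-- Dynamic-partition insert: the last selected column (direction)
-- determines the target partition for each row.
INSERT OVERWRITE TABLE vehicle_copy_pt PARTITION (direction)
SELECT * FROM <project_name>.es_dlf_oss_csv.mc_dlf_oss_pt;
```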