This topic explains how to integrate Alibaba Cloud compute engines, including Serverless StarRocks, Serverless Spark, Realtime Compute for Apache Flink, and EMR on ECS Spark, with the Data Lake Formation (DLF) data permission system to enable end-to-end fine-grained access control.
Use cases
Select a compute engine to view the corresponding configuration guide:
| Scenario | Access protocols | Use cases |
| --- | --- | --- |
| Realtime Compute for Apache Flink | Paimon/Iceberg REST | Streaming data processing, real-time ETL |
| Serverless StarRocks | Paimon/Iceberg REST | Interactive analysis and ad-hoc queries |
| Serverless Spark | Paimon/Iceberg REST | Batch processing, ETL, and interactive analysis |
| Serverless Spark | PVFS | Unstructured data processing |
| EMR on ECS Spark | Paimon/Iceberg REST | Batch processing and ETL on self-managed clusters |
Prerequisites
Before you configure permissions, ensure the following conditions are met:
Data Lake Formation (DLF) has been enabled, and a DLF catalog has been created. For more information, see Authorize and enable DLF.
The required compute engine instances (such as a StarRocks instance, a Spark workspace, or a Flink workspace) have been created.
The compute engine and DLF are in the same region, and the compute engine's Virtual Private Cloud (VPC) has been added to the DLF trusted VPC list. For more information, see Configure a trusted VPC.
Access DLF from Realtime Compute for Flink
Use cases
Use Realtime Compute for Apache Flink and Flink SQL to access data in a Data Lake Formation (DLF) catalog using the Paimon REST or Iceberg REST protocol. This approach is suitable for scenarios such as stream processing, real-time ETL, and Flink CDC data lake ingestion.
Limitations
Paimon REST: Requires VVR engine 11.1 or later.
Iceberg REST: Requires VVR engine 11.2 or later.
Network: The Flink workspace and DLF must be in the same region, and the workspace VPC must be added to the DLF trusted VPC list.
Configure permissions
Realtime Compute for Apache Flink uses the AliyunStreamAsiDefaultRole service role to authenticate with DLF.
Step 1: Grant DLF permissions to Flink service role
Log on to the RAM console with your Alibaba Cloud account or as a RAM administrator.
In the navigation pane on the left, choose Identities > Roles, and then search for AliyunStreamAsiDefaultRole.
In the Actions column, click Add Permissions. Search for and select AliyunDLFFullAccess, and then click Confirm Add Permissions.
If you encounter the `ForbiddenException: You are not authorized to do this operation. Action: dlf:GetConfig` error when creating a catalog, the role lacks the required DLF permissions. Follow the preceding steps to grant them.
Step 2: Grant data permissions in DLF
Log on to the Data Lake Formation console.
Go to the target catalog, switch to the Permissions tab, and click Grant.
Select RAM User/RAM Role, and select AliyunStreamAsiDefaultRole from the drop-down list.
Grant the required data permissions, such as Data Reader, Data Editor, or custom permissions, based on your business needs.
By default, the Alibaba Cloud root account has all data permissions. For fine-grained access control, use a RAM user.
Step 3: Add Flink VPC to DLF trusted list
In the navigation pane on the left of the DLF console, go to the trusted VPC configuration page.
In the dialog box, select one or more VPCs to add to the trusted list, and then click OK.
For more information, see Configure a trusted VPC.
Step 4: Create a DLF catalog in Flink
Method 1: Create a catalog using the console
Log on to the Realtime Compute for Apache Flink console and go to the target workspace.
In the navigation pane on the left, click Data Management.
On the Catalogs page, click Create Catalog and select the catalog type.
Method 2: Create a catalog using SQL
In the text editor on the Data Exploration page, enter and run the following SQL statement:
Paimon REST
```sql
CREATE CATALOG `flink_catalog_name`
WITH (
  'type' = 'paimon',
  'metastore' = 'rest',
  'token.provider' = 'dlf',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com',
  'warehouse' = 'dlf_test'
);
```
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| type | The catalog type. Must be set to `paimon`. | Yes | paimon |
| metastore | The metastore type. Must be set to `rest`. | Yes | rest |
| token.provider | The token provider. Must be set to `dlf`. | Yes | dlf |
| uri | The URI of the DLF REST Catalog Server, in the format `http://<region-id>-vpc.dlf.aliyuncs.com`. | Yes | http://cn-hangzhou-vpc.dlf.aliyuncs.com |
| warehouse | The name of the DLF catalog. | Yes | dlf_test |
For more information, see Access DLF from Flink CDC and Access DLF from Flink DataStream.
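After the catalog is created, you can run a quick smoke test from the SQL editor. The following is a minimal sketch, not part of the official procedure: `demo_db` and `events` are placeholder names, and the statements assume the Flink service role holds at least Data Editor permissions in DLF.
```sql
-- Switch to the DLF-backed catalog created above.
USE CATALOG `flink_catalog_name`;
CREATE DATABASE IF NOT EXISTS `demo_db`;
-- Paimon tables declare a non-enforced primary key to support upserts.
CREATE TABLE IF NOT EXISTS `demo_db`.`events` (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
);
INSERT INTO `demo_db`.`events` VALUES (1, 'hello'), (2, 'world');
```
If any statement fails with a DLF permission error, recheck the grants from Step 2.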
Iceberg REST
```sql
CREATE CATALOG `flink_catalog_name`
WITH (
  'type' = 'iceberg',
  'catalog-type' = 'rest',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg',
  'warehouse' = 'iceberg_test',
  'rest.signing-region' = 'cn-hangzhou',
  'io-impl' = 'org.apache.iceberg.rest.DlfFileIO'
);
```
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| type | The catalog type. Must be set to `iceberg`. | Yes | iceberg |
| catalog-type | The catalog implementation type. Must be set to `rest`. | Yes | rest |
| uri | The URI of the DLF REST Catalog Server, in the format `http://<region-id>-vpc.dlf.aliyuncs.com/iceberg`. | Yes | http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg |
| warehouse | The name of the DLF catalog. | Yes | iceberg_test |
| rest.signing-region | The ID of the region where DLF is deployed. | Yes | cn-hangzhou |
| io-impl | The I/O implementation class. Must be set to `org.apache.iceberg.rest.DlfFileIO`. | Yes | org.apache.iceberg.rest.DlfFileIO |
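A read-side check works the same way for the Iceberg catalog. The following is a minimal sketch with placeholder database and table names, assuming the service role holds at least Data Reader permissions in DLF:
```sql
USE CATALOG `flink_catalog_name`;
SHOW DATABASES;
-- Run in batch execution mode to get a bounded result.
SELECT * FROM `demo_db`.`events` LIMIT 10;
```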
Access DLF from Serverless StarRocks
Use cases
Use Serverless StarRocks as a query engine to interactively analyze data in a DLF catalog.
Limitations
This feature requires Serverless StarRocks 3.3.8 or later.
The Serverless StarRocks service and DLF must be in the same VPC.
Permission configuration
To access DLF data, you need to configure permissions at the following three levels:
RAM permissions — Grant the RAM user permissions to access the EMR console and call DLF APIs.
DLF data permissions — Control which data in DLF the RAM user can access.
StarRocks user mapping — Create a user in StarRocks with the same name as the RAM user and grant the `USAGE` permission on the external catalog.
The core mechanism is RAM username mapping, which associates StarRocks users with DLF permissions.
Step 1: Grant console permissions
Log on to the RAM console with your Alibaba Cloud account or as a RAM administrator.
Create a new RAM user or select an existing one.
Note: The name of the RAM user must comply with StarRocks user naming conventions. It can contain only letters, digits, and underscores. Hyphens and periods are not allowed.
Attach the following permission policies to the RAM user:
EMR console permissions: Attach `AliyunEMRStarRocksFullAccess` (administrator) or `AliyunEMRStarRocksReadOnlyAccess` (read-only) to access the EMR StarRocks console.
DLF API permissions: Attach `AliyunDLFFullAccess` or a custom DLF permission policy to call DLF APIs.
For more information, see Grant permissions to a RAM user for EMR Serverless StarRocks.
Step 2: Grant data permissions in DLF
Log on to the Data Lake Formation console.
Go to the target catalog, switch to the Permissions tab, and click Grant Permissions.
Select the RAM user from the previous step and grant the required data permissions, such as Data Reader, Data Editor, or a custom permission.
Step 3: Create a mapped StarRocks user
Log on to the EMR Serverless StarRocks console and connect to your StarRocks instance.
In Security Center > User Management, click Add User and configure the following:
User Source: Select RAM User.
Username: Select the RAM user that you granted permissions to in Step 1.
Password: Create an 8- to 30-character password that contains uppercase and lowercase letters, digits, and special characters.
Role: Keep the default `public` role.
Grant the user the `USAGE` permission on the external catalog. On the User Management page, click Authorize in the Actions column of the target user. In the Add Permission panel, select External Catalog and grant the `USAGE` permission.
DLF is an External Catalog to StarRocks. Other permissions in the internal RBAC of StarRocks, such as SELECT and INSERT, apply only to internal tables and do not affect data access to DLF external tables. The read-write permissions for DLF external tables are controlled entirely by the DLF data permissions configured in Step 2.
For more information, see Manage users in Serverless StarRocks.
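If you prefer SQL over the console, the same `USAGE` grant can be issued from a SQL client once the external catalog from Step 4 exists. A minimal sketch; `paimon_catalog` and `dlf_user` are placeholder names:
```sql
-- Run as a privileged user after the external catalog is created (Step 4).
GRANT USAGE ON CATALOG paimon_catalog TO USER 'dlf_user'@'%';
-- Review the grants held by the mapped user.
SHOW GRANTS FOR 'dlf_user'@'%';
```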
Step 4: Create an external catalog
Log on to StarRocks Manager as the RAM user and run the following SQL statement to create an external catalog:
Paimon
```sql
CREATE EXTERNAL CATALOG `paimon_catalog`
PROPERTIES (
  'type' = 'paimon',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com',
  'paimon.catalog.type' = 'rest',
  'paimon.catalog.warehouse' = 'my_catalog',
  'token.provider' = 'dlf'
);
```
| Parameter | Description | Example |
| --- | --- | --- |
| type | The catalog type for StarRocks. The value is fixed to `paimon`. | paimon |
| uri | The URI for accessing the DLF REST catalog. For more information, see Paimon REST. | http://cn-hangzhou-vpc.dlf.aliyuncs.com |
| paimon.catalog.type | The Paimon catalog type. The value is fixed to `rest`. | rest |
| paimon.catalog.warehouse | The name of the catalog in DLF. | my_catalog |
| token.provider | The REST service provider. The value is fixed to `dlf`. | dlf |
Iceberg
```sql
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
  'type' = 'iceberg',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg',
  'iceberg.catalog.type' = 'dlf_rest',
  'warehouse' = 'iceberg_test',
  'rest.signing-region' = 'cn-hangzhou'
);
```
| Parameter | Description | Example |
| --- | --- | --- |
| type | The catalog type for StarRocks. The value is fixed to `iceberg`. | iceberg |
| uri | The URI for accessing the DLF REST catalog. For more information, see Iceberg REST. | http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg |
| iceberg.catalog.type | The catalog type for the DLF scenario. The value is fixed to `dlf_rest`. | dlf_rest |
| warehouse | The name of the DLF catalog. | iceberg_test |
| rest.signing-region | The region ID of the DLF service. | cn-hangzhou |
Iceberg REST currently supports only read-only queries. You cannot write data through StarRocks. For read and write capabilities, use Paimon REST.
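After the external catalog is created, you can switch to it and query DLF tables directly. The following is a minimal sketch with placeholder database and table names; whether each query succeeds is determined by the DLF data permissions granted in Step 2, not by StarRocks RBAC:
```sql
SET CATALOG paimon_catalog;
SHOW DATABASES;
-- Fully qualified names also work without switching catalogs.
SELECT * FROM paimon_catalog.demo_db.events LIMIT 10;
```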
Access DLF from Serverless Spark
Use cases
Use Serverless Spark for batch processing, ETL, or interactive analysis to access data in a DLF catalog through the Paimon/Iceberg REST protocol.
Usage limits
Paimon: Serverless Spark versions esr-4.3.0, esr-3.3.0, esr-2.7.0, and later.
Iceberg: Serverless Spark versions esr-4.7.0, esr-3.6.0, and later.
Permission configuration
To access data in DLF, configure permissions at two levels:
RAM permissions — Grant the RAM user permissions to access the EMR console and use DLF APIs.
DLF data permissions — Control which data in DLF the RAM user can access.
When Serverless Spark accesses DLF, it uses the execution role (AliyunEMRSparkJobRunDefaultRole) to impersonate the submitting RAM user for authorization by DLF.
Serverless Spark workspace roles (Guest, DataScience, DataEngineering, and Owner) control operations in the EMR console, such as creating workflows and managing queues, but they do not affect data access in DLF. For information about configuring workspace roles, see Manage Users and Roles.
Step 1: Grant console permissions to a RAM user
Log on to the RAM console by using your Alibaba Cloud account or as a RAM administrator.
Attach the following permission policies to the RAM user:
EMR console permissions: Attach `AliyunEMRServerlessSparkFullAccess` (for administrators), `AliyunEMRServerlessSparkDeveloperAccess` (for developers), or `AliyunEMRServerlessSparkReadOnlyAccess` (for read-only access).
DLF API permissions: Attach `AliyunDLFFullAccess` or a custom DLF permission policy.
For more information, see RAM User Authorization.
Step 2: Grant data permissions in DLF
Log on to the Data Lake Formation console.
Go to the target catalog and grant the required data permissions, such as Data Reader, Data Editor, or custom permissions, to the RAM user or the RAM role associated with the workspace.
Step 3: Bind a DLF catalog to a Spark workspace
Method 1: Bind during workspace creation
When you create a Serverless Spark workspace, enable Use DLF as metadata service and select the target DLF catalog.
Method 2: Bind to an existing workspace
Go to the Data Catalog page of your Serverless Spark workspace and add the target DLF catalog. For more information, see Manage Data Catalogs.
After you bind a DLF catalog, both Livy Gateway and Kyuubi Gateway natively support it as the default data catalog.
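To confirm the binding, run a quick check from a gateway session, for example through Kyuubi Gateway with Beeline. A minimal sketch with placeholder database and table names; because the bound DLF catalog is the default, no catalog prefix is needed:
```sql
SHOW DATABASES;
-- Succeeds only if the connecting identity holds the Select permission in DLF.
SELECT * FROM demo_db.events LIMIT 10;
```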
Multi-tenancy: Isolate user permissions with Kyuubi tokens
By default, all users in a Serverless Spark workspace share the same RAM identity to access DLF data. To implement multi-tenancy permission isolation, use the token mechanism of Kyuubi Gateway.
How it works
Token generation: A time-limited token is generated in Kyuubi Gateway for a specific RAM user and bound to their identity.
Client authentication: A client, such as Beeline, includes the token and RAM username in the JDBC connection string to connect to Kyuubi Gateway.
Identity proxy: After Kyuubi Gateway validates the token, the Spark engine impersonates that RAM user.
DLF authorization: The Spark engine sends requests to DLF as the impersonated RAM user. DLF then authorizes the request based on that user's permission policies.
Configuration steps
In DLF, grant the corresponding table-level or column-level data permissions to each RAM user who requires independent access.
On the Token Management page of Kyuubi Gateway, create a token for each RAM user.
From your client, use the corresponding token to connect to Kyuubi Gateway:
```shell
beeline -u "jdbc:hive2://<endpoint>:<port>/;transportMode=http;user=<RAM-username>;httpPath=cliservice/token/<Token>"
```
Verify permission isolation
Query an authorized table: `SELECT * FROM db.authorized_table LIMIT 10;` — Data is returned successfully.
Query an unauthorized table: `SELECT * FROM db.unauthorized_table LIMIT 10;` — A permission error is returned, such as `emr_test doesn't have privilege SELECT on TABLE`.
EMR Serverless Spark enables the Data Lake Formation (DLF) metadata cache by default. New permissions take 10 minutes to take effect. To apply the changes immediately, add `spark.sql.catalog.lakehouse.cache-enabled false` to your Spark configuration.
All RAM users who connect to Kyuubi Gateway must have the Describe permission on the `default` database. Otherwise, session initialization fails.
For more information, see Permission control for DLF data using Kyuubi tokens.
DLF permission control in Spark
When you access data in Data Lake Formation (DLF) from Serverless Spark, DLF checks permissions based on your RAM identity. DLF data permissions support the following operations:
| DLF data permission | Scope | Description |
| --- | --- | --- |
| Select | Table-level / Column-level | Controls query access to table data. Supports fine-grained control at the column level. |
| Update | Table-level | Controls write operations on table data, such as INSERT, INSERT OVERWRITE, and MERGE INTO. |
| Alter | Catalog / Database / Table-level | Controls modifications to metadata, such as CREATE TABLE and ALTER TABLE. |
| Drop | Catalog / Database / Table-level | Controls the deletion of resources. |
| Grant | Catalog / Database / Table-level | Controls whether permissions can be granted to other users. |
| ALL | Catalog / Database / Table-level | Includes all the permissions above. |
You can use the preset permission templates in DLF to quickly grant permissions:
Data Reader: Grants read-only permission for scenarios that only require data queries.
Data Editor: Grants read-write permission for ETL scenarios that require data reads and writes.
Access DLF files through PVFS
Use cases
Use Serverless Spark to directly access file data managed by Data Lake Formation (DLF) through the Paimon virtual file system (PVFS). This method is ideal for processing unstructured data.
Version requirements
This feature requires Serverless Spark versions esr-3.5.0, esr-2.9.0, esr-4.6.0, or later.
Procedure
The process is similar to the Paimon REST method. After you bind the DLF catalog, use the pvfs:// protocol to access files.
Steps 1 to 3
Follow the procedure from the previous scenario to configure RAM permissions, authorize DLF data access, and bind the catalog.
Step 4: Access files through PVFS
In a notebook or PySpark, use the pvfs:// path to access DLF-managed files:
```python
df = spark.read.option("delimiter", ",").option("header", True) \
    .csv("pvfs://<catalog_name>/default/object_table/employee.csv")
df.show(5)
```
Access DLF from EMR on ECS Spark
Use cases
Access a DLF catalog from an EMR on ECS cluster by using the Spark engine and the Paimon REST protocol.
Version requirements
You must use an EMR cluster of version 5.12.0 or later and select the Spark3 and Paimon components.
Procedure
In an EMR on ECS environment, the AliyunECSInstanceForEMRRole RAM role handles permissions.
Step 1: Grant RAM permissions to the ECS role
Log on to the RAM console.
Search for AliyunECSInstanceForEMRRole.
Attach the `AliyunDLFFullAccess` permission policy to the role.
Step 2: Grant DLF data permissions
Log on to the Data Lake Formation console.
On the target catalog's Permissions tab, click Authorize.
Select RAM User/RAM Role, and then select AliyunECSInstanceForEMRRole from the drop-down list.
Select a predefined permission type (Data Reader or Data Editor) or Custom Permissions.
If AliyunECSInstanceForEMRRole does not appear in the drop-down list, click Sync on the User Management page.
Step 3: Upgrade dependencies
Upgrade the Paimon dependency to version 1.1 or later.
Permission architecture
When a compute engine accesses data in DLF, requests sequentially pass through the following four layers of permission checks:
| Permission layer | Description | Configuration location |
| --- | --- | --- |
| ① RAM permission | Controls user access to the engine console and DLF APIs. | RAM console |
| ② Engine identity layer | Maps engine users to RAM identities for authorization in DLF. | Engine consoles |
| ③ DLF data permission | Enables fine-grained control over user access to catalogs, databases, tables, and columns. | DLF console |
| ④ Network security | Restricts network access to trusted VPCs. | DLF console |
The identity mapping method differs for each compute engine:
StarRocks: Maps identities by creating a user in StarRocks with the same name as the RAM user. Each user is authorized independently.
Spark: Uses the execution role (`AliyunEMRSparkJobRunDefaultRole`) to impersonate the RAM identity of the job submitter. In multi-tenant scenarios, Kyuubi tokens provide user-level isolation.
Flink: Uses the service role (`AliyunStreamAsiDefaultRole`) as a unified proxy.
DLF data permissions
DLF data permissions support the following operation types:
| Operation | Scope | Description |
| --- | --- | --- |
| Select | Table-level / column-level | Controls query access to table data, with support for fine-grained, column-level control. |
| Update | Table-level | Controls write operations on table data, such as INSERT, INSERT OVERWRITE, and MERGE INTO. |
| Alter | Catalog / database / table-level | Controls modifications to metadata, such as CREATE TABLE and ALTER TABLE. |
| Drop | Catalog / database / table-level | Controls the deletion of catalogs, databases, and tables. |
| Grant | Catalog / database / table-level | Controls the delegation of permissions to other users. |
| ALL | Catalog / database / table-level | Includes all permissions listed above. |
DLF provides the following predefined permission templates:
Data Reader: A read-only permission ideal for use cases that only require data query access.
Data Editor: A read-write permission ideal for ETL workloads.
Internal engine and DLF permissions
StarRocks internal RBAC permissions (such as the `db_admin` and `user_admin` roles) control operation permissions only on StarRocks internal tables. To StarRocks, DLF is an external catalog, and its data read-write permissions are controlled entirely by DLF data permissions. A user needs only the `USAGE` permission on the external catalog to use the DLF catalog.
Spark workspace roles, such as Guest, DataEngineering, and Owner, control EMR console operations like creating workflows and managing queues. These roles do not affect DLF data access.
Flink workspace permissions control Flink console operations, such as managing jobs and deployments. These permissions do not affect DLF data access. When Flink accesses DLF, its identity is determined by the service role.
Considerations and limitations
Column-level permission: Only internal Paimon tables support column-level permissions. The Paimon engine must be version 1.2 (1-ali-12.0) or later.
Permission intersection rule: If both a user and their assigned roles are granted the `Column Select` permission, the accessible columns are the intersection of the user's and the roles' permissions.
External catalog mapping: Creating or deleting an external catalog affects only the mapping relationship, not the actual data in DLF.
Separation of StarRocks and DLF permissions: Role-Based Access Control (RBAC) in StarRocks controls permissions for internal tables only. You can configure data access permissions for DLF external tables only in the Data Lake Formation console. To use a DLF catalog, StarRocks users need only the `USAGE` permission on an external catalog.
Troubleshoot permission errors
If you receive a permission error, identify its type from the error message:
| Error type | Example | Solution |
| --- | --- | --- |
| Missing API permission | | Contact your RAM administrator to grant the necessary DLF action permission to the user or role. |
| Missing DLF data permission | `emr_test doesn't have privilege SELECT on TABLE` | Contact a DLF administrator to grant the required data permissions in the DLF console. |
| Untrusted VPC | | In the DLF console, add the VPC to the trusted list. |
| Missing Flink role permissions | `ForbiddenException: You are not authorized to do this operation. Action: dlf:GetConfig` | Grant the `AliyunDLFFullAccess` policy to the `AliyunStreamAsiDefaultRole` role. |