This topic explains how to integrate Alibaba Cloud compute engines, including Serverless StarRocks, Serverless Spark, Realtime Compute for Apache Flink, and EMR on ECS Spark, with the Data Lake Formation (DLF) data permission system to enable end-to-end fine-grained access control.
Use cases
Select a compute engine to view the corresponding configuration guide:
| Scenario | Access protocols | Use cases |
| --- | --- | --- |
| Realtime Compute for Apache Flink | Paimon/Iceberg REST | Streaming data processing, real-time ETL |
| Serverless StarRocks | Paimon/Iceberg REST | Interactive analysis and ad-hoc queries |
| Serverless Spark | Paimon/Iceberg REST | Batch processing, ETL, and interactive analysis |
| Serverless Spark | PVFS | Unstructured data processing |
| EMR on ECS Spark | Paimon/Iceberg REST | Batch processing and ETL on self-managed clusters |
Prerequisites
Before you configure permissions, ensure the following conditions are met:
Data Lake Formation (DLF) has been enabled, and a DLF catalog has been created. For more information, see Authorize and enable DLF.
The required compute engine instances (such as a StarRocks instance, a Spark workspace, or a Flink workspace) have been created.
The compute engine and DLF are in the same region, and the compute engine's Virtual Private Cloud (VPC) has been added to the DLF trusted VPC list. For more information, see Configure a trusted VPC.
Access DLF from Realtime Compute for Flink
Use cases
Use Realtime Compute for Apache Flink and Flink SQL to access data in a Data Lake Formation (DLF) catalog using the Paimon REST or Iceberg REST protocol. This approach is suitable for scenarios such as stream processing, real-time ETL, and Flink CDC data lake ingestion.
Limitations
Paimon REST: Requires VVR engine 11.1 or later.
Iceberg REST: Requires VVR engine 11.2 or later.
Network: The Flink workspace and DLF must be in the same region, and the workspace VPC must be added to the DLF trusted VPC list.
Configure permissions
Realtime Compute for Apache Flink uses the AliyunStreamAsiDefaultRole service role to authenticate with DLF.
Step 1: Grant DLF permissions to Flink service role
Log on to the RAM console with your Alibaba Cloud account or as a RAM administrator.
In the navigation pane on the left, choose Identities > Roles, and then search for AliyunStreamAsiDefaultRole.
In the Actions column, click Add Permissions. Search for and select AliyunDLFFullAccess, and then click Confirm Add Permissions.
If you encounter the `ForbiddenException: You are not authorized to do this operation. Action: dlf:GetConfig` error when creating a catalog, the role lacks the required DLF permissions. Follow the preceding steps to grant them.
Step 2: Grant data permissions in DLF
Log on to the Data Lake Formation console.
Go to the target catalog, switch to the Permissions tab, and click Grant.
Select RAM User/RAM Role, and select AliyunStreamAsiDefaultRole from the drop-down list.
Grant the required data permissions, such as Data Reader, Data Editor, or custom permissions, based on your business needs.
By default, the Alibaba Cloud root account has all data permissions. For fine-grained access control, use a RAM user.
Step 3: Add Flink VPC to DLF trusted list
In the navigation pane on the left of the DLF console, go to the trusted VPC configuration page.
In the dialog box, select one or more VPCs to add to the trusted list, and then click OK.
For more information, see Configure a trusted VPC.
Step 4: Create a DLF catalog in Flink
Method 1: Create a catalog using the console
Log on to the Realtime Compute for Apache Flink console and go to the target workspace.
In the navigation pane on the left, click Data Management.
On the Catalogs page, click Create Catalog and select the catalog type.
Method 2: Create a catalog using SQL
In the text editor on the Data Exploration page, enter and run the following SQL statement:
Paimon REST
```sql
CREATE CATALOG `flink_catalog_name`
WITH (
  'type' = 'paimon',
  'metastore' = 'rest',
  'token.provider' = 'dlf',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com',
  'warehouse' = 'dlf_test'
);
```
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| type | The catalog type. Must be set to `paimon`. | Yes | paimon |
| metastore | The metastore type. Must be set to `rest`. | Yes | rest |
| token.provider | The token provider. Must be set to `dlf`. | Yes | dlf |
| uri | The URI of the DLF REST Catalog Server, in the format `http://<region-id>-vpc.dlf.aliyuncs.com`. | Yes | http://cn-hangzhou-vpc.dlf.aliyuncs.com |
| warehouse | The name of the DLF catalog. | Yes | dlf_test |
For more information, see Access DLF from Flink CDC and Access DLF from Flink DataStream.
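After the catalog is created, you can run a quick smoke test from the SQL editor. The following is a minimal sketch, not part of the official procedure: `demo_db` and `events` are placeholder names, and the statements assume the Flink service role holds at least Data Editor permissions in DLF.
```sql
-- Switch to the DLF-backed catalog created above.
USE CATALOG `flink_catalog_name`;
CREATE DATABASE IF NOT EXISTS `demo_db`;
-- Paimon tables declare a non-enforced primary key to support upserts.
CREATE TABLE IF NOT EXISTS `demo_db`.`events` (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
);
INSERT INTO `demo_db`.`events` VALUES (1, 'hello'), (2, 'world');
```
If any statement fails with a DLF permission error, recheck the grants from Step 2.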
Iceberg REST
```sql
CREATE CATALOG `flink_catalog_name`
WITH (
  'type' = 'iceberg',
  'catalog-type' = 'rest',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg',
  'warehouse' = 'iceberg_test',
  'rest.signing-region' = 'cn-hangzhou',
  'io-impl' = 'org.apache.iceberg.rest.DlfFileIO'
);
```
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| type | The catalog type. Must be set to `iceberg`. | Yes | iceberg |
| catalog-type | The catalog implementation type. Must be set to `rest`. | Yes | rest |
| uri | The URI of the DLF REST Catalog Server, in the format `http://<region-id>-vpc.dlf.aliyuncs.com/iceberg`. | Yes | http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg |
| warehouse | The name of the DLF catalog. | Yes | iceberg_test |
| rest.signing-region | The ID of the region where DLF is deployed. | Yes | cn-hangzhou |
| io-impl | The I/O implementation class. Must be set to `org.apache.iceberg.rest.DlfFileIO`. | Yes | org.apache.iceberg.rest.DlfFileIO |
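A read-side check works the same way for the Iceberg catalog. The following is a minimal sketch with placeholder database and table names, assuming the service role holds at least Data Reader permissions in DLF:
```sql
USE CATALOG `flink_catalog_name`;
SHOW DATABASES;
-- Run in batch execution mode to get a bounded result.
SELECT * FROM `demo_db`.`events` LIMIT 10;
```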
Access DLF from Serverless StarRocks
Use cases
Use Serverless StarRocks as a query engine to interactively analyze data in a DLF catalog.
Limitations
This feature requires Serverless StarRocks 3.3.8 or later.
The Serverless StarRocks service and DLF must be in the same VPC.
Permission configuration
To access DLF data, you need to configure permissions at the following three levels:
RAM permissions — Grant the RAM user permissions to access the EMR console and call DLF APIs.
DLF data permissions — Control which data in DLF the RAM user can access.
StarRocks user mapping — Create a user in StarRocks with the same name as the RAM user and grant the `USAGE` permission on the external catalog.
The core mechanism is RAM username mapping, which associates StarRocks users with DLF permissions.
Step 1: Grant console permissions
Log on to the RAM console with your Alibaba Cloud account or as a RAM administrator.
Create a new RAM user or select an existing one.
Note: The name of the RAM user must comply with StarRocks user naming conventions. It can contain only letters, digits, and underscores. Hyphens and periods are not allowed.
Attach the following permission policies to the RAM user:
EMR console permissions: Attach `AliyunEMRStarRocksFullAccess` (administrator) or `AliyunEMRStarRocksReadOnlyAccess` (read-only) to access the EMR StarRocks console.
DLF API permissions: Attach `AliyunDLFFullAccess` or a custom DLF permission policy to call DLF APIs.
For more information, see Grant permissions to a RAM user for EMR Serverless StarRocks.
Step 2: Grant data permissions in DLF
Log on to the Data Lake Formation console.
Go to the target catalog, switch to the Permissions tab, and click Grant Permissions.
Select the RAM user from the previous step and grant the required data permissions, such as Data Reader, Data Editor, or a custom permission.
Step 3: Create a mapped StarRocks user
Log on to the EMR Serverless StarRocks console and connect to your StarRocks instance.
In Security Center > User Management, click Add User and configure the following:
User Source: Select RAM User.
Username: Select the RAM user that you granted permissions to in Step 1.
Password: Create an 8- to 30-character password that contains uppercase and lowercase letters, digits, and special characters.
Role: Keep the default `public` role.
Grant the user the `USAGE` permission on the external catalog. On the User Management page, click Authorize in the Actions column of the target user. In the Add Permission panel, select External Catalog and grant the `USAGE` permission.
DLF is an External Catalog to StarRocks. Other permissions in the internal RBAC of StarRocks, such as SELECT and INSERT, apply only to internal tables and do not affect data access to DLF external tables. The read-write permissions for DLF external tables are controlled entirely by the DLF data permissions configured in Step 2.
For more information, see Manage users in Serverless StarRocks.
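If you prefer SQL over the console, the same `USAGE` grant can be issued from a SQL client once the external catalog from Step 4 exists. A minimal sketch; `paimon_catalog` and `dlf_user` are placeholder names:
```sql
-- Run as a privileged user after the external catalog is created (Step 4).
GRANT USAGE ON CATALOG paimon_catalog TO USER 'dlf_user'@'%';
-- Review the grants held by the mapped user.
SHOW GRANTS FOR 'dlf_user'@'%';
```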
Step 4: Create an external catalog
Log on to StarRocks Manager as the RAM user and run the following SQL statement to create an external catalog:
Paimon
```sql
CREATE EXTERNAL CATALOG `paimon_catalog`
PROPERTIES (
  'type' = 'paimon',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com',
  'paimon.catalog.type' = 'rest',
  'paimon.catalog.warehouse' = 'my_catalog',
  'token.provider' = 'dlf'
);
```
| Parameter | Description | Example |
| --- | --- | --- |
| type | The catalog type for StarRocks. The value is fixed to `paimon`. | paimon |
| uri | The URI for accessing the DLF REST catalog. For more information, see Paimon REST. | http://cn-hangzhou-vpc.dlf.aliyuncs.com |
| paimon.catalog.type | The Paimon catalog type. The value is fixed to `rest`. | rest |
| paimon.catalog.warehouse | The name of the catalog in DLF. | my_catalog |
| token.provider | The REST service provider. The value is fixed to `dlf`. | dlf |
Iceberg
```sql
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
  'type' = 'iceberg',
  'uri' = 'http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg',
  'iceberg.catalog.type' = 'dlf_rest',
  'warehouse' = 'iceberg_test',
  'rest.signing-region' = 'cn-hangzhou'
);
```
| Parameter | Description | Example |
| --- | --- | --- |
| type | The catalog type for StarRocks. The value is fixed to `iceberg`. | iceberg |
| uri | The URI for accessing the DLF REST catalog. For more information, see Iceberg REST. | http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg |
| iceberg.catalog.type | The catalog type for the DLF scenario. The value is fixed to `dlf_rest`. | dlf_rest |
| warehouse | The name of the DLF catalog. | iceberg_test |
| rest.signing-region | The region ID of the DLF service. | cn-hangzhou |
Iceberg REST currently supports only read-only queries. You cannot write data through StarRocks. For read and write capabilities, use Paimon REST.
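After the external catalog is created, you can switch to it and query DLF tables directly. The following is a minimal sketch with placeholder database and table names; whether each query succeeds is determined by the DLF data permissions granted in Step 2, not by StarRocks RBAC:
```sql
SET CATALOG paimon_catalog;
SHOW DATABASES;
-- Fully qualified names also work without switching catalogs.
SELECT * FROM paimon_catalog.demo_db.events LIMIT 10;
```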
Access DLF from Serverless Spark
Use cases
Use Serverless Spark for batch processing, ETL, or interactive analysis to access data in a DLF catalog through the Paimon/Iceberg REST protocol.
Usage limits
Paimon: Serverless Spark versions esr-4.3.0, esr-3.3.0, esr-2.7.0, and later.
Iceberg: Serverless Spark versions esr-4.7.0, esr-3.6.0, and later.
Permission configuration
To access data in DLF, configure permissions at two levels:
RAM permissions — Grant the RAM user permissions to access the EMR console and use DLF APIs.
DLF data permissions — Control which data in DLF the RAM user can access.
When Serverless Spark accesses DLF, it uses the execution role (AliyunEMRSparkJobRunDefaultRole) to impersonate the submitting RAM user for authorization by DLF.
Serverless Spark workspace roles (Guest, DataScience, DataEngineering, and Owner) control operations in the EMR console, such as creating workflows and managing queues, but they do not affect data access in DLF. For information about configuring workspace roles, see Manage Users and Roles.
Step 1: Grant console permissions to a RAM user
Log on to the RAM console by using your Alibaba Cloud account or as a RAM administrator.
Attach the following permission policies to the RAM user:
EMR console permissions: Attach `AliyunEMRServerlessSparkFullAccess` (for administrators), `AliyunEMRServerlessSparkDeveloperAccess` (for developers), or `AliyunEMRServerlessSparkReadOnlyAccess` (for read-only access).
DLF API permissions: Attach `AliyunDLFFullAccess` or a custom DLF permission policy.
For more information, see RAM User Authorization.
Step 2: Grant data permissions in DLF
Log on to the Data Lake Formation console.
Go to the target catalog and grant the required data permissions, such as Data Reader, Data Editor, or custom permissions, to the RAM user or the RAM role associated with the workspace.
Step 3: Bind a DLF catalog to a Spark workspace
Method 1: Bind during workspace creation
When you create a Serverless Spark workspace, enable Use DLF as metadata service and select the target DLF catalog.
Method 2: Bind to an existing workspace
Go to the Data Catalog page of your Serverless Spark workspace and add the target DLF catalog. For more information, see Manage Data Catalogs.
After you bind a DLF catalog, both Livy Gateway and Kyuubi Gateway natively support it as the default data catalog.
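To confirm the binding, run a quick check from a gateway session, for example through Kyuubi Gateway with Beeline. A minimal sketch with placeholder database and table names; because the bound DLF catalog is the default, no catalog prefix is needed:
```sql
SHOW DATABASES;
-- Succeeds only if the connecting identity holds the Select permission in DLF.
SELECT * FROM demo_db.events LIMIT 10;
```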
Multi-tenancy: Isolate user permissions with Kyuubi tokens
By default, all users in a Serverless Spark workspace share the same RAM identity to access DLF data. To implement multi-tenancy permission isolation, use the token mechanism of Kyuubi Gateway.
How it works
Token generation: A time-limited token is generated in Kyuubi Gateway for a specific RAM user and bound to their identity.
Client authentication: A client, such as Beeline, includes the token and RAM username in the JDBC connection string to connect to Kyuubi Gateway.
Identity proxy: After Kyuubi Gateway validates the token, the Spark engine impersonates that RAM user.
DLF authorization: The Spark engine sends requests to DLF as the impersonated RAM user. DLF then authorizes the request based on that user's permission policies.
Configuration steps
In DLF, grant the corresponding table-level or column-level data permissions to each RAM user who requires independent access.
On the Token Management page of Kyuubi Gateway, create a token for each RAM user.
From your client, use the corresponding token to connect to Kyuubi Gateway:
```shell
beeline -u "jdbc:hive2://<endpoint>:<port>/;transportMode=http;user=<RAM-username>;httpPath=cliservice/token/<Token>"
```
Verify permission isolation
Query an authorized table: `SELECT * FROM db.authorized_table LIMIT 10;` — Data is returned successfully.
Query an unauthorized table: `SELECT * FROM db.unauthorized_table LIMIT 10;` — A permission error is returned, such as `emr_test doesn't have privilege SELECT on TABLE`.
EMR Serverless Spark enables the Data Lake Formation (DLF) metadata cache by default. New permissions take 10 minutes to take effect. To apply the changes immediately, add `spark.sql.catalog.lakehouse.cache-enabled false` to your Spark configuration.
All RAM users who connect to Kyuubi Gateway must have the Describe permission on the `default` database. Otherwise, session initialization fails.
For more information, see Permission control for DLF data using Kyuubi tokens.
DLF permission control in Spark
When you access data in Data Lake Formation (DLF) from Serverless Spark, DLF checks permissions based on your RAM identity. DLF data permissions support the following operations:
| DLF data permission | Scope | Description |
| --- | --- | --- |
| Select | Table-level / Column-level | Controls query access to table data. Supports fine-grained control at the column level. |
| Update | Table-level | Controls write operations on table data, such as INSERT, INSERT OVERWRITE, and MERGE INTO. |
| Alter | Catalog / Database / Table-level | Controls modifications to metadata, such as CREATE TABLE and ALTER TABLE. |
| Drop | Catalog / Database / Table-level | Controls the deletion of resources. |
| Grant | Catalog / Database / Table-level | Controls whether permissions can be granted to other users. |
| ALL | Catalog / Database / Table-level | Includes all the permissions above. |
You can use the preset permission templates in DLF to quickly grant permissions:
Data Reader: Grants read-only permission for scenarios that only require data queries.
Data Editor: Grants read-write permission for ETL scenarios that require data reads and writes.
Access DLF files through PVFS
Use cases
Use Serverless Spark to directly access file data managed by Data Lake Formation (DLF) through the Paimon virtual file system (PVFS). This method is ideal for processing unstructured data.
Version requirements
This feature requires Serverless Spark versions esr-3.5.0, esr-2.9.0, esr-4.6.0, or later.
Procedure
The process is similar to the Paimon REST method. After you bind the DLF catalog, use the pvfs:// protocol to access files.
Steps 1 to 3
Follow the procedure from the previous scenario to configure RAM permissions, authorize DLF data access, and bind the catalog.
Step 4: Access files through PVFS
In a notebook or PySpark, use the pvfs:// path to access DLF-managed files:
```python
df = spark.read.option("delimiter", ",").option("header", True) \
    .csv("pvfs://<catalog_name>/default/object_table/employee.csv")
df.show(5)
```
Access DLF from EMR on ECS Spark
Use cases
Access a DLF catalog from an EMR on ECS cluster by using the Spark engine and the Paimon REST protocol.
Version requirements
You must use an EMR cluster of version 5.12.0 or later and select the Spark3 and Paimon components.
Procedure
In an EMR on ECS environment, the AliyunECSInstanceForEMRRole RAM role handles permissions.
Step 1: Grant RAM permissions to the ECS role
Log on to the RAM console.
Search for AliyunECSInstanceForEMRRole.
Attach the `AliyunDLFFullAccess` permission policy to the role.
Step 2: Grant DLF data permissions
Log on to the Data Lake Formation console.
On the target catalog's Permissions tab, click Authorize.
Select RAM User/RAM Role, and then select AliyunECSInstanceForEMRRole from the drop-down list.
Select a predefined permission type (Data Reader or Data Editor) or Custom Permissions.
If AliyunECSInstanceForEMRRole does not appear in the drop-down list, click Sync on the User Management page.
Step 3: Upgrade dependencies
Upgrade the Paimon dependency to version 1.1 or later.
Permission architecture
When a compute engine accesses data in DLF, requests sequentially pass through the following four layers of permission checks:
| Permission layer | Description | Configuration location |
| --- | --- | --- |
| ① RAM permission | Controls user access to the engine console and DLF APIs. | RAM console |
| ② Engine identity layer | Maps engine users to RAM identities for authorization in DLF. | Engine consoles |
| ③ DLF data permission | Enables fine-grained control over user access to catalogs, databases, tables, and columns. | DLF console |
| ④ Network security | Restricts network access to trusted VPCs. | DLF console |
The identity mapping method differs for each compute engine:
StarRocks: Maps identities by creating a user in StarRocks with the same name as the RAM user. Each user is authorized independently.
Spark: Uses the execution role (`AliyunEMRSparkJobRunDefaultRole`) to impersonate the RAM identity of the job submitter. In multi-tenant scenarios, Kyuubi tokens provide user-level isolation.
Flink: Uses the service role (`AliyunStreamAsiDefaultRole`) as a unified proxy.
DLF data permissions
DLF data permissions support the following operation types:
| Operation | Scope | Description |
| --- | --- | --- |
| Select | Table-level / column-level | Controls query access to table data, with support for fine-grained, column-level control. |
| Update | Table-level | Controls write operations on table data, such as INSERT, INSERT OVERWRITE, and MERGE INTO. |
| Alter | Catalog / database / table-level | Controls modifications to metadata, such as CREATE TABLE and ALTER TABLE. |
| Drop | Catalog / database / table-level | Controls the deletion of catalogs, databases, and tables. |
| Grant | Catalog / database / table-level | Controls the delegation of permissions to other users. |
| ALL | Catalog / database / table-level | Includes all permissions listed above. |
DLF provides the following predefined permission templates:
Data Reader: A read-only permission ideal for use cases that only require data query access.
Data Editor: A read-write permission ideal for ETL workloads.
Internal engine and DLF permissions
StarRocks internal RBAC permissions (such as the `db_admin` and `user_admin` roles) control operation permissions only on StarRocks internal tables. To StarRocks, DLF is an external catalog, and its data read-write permissions are controlled entirely by DLF data permissions. A user needs only the `USAGE` permission on the external catalog to use the DLF catalog.
Spark workspace roles, such as Guest, DataEngineering, and Owner, control EMR console operations like creating workflows and managing queues. These roles do not affect DLF data access.
Flink workspace permissions control Flink console operations, such as managing jobs and deployments. These permissions do not affect DLF data access. When Flink accesses DLF, its identity is determined by the service role.
Considerations and limitations
Column-level permission: Only internal Paimon tables support column-level permissions. The Paimon engine must be version 1.2 (1-ali-12.0) or later.
Permission intersection rule: If both a user and their assigned roles are granted the `Column Select` permission, the accessible columns are the intersection of the user's and the roles' permissions.
External catalog mapping: Creating or deleting an external catalog affects only the mapping relationship, not the actual data in DLF.
Separation of StarRocks and DLF permissions: Role-Based Access Control (RBAC) in StarRocks controls permissions for internal tables only. You can configure data access permissions for DLF external tables only in the Data Lake Formation console. To use a DLF catalog, StarRocks users need only the `USAGE` permission on an external catalog.
Troubleshoot permission errors
If you receive a permission error, identify its type from the error message:
| Error type | Example | Solution |
| --- | --- | --- |
| Missing API permission | | Contact your RAM administrator to grant the necessary DLF action permission to the user or role. |
| Missing DLF data permission | `emr_test doesn't have privilege SELECT on TABLE` | Contact a DLF administrator to grant the required data permissions in the DLF console. |
| Untrusted VPC | | In the DLF console, add the VPC to the trusted list. |
| Missing Flink role permissions | `ForbiddenException: You are not authorized to do this operation. Action: dlf:GetConfig` | Grant the `AliyunDLFFullAccess` policy to the `AliyunStreamAsiDefaultRole` role. |