This topic describes how to use the Data Lake Formation (DLF) and E-MapReduce (EMR) products to manage permissions on data lakes in specific business scenarios.
Background information
Metadata management and data permission control are foundational capabilities that DLF provides for building data lakes. By accessing lake data through the unified metadata view provided by DLF, you can keep metadata consistent across the data lake and simplify data sharing. Engines can obtain enterprise-level permission management capabilities by integrating with DLF permissions.
Concepts
EMR: Alibaba Cloud E-MapReduce product. For more information, see What is EMR on ECS?.
DLF Catalog: The DLF data catalog, the top-level entity in the DLF metadata architecture. It can contain metadata such as databases and data tables. For more information, see Data Catalog.
DLF Data Permissions: The data permission system provided by DLF for data lakes, supporting fine-grained permission control in four dimensions: databases, data tables, data columns, and functions. For more information, see Overview.
Business Scenarios
A company has an EMR cluster that runs multiple engines, such as Hive, Spark, Presto, and Impala, and wants to apply unified data permissions to different users in actual business scenarios. The main roles are as follows:
Super Administrator
Has all permissions on data lake data and the ability to assign permissions to others.
Business A Data Administrator
Has all data usage and access permissions related to db_a of business A and the ability to assign database permissions to others.
Business A Data Developer
Has all data usage and access permissions related to db_a of business A.
Business A Data Analyst
Has access permissions to some columns of some tables in db_a related to business A, such as access to col1 and col2 in table1.
Procedure
Create an EMR cluster and use DLF as metadata.
Log on to the E-MapReduce Console.
Create an E-MapReduce cluster with the following options:
Business Scenario: Select Data Lake.
Optional Services: Select at least the Hive and DLF-Auth components. You can select other components based on your business needs.
Metadata: Select DLF Unified Metadata.
DLF Catalog: Select the default DLF Catalog or create a new data catalog. This example uses catalog_test.
Continue with other configurations to complete the creation of the EMR cluster.
Note: If you already have an E-MapReduce cluster but have not installed the DLF-Auth component, you can add the DLF-Auth component to the cluster by using the Add Service feature and then use DLF data permissions.
If you already have an E-MapReduce cluster but the Hive metadata does not use DLF, migrate the metadata to DLF before you use DLF data permissions. For assistance, contact us in the DingTalk group 33719678.
Initialize the relevant databases and data tables.
Log on to the EMR cluster. For more information, see Log on to a cluster.
Connect to Hive SQL through Beeline.
beeline -u jdbc:hive2://<primary node name>:10000
Execute the following statements to create the databases and tables and to initialize test data.
--Create databases and tables
create database db_a;
create table db_a.table1(
  col1 string,
  col2 string,
  col3 string
);
create table db_a.table2(
  col1 string,
  col2 string,
  col3 string
);
create database db_b;
create table db_b.table1(
  col1 string,
  col2 string,
  col3 string
);

--Initialize test data
--db_a.table1
insert overwrite table db_a.table1 values('1','aliyun','emrA1'),('2','aliyun','dlfA1');
--db_a.table2
insert overwrite table db_a.table2 values('1','aliyun','emrA2'),('2','aliyun','dlfA2');
--db_b.table1
insert overwrite table db_b.table1 values('1','aliyun','emrB1'),('2','aliyun','dlfB1');
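To confirm that the databases and tables were created as expected, you can run a quick check in the same Beeline session (the object names are those defined above):
--Verify the objects created above
show databases;
show tables in db_a;
select * from db_a.table1;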
Create the RAM users that each role uses to log on. You can create them in the RAM console; a CLI sketch follows this list.
Create a new RAM user for Super Administrator: dlf_data_admin.
Create a new RAM user for Business A Data Administrator: dlf_dba_admin.
Create a new RAM user for Business A Data Developer: dlf_dba_dev.
Create a new RAM user for Business A Data Analyst: dlf_dba_analyst.
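As an alternative to the RAM console, you can create the four RAM users with the Alibaba Cloud CLI. The following is only a sketch and assumes that the CLI is installed and configured with an account that is allowed to manage RAM:
# Create the four RAM users used in this example
aliyun ram CreateUser --UserName dlf_data_admin --DisplayName dlf_data_admin
aliyun ram CreateUser --UserName dlf_dba_admin --DisplayName dlf_dba_admin
aliyun ram CreateUser --UserName dlf_dba_dev --DisplayName dlf_dba_dev
aliyun ram CreateUser --UserName dlf_dba_analyst --DisplayName dlf_dba_analyst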
Enable data permission control.
Complete the following two steps to enable data permission control for the EMR cluster:
Enable data permission control in the EMR cluster. For more information, see DLF-Auth.
Enable permission control for the Catalog in DLF. For more information, see Configure permissions.
Note: For production use, we recommend that you enable LDAP authentication so that user identities are properly verified. The following example only demonstrates permission control and does not enable LDAP authentication, so no password is required when you connect through Beeline.
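For reference, if LDAP authentication is enabled, the Beeline connection must also carry the LDAP password. The user name and password below are placeholders:
beeline -u jdbc:hive2://<primary node name>:10000 -n <ldap_user> -p <ldap_password>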
Grant the Super Administrator DLF console authorization permissions and access permissions for all data.
Log on to the Data Lake Formation Console.
In the left-side navigation pane, go to the role management page.
In the admin role, add the user dlf_data_admin. Then, dlf_data_admin has administrator permissions to manage all data in DLF and can configure data permissions for any user.
If dlf_data_admin needs to configure data permissions for RAM users in the DLF console, you must attach the following permission policies to dlf_data_admin in the RAM Console: AliyunDLFFullAccess and AliyunRAMReadOnlyAccess.
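If you prefer the command line, the same policies can be attached with the Alibaba Cloud CLI. This is a sketch that assumes the CLI is installed and configured with an account that can manage RAM:
# Attach the system policies to the Super Administrator RAM user
aliyun ram AttachPolicyToUser --PolicyType System --PolicyName AliyunDLFFullAccess --UserName dlf_data_admin
aliyun ram AttachPolicyToUser --PolicyType System --PolicyName AliyunRAMReadOnlyAccess --UserName dlf_data_admin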
Log on to the EMR cluster, connect to Hive by using dlf_data_admin, and execute HiveSQL.
beeline -u jdbc:hive2://<primary node name>:10000 -n dlf_data_admin
select * from db_a.table1;
select * from db_b.table1;
If the preceding SQL queries succeed, the user dlf_data_admin has access permissions on all databases and tables.
Grant the Business A Data Administrator DLF console authorization permissions for the db_a database and access permissions for the db_a database data.
Log on to the Alibaba Cloud Management Console as the dlf_data_admin user, and use this account to grant data permissions to other users.
Log on to the Data Lake Formation Console.
In the left-side navigation pane, choose Data Permission>Data Permissions, and click Add Permission.
Enter the following information:
Principal Type: RAM User/Role.
Choose Principal: dlf_dba_admin.
Resources: Resource Authorization.
Catalog List: catalog_test.
Database: db_a.
Permissions:
Database-Data Permission: ALL.
Database-Granted Permission: ALL.
All Objects in Database-Data Permission: ALL.
All Objects in Database-Granted Permission: ALL.
Click OK to save the authorization information.
If dlf_dba_admin needs to configure data permissions for RAM users in the DLF console, you must attach the following permission policies to dlf_dba_admin in the RAM Console: AliyunDLFFullAccess and AliyunRAMReadOnlyAccess.
Log on to the EMR cluster, connect to Hive by using dlf_dba_admin, and execute HiveSQL.
beeline -u jdbc:hive2://<primary node name>:10000 -n dlf_dba_admin
select * from db_a.table1;
select * from db_b.table1;
If the first SQL query succeeds, the user dlf_dba_admin has all permissions on the db_a database and all resources within it.
If the second SQL query fails, the user dlf_dba_admin does not have permissions on the db_b database or the resources within it.
Grant the Business A Data Developer query and modification permissions for the db_a database data.
Log on to the Alibaba Cloud Management Console as the dlf_dba_admin user, and use this account to grant data permissions to other users.
Log on to the Data Lake Formation Console.
In the left-side navigation pane, choose Data Permission>Data Permissions, and click Add Permission.
Enter the following information:
Principal Type: RAM User/Role.
Choose Principal: dlf_dba_dev.
Resources: Resource Authorization.
Catalog List: catalog_test.
Database: db_a.
Permissions:
Database-Data Permission: ALL.
Database-Granted Permission: ALL.
All Objects in Database-Data Permission: ALL.
All Objects in Database-Granted Permission: ALL.
Click OK to save the authorization information.
Log on to the EMR cluster, connect to Hive by using dlf_dba_dev, and execute HiveSQL.
beeline -u jdbc:hive2://<primary node name>:10000 -n dlf_dba_dev
select * from db_a.table1;
insert into table db_a.table1 values('3','aliyun','emrA1'),('4','aliyun','dlfA1');
select * from db_b.table1;
insert into table db_b.table1 values('3','aliyun','emrA1'),('4','aliyun','dlfA1');
If the first and second SQL statements succeed, the user dlf_dba_dev has query and modification permissions on the db_a database and all resources within it.
If the third and fourth SQL statements fail, the user dlf_dba_dev does not have query or modification permissions on the db_b database or the resources within it.
Grant the Business A Data Analyst access permissions for table1(col1, col2) in the db_a database.
Log on to the Alibaba Cloud Management Console as the dlf_dba_admin user, and use this account to grant data permissions to other users.
Log on to the Data Lake Formation Console.
In the left-side navigation pane, choose Data Permission>Data Permissions, and click Add Permission.
Enter the following information:
Principal Type: RAM User/Role.
Choose Principal: dlf_dba_analyst.
Resources: Resource Authorization.
Resource Type: Column.
Catalog List: catalog_test.
Database: db_a.
Table: table1.
Columns: col1, col2.
Permissions:
Data Column-Data Permission: ALL.
Click OK to save the authorization information.
Log on to the EMR cluster, connect to Hive by using dlf_dba_analyst, and execute HiveSQL.
beeline -u jdbc:hive2://<primary node name>:10000 -n dlf_dba_analyst
select * from db_a.table1;
select col1,col2 from db_a.table1;
insert into table db_a.table1 values('5','aliyun','emrA1'),('6','aliyun','dlfA1');
select * from db_b.table1;
If the second SQL query succeeds, the user dlf_dba_analyst has query permissions on db_a.table1(col1, col2).
If the first SQL query fails, the user dlf_dba_analyst does not have query permissions on db_a.table1(col3).
If the third SQL query fails, the user dlf_dba_analyst does not have modification permissions on db_a.table1 data.
If the fourth SQL query fails, the user dlf_dba_analyst does not have query permissions on the db_b database data.