In the big data field, Alibaba Cloud provides data security solutions, such as user authentication, data permission management, and big data job management, for enterprise users. This topic describes the data security solutions for DataWorks on EMR.

Background information

DataWorks on EMR supports Lightweight Directory Access Protocol (LDAP) authentication. OpenLDAP is integrated with the Hive, Spark ThriftServer, Kyuubi, Presto, and Impala services. Only authenticated users can use the services to query data.
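As an illustration, once LDAP authentication is enforced for HiveServer2, a client must supply LDAP credentials to connect. A minimal sketch with Beeline, in which the endpoint hostname, port, user, and password are all placeholders rather than values from this topic:

```shell
# Connect to HiveServer2 as an LDAP-authenticated user.
# emr-header-1, ldap_user, and ldap_password are placeholders for illustration only.
beeline -u "jdbc:hive2://emr-header-1:10000/default" \
        -n ldap_user \
        -p 'ldap_password' \
        -e "SHOW DATABASES;"
```

A connection attempt without valid LDAP credentials is rejected once authentication is enforced.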

Data security capability: Data permission management

You can use the open source Ranger component and the DLF-Auth component that is provided by Data Lake Formation (DLF) to manage permissions on E-MapReduce (EMR) data.
  • Ranger: You can start Ranger that is deployed in an EMR cluster to manage permissions on data of Hadoop Distributed File System (HDFS), YARN, Hive databases, and Hive tables.
  • DLF-Auth: You can start the DLF-Auth component that is deployed in an EMR cluster to manage permissions on databases, tables, columns, and functions. For more information, see DLF-Auth. You can perform authorization operations that are related to DLF-Auth in DataWorks Security Center. For more information, see Manage permissions on DLF.
Note If you use Object Storage Service (OSS) for storage, you can configure permissions to access OSS objects in the OSS console. DataWorks respects the data permission management settings that you configure for Ranger, DLF, and OSS.
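To illustrate what Ranger-based permission management looks like programmatically, the sketch below builds a policy payload in the shape used by the Ranger public v2 REST API (`POST /service/public/v2/api/policy`), granting one user SELECT access on one Hive table. The service name, database, table, and user are hypothetical, and submitting the payload to a Ranger admin server is left out.

```python
import json

def build_hive_policy(service, database, table, user, accesses=("select",)):
    """Build a Ranger-style policy payload that grants the given access
    types on one Hive table to one user. The structure is sketched from
    the Ranger public v2 REST API; treat field names as an assumption."""
    return {
        "service": service,                    # name of the Ranger Hive service instance
        "name": f"{database}.{table}-{user}",  # human-readable policy name
        "resources": {
            "database": {"values": [database], "isExcludes": False},
            "table": {"values": [table], "isExcludes": False},
            "column": {"values": ["*"], "isExcludes": False},
        },
        "policyItems": [
            {
                "users": [user],
                "accesses": [{"type": a, "isAllowed": True} for a in accesses],
            }
        ],
    }

if __name__ == "__main__":
    # sales_db, orders, and ldap_user are placeholder names.
    policy = build_hive_policy("emr-hive", "sales_db", "orders", "ldap_user")
    print(json.dumps(policy, indent=2))
```

In a real deployment the payload would be POSTed to the Ranger admin server with basic authentication, after which the Hive plugin enforces the policy on queries.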

Data security capability: Node management

DataWorks provides big data development and O&M capabilities and allows you to manage big data computing nodes in modules, such as Workspace Management and Security Center.
  • Workspace Management: DataWorks allows you to use workspaces to manage members that are added to a workspace and configure the visibility and maintainability settings of big data nodes. For more information about workspaces, see Overview.
  • Security Center: DataWorks allows you to configure access permissions on DLF tables in Security Center. For more information, see Manage permissions on DLF.
  • Association with a compute engine: When you associate an EMR cluster with a workspace, you can specify the identity that is used to access the EMR cluster and run EMR nodes in the production environment. You can specify a node owner, an Alibaba Cloud account, or a RAM user. DataWorks allows you to configure mappings between workspace members and the accounts of the EMR cluster that is associated with the workspace. The EMR cluster account that is mapped to the identity you specified is used to run EMR nodes in the EMR cluster.

Data security practices: Implement complete data permission management

In many big data environments, multiple users share the same Hadoop account to develop and run nodes. In this case, users and data permissions cannot be effectively managed, and enhancing data security management without affecting big data business becomes a major challenge. The following example shows how to implement complete data permission management by using a combination of services, such as LDAP+Ranger or LDAP+DLF-Auth. In this example, LDAP and DLF-Auth are used.

  1. Select the OpenLDAP service, start the service, and then add a user account for OpenLDAP.
  2. Select a service, such as Hive, and enable the OpenLDAP service for Hive. Check whether you can use an LDAP account to log on to the service and run jobs as expected.
  3. Go to the Workspace Management page, choose Opensource Cluster Management > EMR Config, and then configure mappings between Alibaba Cloud accounts and LDAP accounts.
  4. Modify the configurations of the EMR cluster. In the Modify EMR Cluster dialog box, select Security mode for the Access Mode parameter and configure the Cluster Access Identity parameter. Commit nodes in DataWorks and make sure that the nodes can be run as expected.
  5. Go to DataWorks Security Center to configure permissions on DLF. Make sure that the account used to run nodes is granted the required data permissions. Otherwise, the nodes fail due to insufficient permissions.
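To make step 1 in the procedure above concrete: adding a user account to OpenLDAP is typically done with an LDIF file and the ldapadd tool. The base DN, admin DN, and user attributes below are placeholders, not values taken from an EMR cluster:

```shell
# user.ldif -- a hypothetical OpenLDAP account entry; all DNs and
# attribute values are placeholders for illustration only.
cat > user.ldif <<'EOF'
dn: uid=ldap_user,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
uid: ldap_user
cn: ldap_user
sn: user
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/ldap_user
userPassword: {SSHA}change_me
EOF

# Load the entry into OpenLDAP; the admin DN and its password
# (prompted by -W) are placeholders as well.
ldapadd -x -D "cn=admin,dc=example,dc=com" -W -f user.ldif
```

After the entry is loaded, the same uid and password can be used in step 2 to verify that LDAP-authenticated logons to the service work as expected.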