This topic describes how to create an E-MapReduce (EMR) Spark SQL node. EMR Spark SQL nodes let you process structured data with a distributed SQL query engine, which improves the efficiency of jobs.
Prerequisites
- An Alibaba Cloud EMR cluster is created, and an inbound rule with the following settings is added to the security group to which the cluster belongs:
  - Action: Allow
  - Protocol type: Custom TCP
  - Port range: 8898/8898
  - Authorization object: 100.104.0.0/16
- An EMR compute engine instance is associated with your workspace. The EMR folder is displayed only after you associate an EMR compute engine instance with the workspace on the Workspace Management page. For more information, see Configure a workspace.
- If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR Hive nodes in DataWorks. Otherwise, the error message "Cannot modify spark.yarn.queue at runtime" or "Cannot modify SKYNET_BIZDATE at runtime" is returned when you run EMR Hive nodes.
  - You can modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. In this example, the custom parameter for Hive components is used. The following code provides an example:
    hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
    Note In the preceding code, ALISA.* and SKYNET.* are specific to DataWorks.
  - After the whitelist configurations are modified, you must restart the Hive service to make the configurations take effect. For more information, see Restart a service.
- An exclusive resource group for scheduling is created, and the resource group is associated with the virtual private cloud (VPC) where the EMR cluster resides. For more information, see Create and use an exclusive resource group for scheduling.
Note You can use only exclusive resource groups for scheduling to run EMR Hive nodes.
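
The whitelist value in the Ranger prerequisite is a list of patterns separated by vertical bars (|). The following Python sketch illustrates why appending spark.* and SKYNET.* lets the two parameters named in the error messages through. It assumes that Hive treats the appended value as a regular expression that must match the entire parameter name. The function name is_whitelisted and the sample parameter hive.exec.parallel are hypothetical and used only for illustration. The sketch does not call Hive or EMR.

```python
import re

# Value appended to the Hive whitelist, copied from the prerequisite above.
WHITELIST_APPEND = r"tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*"


def is_whitelisted(param_name: str, whitelist: str = WHITELIST_APPEND) -> bool:
    """Return True if the whole parameter name matches the whitelist pattern.

    Assumption: whole-name regex matching approximates how Hive evaluates
    the appended whitelist. This helper is illustrative, not a Hive API.
    """
    return re.fullmatch(whitelist, param_name) is not None


if __name__ == "__main__":
    # The first two names come from the error messages in the prerequisite.
    # The third is a hypothetical name that the appended patterns do not cover.
    for name in ("spark.yarn.queue", "SKYNET_BIZDATE", "hive.exec.parallel"):
        status = "allowed" if is_whitelisted(name) else "not covered"
        print(f"{name}: {status}")
```

A parameter that the appended patterns do not cover may still be permitted by the default Hive whitelist, which this sketch does not model.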