Quickly create a DataLake cluster - E-MapReduce - Alibaba Cloud Documentation Center

This topic describes how to log on to the new E-MapReduce (EMR) console by using your Alibaba Cloud account, create a DataLake cluster, and then create and run a job in the cluster.

Prerequisites

An Alibaba Cloud account is created, and real-name verification is complete.
The permissions of the default EMR and Elastic Compute Service (ECS) roles are granted to the EMR service. For more information, see Assign roles to an Alibaba Cloud account.

Precautions

The runtime environment of the code is managed and configured by the owner of the environment.

Procedure

Step 1: Create a cluster
Create a DataLake cluster in the new EMR console.
Step 2: Create and run a job
Create and run a Spark job in the DataLake cluster.
Step 3: View the details of the job
View the details of the job on the YARN web UI.
(Optional) Step 4: Release the cluster
If you no longer need to use the cluster, release the cluster to reduce costs.

Step 1: Create a cluster

Go to the cluster creation page.
1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.
  - You cannot change the region of a cluster after the cluster is created.
  - By default, all resource groups in your account are displayed.
3. On the EMR on ECS page, click Create Cluster.

On the page that appears, configure the parameters. The following table describes the parameters.

Step	Parameter	Example	Description
Software Configuration	Region	China (Hangzhou)	The geographic location where the ECS instances of the cluster reside. Important You cannot change the region after the cluster is created. Select a region based on your business requirements.
	Business Scenario	Data Lake	The business scenario of the cluster. Select a business scenario based on your business requirements. Alibaba Cloud EMR automatically configures the components, services, and resources to simplify cluster configuration and provide a cluster environment that meets the requirements of a specific business scenario.
	Product Version	EMR-5.14.0	The version of EMR. Select the latest version.
	High Service Availability	Off	Specifies whether to enable high availability for the EMR cluster. If you turn on the High Service Availability switch, EMR distributes master nodes across different underlying hardware devices to reduce the risk of failures. By default, the High Service Availability switch is turned off.
	Optional Services (Select One At Least)	Hadoop-Common, OSS-HDFS, YARN, Hive, Spark3, Tez, Knox, and OpenLDAP	The services that can be selected for the cluster. You can select services based on your business requirements. The processes that are related to the selected services are automatically started. Note In addition to the default services of the cluster, you must also select the Knox and OpenLDAP services.
	Collect Service Operational Logs	On	Specifies whether to enable log collection for all services. By default, this switch is turned on to collect the service operational logs of your cluster. The logs are used only for cluster diagnostics. After you create a cluster, you can modify the Collection Status of Service Operational Logs parameter on the Basic Information tab. Important If you turn off this switch, the EMR cluster health check and service-related technical support are limited. For more information about how to disable log collection and the impacts imposed by disabling of log collection, see How do I stop collection of service operational logs?
	Metadata	DLF Unified Metadata	If you select DLF Unified Metadata for Metadata, metadata is stored in Data Lake Formation (DLF). The system selects the default DLF catalog for you to store metadata in DLF. If you want different clusters to be associated with different DLF catalogs, you can click Create Catalog to create DLF catalogs based on your business requirements. Note To configure this parameter, make sure that Alibaba Cloud DLF is activated.
	Root Storage Directory of Cluster	1366993922******	The root storage directory of cluster data. This parameter is required only if you select the OSS-HDFS service. Note Before you use the OSS-HDFS service, make sure that the OSS-HDFS service is available in the region in which you want to create a cluster. If the OSS-HDFS service is unavailable in the region, you can change the region or use HDFS instead of OSS-HDFS. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions. You can select the OSS-HDFS service when you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1, EMR V3.46.1, or a minor version later than EMR V5.12.1 or EMR V3.46.1.
Hardware Configuration	Billing Method	Pay-as-you-go	The billing method of the cluster. If you want to perform a test, we recommend that you use the pay-as-you-go billing method. After the test is complete, you can release the cluster and create a subscription cluster in the production environment.
	Zone	Zone I	The zone where the cluster resides. You cannot change the zone after the cluster is created. Select a zone based on your business requirements.
	VPC	vpc_Hangzhou/vpc-bp1f4epmkvncimpgs****	The virtual private cloud (VPC) in which the cluster is deployed. Select a VPC in the current region. If no VPC is available, click Create VPC to create a VPC. After the VPC is created, click the Refresh icon and select the created VPC.
	vSwitch	vsw_i/vsw-bp1e2f5fhaplp0g6p****	The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create a vSwitch.
	Default Security Group	sg_seurity/sg-bp1ddw7sm2risw****	Important Do not use an advanced security group that is created in the ECS console. The security group to which you want to add the cluster. If you have created security groups in EMR, you can select a security group based on your business requirements. You can also create a security group.
	Node Group	Turn on the Assign Public Network IP switch for the master node and use default settings of other parameters	The instances in the cluster. Configure the master node, core nodes, and task nodes based on your business requirements. For more information, see Select configurations.
Basic Configuration	Cluster Name	Emr-DataLake	The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
	Identity Credentials	Password	The identity credentials that you want to use to remotely access the master node of the cluster.
	Password and Confirm Password	Custom password	The password that you want to use to access the cluster. Record this password for subsequent operations.

Click Next: Confirm. In the Confirm step, read the terms of service, select the check box, and then click Confirm.
The cluster is successfully created if the cluster is in the Running state. For more information about cluster parameters, see Create a cluster.

Step 2: Create and run a job

After the cluster is created, you can create and run a job in the cluster.

Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
Run a command in the CLI to submit and run a job.
In this example, Spark 3.1.1 is used and the following command is used to submit and run a job:
```
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 512m --num-executors 1 --executor-memory 1g --executor-cores 2 /opt/apps/SPARK3/spark-current/examples/jars/spark-examples_2.12-3.1.1.jar 10
```
Note
spark-examples_2.12-3.1.1.jar is the JAR package in your cluster. You can log on to the cluster and obtain the package in the /opt/apps/SPARK3/spark-current/examples/jars directory.

Step 3: View the details of the job

View the details of the job on the YARN web UI.

Enable port 8443. For more information, see Manage security groups.
Add a user. For more information, see Manage user accounts.
To access the YARN web UI by using your Knox account, you must obtain the username and password of the Knox account.
On the EMR on ECS page, find your cluster and click Services in the Actions column.
On the page that appears, click Access Links and Ports.
On the Access Links and Ports tab, click the link to the right of Internet in the Knox Proxy Address column for YARN UI.
Use the added user for logon authentication and access the YARN web UI.
On the All Applications page, click the ID of the job to view the details of the job.

Step 4: (Optional) Release the cluster

If you no longer need to use the cluster, you can release it to reduce costs. After you confirm the release of a cluster, the system performs the following operations on the cluster:

Forcibly terminate all jobs in the cluster.
Terminate and release all ECS instances that are created for the cluster.

The time required to release a cluster is based on the size of the cluster. Most clusters can be released in seconds. It does not require more than 5 minutes to release a large cluster.

Important

A pay-as-you-go cluster can be released at any time. A subscription cluster can be released only after the cluster expires.
Before you release a cluster, make sure that the cluster is in the Initializing, Running, or Idle state.

On the EMR on ECS page, find your cluster, move the pointer over the icon, and then select Release.
You can also release the cluster by performing the following operations: Click the name of the cluster. In the upper-right corner of the Basic Information tab, choose All Operations > Release.
In the Release Cluster message, click OK.

References

For more information about how to log on to a cluster, see Log on to a cluster.
For more information about the paths of files that are frequently used in EMR, see Paths of frequently used files.
For more information about deployment sets, see Add nodes to a deployment set.
For more information about tags, see Manage and use tags.
For more information about resource groups, see Use resource groups.
For more information about security groups, see Manage security groups.
For more information about the API operations that are available for cluster management and cluster service management, see List of operations by function.

FAQ

Learn FAQ about Alibaba Cloud EMR. For more information, see FAQ.