An Amazon EMR compute source provides computing resources for processing compute tasks in Dataphin projects. If the compute engine of the Dataphin system is set to Amazon EMR, projects can use features such as compute tasks, ad hoc queries, and general scripts only after an Amazon EMR compute source is added to the project. This topic describes how to create an Amazon EMR compute source.
Prerequisites
The compute engine of Dataphin is set to Amazon EMR. For more information, see Initialize the metadata warehouse by using Amazon EMR as the metadata warehouse engine.
An Amazon EMR cluster is created. For more information, see Create and manage an Amazon EMR cluster.
Procedure
In the top navigation bar of the Dataphin homepage, choose Planning > Compute Source.
On the Compute Source page, click New Compute Source and select Amazon EMR Compute Source.
In the Create Amazon EMR Compute Source dialog box, configure the required parameters.
Parameter
Description
Basic Information
Compute Type
Select Amazon EMR.
Compute Source Name
Supports Chinese characters, letters, digits, underscores (_), and hyphens (-). The name cannot exceed 64 characters in length.
Configuration Method
Only Reference Specified Cluster is currently supported. Enter keywords to search for a cluster. After you select a cluster, click View to open the View Amazon EMR Cluster page and check the cluster information.
Description (optional)
Enter a brief description of the compute source. The description cannot exceed 128 characters in length.
Compute Configuration
Primary Node Public DNS
The system automatically obtains this value from the selected Amazon EMR cluster. It cannot be modified.
Database
Enter the database name of the Amazon EMR compute engine.
Spark SQL
You can select Enable or Disable. The default value is Enable.
Note: This parameter can be configured only when Spark SQL is enabled on the referenced cluster.
Spark Local Client
You can select Enable or Disable. The default value is Enable.
Note: This parameter can be configured only when both Spark SQL and Spark Local Client are enabled on the referenced cluster.
Default Queue For Production Tasks (optional)
Enter a YARN resource queue. Manual and scheduled tasks in the production environment will use this queue.
Queue For Other Tasks (optional)
Enter a YARN resource queue. Other tasks (such as ad hoc queries, data previews, and JDBC Driver access) will use this queue.
Queue For Priority Tasks
You can select Use Default Queue For Production Tasks or Custom.
If you select Custom, you can specify a YARN resource queue for each priority level.
Note: When Dataphin schedules Hive SQL tasks, it sends tasks to the corresponding queues based on task priorities. When the execution engine of Hive is set to Tez or Spark, you must configure different priority queues for the task priority settings to take effect.
Click Submit.
After you create an Amazon EMR compute source, you can attach it to a project. For more information, see Create a general project.
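The parameter rules in the procedure can be checked before you fill in the dialog box. The following Python sketch is illustrative only and is not part of Dataphin: the class name EmrComputeSourceDraft, its fields, and the priority labels such as "high" are hypothetical. It also assumes that "letters" means ASCII letters, that "Chinese characters" means the basic CJK Unified Ideographs range, and that the name and database must not be empty. It validates the name and description limits and shows how a task might be mapped to a YARN queue based on the three queue parameters.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed character set: basic CJK ideographs, ASCII letters, digits, "_" and "-".
# The 64-character upper bound comes from the Compute Source Name rule.
NAME_PATTERN = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_-]{1,64}")
MAX_DESCRIPTION_LENGTH = 128


@dataclass
class EmrComputeSourceDraft:
    """Hypothetical holder for the values entered in the dialog box."""
    name: str
    database: str
    description: str = ""
    production_queue: Optional[str] = None   # Default Queue For Production Tasks
    other_queue: Optional[str] = None        # Queue For Other Tasks
    # Filled only when Queue For Priority Tasks is set to Custom.
    priority_queues: Dict[str, str] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """Return a list of rule violations; an empty list means none found."""
        found = []
        if not NAME_PATTERN.fullmatch(self.name):
            found.append("name: invalid characters or length not in 1-64")
        if len(self.description) > MAX_DESCRIPTION_LENGTH:
            found.append("description: longer than 128 characters")
        if not self.database:
            found.append("database: must not be empty (assumed)")
        return found

    def queue_for(self, production: bool,
                  priority: Optional[str] = None) -> Optional[str]:
        """Pick the queue a task would use; None means no queue was entered."""
        if not production:
            # Ad hoc queries, data previews, and JDBC Driver access.
            return self.other_queue
        # A custom per-priority queue wins; otherwise use the production default.
        return self.priority_queues.get(priority, self.production_queue)


draft = EmrComputeSourceDraft(
    name="emr_源-01",
    database="dw",
    production_queue="prod",
    other_queue="adhoc",
    priority_queues={"high": "prod_high"},  # hypothetical priority label
)
```

With this draft, draft.problems() returns an empty list, a production task with the hypothetical priority "high" maps to prod_high, and a production task with any other priority falls back to prod. The fallback mirrors the Use Default Queue For Production Tasks option; how Dataphin resolves queues internally may differ.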