Data Studio lets you visually create and manage E-MapReduce (EMR) JAR and File resources. You can use these resources to create user-defined functions or use them directly in Data Studio. This topic describes how to create and use these resources and functions.
Prerequisites
You have attached an EMR compute resource or an EMR Serverless Spark compute resource. EMR resources and functions are created based on these compute resources.
You have developed the required resource files. You can upload files from your local computer or retrieve them from Object Storage Service (OSS). If you create a resource by uploading a file from OSS, the following conditions must be met:
OSS is activated, an OSS bucket is created, and the file that you want to upload is stored in the OSS bucket. Because you must select a file from a specified bucket, you must first create a bucket and upload the related files before you create the resource.
The Alibaba Cloud account that you use to upload the file has the permissions to access and write data to the destination bucket. To prevent permission issues, grant the required permissions to the account before you upload the file.
Go to Resource Management
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and open Data Studio from the Actions column.
In the navigation pane on the left, click the Resource Management icon to go to the Resource Management page.
On the page, click the icon to create a new resource or function. Alternatively, you can first click Create Directory to organize your resources. Then, right-click the target folder and choose the type of resource or function to create.
Create and use resources
Resource description
In the Resource Management section of Data Studio, you can create the resources shown in the following table. You can store the created resources in OSS or Hadoop Distributed File System (HDFS). You can then use them in Data Studio or to create user-defined functions.
When you upload EMR resources to OSS or use EMR resources that are stored in OSS, basic fees for OSS apply.
Resource type | Description | Supported upload method: Local | Supported upload method: OSS
EMR File | Upload any type of file as a File resource. Actual usage depends on whether the compute engine supports the file type. | |
EMR Jar | A compiled Java JAR package used to run Java programs. The file name extension is | |
Limits
Uploaded resources must meet the following limits:
Resource size:
Resource publishing: If you use a standard mode workspace, you must publish the resource to the production environment for it to take effect.
Note: The data source information differs between the development environment and the production environment. Before you query tables or resources in an environment, confirm that the data source information is correct for that environment.
Resource management: In DataWorks, you can view and manage only the resources that you upload using the DataWorks interface.
Create a resource
You can upload EMR resources from your local computer or from OSS. You can reference the created resource directly in Data Studio or use it to create a function.
On the Resource Management page, in the Create Resource or Function dialog box that appears, configure the Type, Path, and Name of the resource.
After you create the resource, upload a local file or an OSS object to serve as the file source. The following table describes the key parameters for uploading a resource.
Configuration item | Description
File source | The source of the object file. Valid values: Local and OSS.
File content | If you select Local, click Click to upload in the Upload File section to upload a local file. If you select OSS, select an OSS file from the Select File drop-down list.
Storage path | The path where the resource is stored. Valid values: OSS and HDFS. If you select OSS, grant permissions and then select a directory. Note: You must use your Alibaba Cloud account to perform authorization. If you select HDFS, you must manually enter the storage path. Example: /user/admin/[specific path]. Note: A JAR package can be stored either on the master node of the EMR cluster or in Object Storage Service (OSS). We recommend that you use OSS for storage. For more information, see Store JAR packages in OSS.
Data source | The data source to which the uploaded EMR resource belongs.
Resource group | Select a Serverless resource group that can connect to the EMR data source.
In the toolbar, click Save and then Publish. Only published resources can be used in Data Studio.
Note: When you use a Serverless resource group to submit a resource, DataWorks sends a task to the DPI engine to create the resource and generates execution logs. You can use these logs to troubleshoot issues that occur during submission. If you do not have an available Serverless resource group, add a Serverless resource group.
Use a resource
After you create a resource, go to Resource Management in the navigation pane on the left. Find the target resource, right-click it, and select Reference Resource. When the resource is referenced, code in the ##@resource_reference{"Resource Name"} format is added to the node.
For example, the code for an EMR MR node is ##@resource_reference{"example.jar"}. The code format varies depending on the node type.
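The reference line follows a fixed pattern, so it can be generated or detected with a few lines of code. The following Python sketch is illustrative only: the helper names resource_reference and referenced_resources are hypothetical and are not part of DataWorks, which inserts the line for you when you select Reference Resource.

```python
import re

# Pattern of the reference line described above: ##@resource_reference{"Resource Name"}
_REFERENCE_PATTERN = re.compile(r'##@resource_reference\{"([^"]+)"\}')


def resource_reference(resource_name: str) -> str:
    """Render the reference line for a resource name (hypothetical helper)."""
    return f'##@resource_reference{{"{resource_name}"}}'


def referenced_resources(node_code: str) -> list:
    """Return the resource names referenced in a node's code (hypothetical helper)."""
    return _REFERENCE_PATTERN.findall(node_code)


line = resource_reference("example.jar")
print(line)
print(referenced_resources(line + "\n-- rest of the node code"))
```

A check like this might be useful in a review script that lists which resources a node depends on before the node is published.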
You can also create a function from the resource and then use the function in a development node.
Create and use functions
Before you create a function, make sure that you have created a resource.
Function description
In the Resource Management section of Data Studio, you can register a resource as an EMR function. In Data Studio or in SQL queries, you can use the built-in functions provided by Hive in addition to the user-defined functions that you create.
Create a function
On the Resource Management page, in the Create Resource or Function dialog box that appears, configure the Type, Path, and Name of the function.
Click Confirm to create the function resource. Then, configure the parameters for the function based on its type.
Before you configure an EMR function, make sure that the EMR cluster is registered as a compute resource in DataWorks and that you have created an EMR resource. The following table describes the key parameters for an EMR function.
Parameter | Description
Function Type | The type of the function. Valid values: MATH (mathematical operation), AGGREGATE (aggregate function), STRING (string), DATE (date), ANALYTIC (window), and OTHER (other).
Data Source | The data source where you want to register the EMR function.
EMR Database | The EMR database where you want to register the function.
Resource Group | Select a Serverless resource group that can connect to the EMR data source.
Class Name | The class name of the user-defined function (UDF). The class name must be in the Resource name.Class name format and must be the same as the class name in the JAR package. If the resource type is JAR, set the Class Name parameter to a value in the Java package name.Actual class name format. You can use the Copy Reference command in IntelliJ IDEA to obtain the class name. For example, if the Java package name is com.aliyun.emr.examples.udf and the actual class name is UDAFExample, set the Class Name parameter to com.aliyun.emr.examples.udf.UDAFExample.
Resource List | This parameter is required. Select a resource that is added to the current workspace from the drop-down list.
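Because the Class Name value must match a class that actually exists in the JAR package, it can help to check the package locally before you upload it. The following Python sketch relies on the fact that a JAR file is a ZIP archive in which a class such as com.aliyun.emr.examples.udf.UDAFExample is stored as com/aliyun/emr/examples/udf/UDAFExample.class. The helper names class_entry_path and jar_contains_class are hypothetical, and the in-memory archive is a stand-in that contains only an empty placeholder entry, not a real compiled class.

```python
import io
import zipfile


def class_entry_path(class_name: str) -> str:
    """Map a fully qualified Java class name to its entry path inside a JAR."""
    return class_name.replace(".", "/") + ".class"


def jar_contains_class(jar, class_name: str) -> bool:
    """Return True if the JAR (a path or a file-like object) has an entry for class_name."""
    with zipfile.ZipFile(jar) as archive:
        return class_entry_path(class_name) in archive.namelist()


# Build a tiny in-memory archive that stands in for a real JAR (placeholder entry only).
demo_jar = io.BytesIO()
with zipfile.ZipFile(demo_jar, "w") as archive:
    archive.writestr("com/aliyun/emr/examples/udf/UDAFExample.class", b"")
demo_jar.seek(0)

print(jar_contains_class(demo_jar, "com.aliyun.emr.examples.udf.UDAFExample"))
```

With a real package, you would pass the local path of the JAR file instead of the in-memory object.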
In the toolbar, click Save and then Publish. Only published functions can be used in Data Studio.
Use a function
After a function is created and published, you can reference it directly in Data Studio or in an SQL query.
When you edit a data development node, click Resource Management in the navigation pane on the left. Then, find the target resource or function, right-click it, and select Insert Function.
After the function is successfully referenced, the name of the user-defined function is automatically inserted into the node editing page, for example, example_function().
When you edit an SQL query, you can directly use the created function.
SELECT example_function(column_name) FROM table;
Manage resources and functions
After you upload a resource or create a function in Data Studio, you can go to the Resource Management page and select the target resource or function to manage it.
View historical versions: Click the version icon on the right side of the resource or function editing page to view and compare saved or submitted versions. This lets you see the changes between different versions.
Note: You must select at least two versions to compare.
Delete a resource or function: Right-click the target resource or function and select Delete.
To delete the resource or function from the production environment, you must publish the deletion to the production environment. After the task is published, the resource or function is deleted from the production environment.