Data Studio lets you upload and manage E-MapReduce (EMR) JAR and File resources, then register them as user-defined functions (UDFs) for use in data development nodes and SQL queries.
Prerequisites
Before you begin, ensure that you have:
An EMR compute resource or an EMR Serverless Spark compute resource attached to your DataWorks workspace
The resource files to upload, available on your local computer or in an Object Storage Service (OSS) bucket
If you upload files from OSS, the following conditions must also be met:
OSS is activated, a bucket is created, and the files are stored in that bucket. Create a bucket before creating the resource, then upload the files to the bucket.
The Alibaba Cloud account used to upload the files has read and write access to the bucket. Grant the required permissions before uploading.
Go to Resource Management
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a region. Find your workspace and choose Shortcuts > Data Studio in the Actions column.
In the left navigation pane, click the Resource Management icon to open the Resource Management page.
Click the create icon to create a resource or function. To organize resources first, click Create Directory, then right-click a folder and choose the resource or function type to create.
Resource types
Data Studio supports the following EMR resource types:
| Resource type | Description | Supported upload methods |
|---|---|---|
| EMR File | Any file type uploaded as a File resource. Whether the file can be used depends on the compute engine. | Local, OSS |
| EMR Jar | A compiled Java JAR package for running Java programs. The file must have a .jar extension. | Local, OSS |
Resources can be stored in OSS or in Hadoop Distributed File System (HDFS).
Using EMR resources stored in or uploaded to OSS incurs standard OSS fees.
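The .jar requirement for EMR Jar resources can be checked locally before you upload. The following is a minimal sketch, not part of DataWorks: the function name check_emr_jar is hypothetical, the case-insensitive extension match is an assumption, and the archive check relies only on the fact that a JAR is a ZIP archive.

```python
import zipfile
from pathlib import Path


def check_emr_jar(path):
    """Return a list of problems found with a file intended as an EMR Jar resource."""
    p = Path(path)
    problems = []
    # The resource type table requires a .jar extension.
    # Matching case-insensitively is an assumption of this sketch.
    if p.suffix.lower() != ".jar":
        problems.append("file name must end with .jar")
    if not p.is_file():
        problems.append("file does not exist")
    elif not zipfile.is_zipfile(str(p)):
        # A JAR is a ZIP archive, so a non-ZIP file cannot be a valid JAR.
        problems.append("file is not a valid ZIP/JAR archive")
    return problems
```

An empty list means the file passed both checks. It does not guarantee that the compute engine can run the package.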
Limits
Publishing to production: In a standard mode workspace, a resource takes effect only after you publish it to the production environment. Data source configurations differ between the development and production environments, so confirm which configuration applies before you run queries.
Resource visibility: DataWorks only shows resources that were uploaded through the DataWorks interface.
Create a resource
On the Resource Management page, click the create icon to open the Create Resource or Function dialog box. Set the Type, Path, and Name of the resource.
After creating the resource, configure the upload source and storage. The key parameters are described below.
| Parameter | Description |
|---|---|
| File source | The source of the file. Valid values: Local and OSS. |
| File content | If Local is selected, click Click to upload in the Upload File section. If OSS is selected, choose a file from the Select File drop-down list. |
| Storage path | Where the resource is stored. Valid values: OSS and HDFS. For OSS, grant permissions and then select a directory (authorization requires an Alibaba Cloud account). For HDFS, manually enter the path, for example, /user/admin/[specific path]. JAR packages can be stored on the EMR cluster's master node or in OSS. OSS is recommended; see Store JAR packages in OSS. |
| Data source | The data source the uploaded EMR resource belongs to. |
| Resource group | A Serverless resource group that can connect to the EMR data source. If no Serverless resource group is available, add one. |
In the toolbar, click Save and then Publish. Only published resources can be used in Data Studio.
Note: When a Serverless resource group submits a resource, DataWorks sends a task to the DPI engine and generates execution logs. Use these logs to troubleshoot any submission issues.
Use a resource
After creating a resource, reference it in a data development node:
In the left navigation pane, click Resource Management.
Find the target resource, right-click it, and select Reference Resource.
The following code is added to the node:
##@resource_reference{"Resource Name"}
For example, referencing example.jar from an EMR MR node adds ##@resource_reference{"example.jar"}. The exact format varies by node type.
Alternatively, register the resource as a UDF and use it as a function.
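To audit which resources a node script depends on, you can scan its code for reference lines. The sketch below is an illustration, not a DataWorks API: the function name referenced_resources is hypothetical, and the pattern assumes the ##@resource_reference{"..."} form shown for the EMR MR node. Other node types may use a different format.

```python
import re

# Matches lines such as: ##@resource_reference{"example.jar"}
# Assumption: the reference uses the EMR MR node format shown above.
_REFERENCE = re.compile(r'^\s*##@resource_reference\{"([^"]+)"\}', re.MULTILINE)


def referenced_resources(node_code):
    """Return the resource names referenced in a node script, in order of appearance."""
    return _REFERENCE.findall(node_code)
```

Calling referenced_resources on the text of a node returns a list of resource names, which you can compare against the resources published in the workspace.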
Create a function
Before creating a function, create the resource it will be based on.
Function description
In the Resource Management section of Data Studio, you can register a resource as an EMR function. In Data Studio or in SQL queries, you can use the built-in functions provided by Hive in addition to the user-defined functions that you create.
Function types
When registering a function, select one of the following types:
| Type | Description |
|---|---|
| MATH | Mathematical operations |
| AGGREGATE | Aggregate functions |
| STRING | String operations |
| DATE | Date operations |
| ANALYTIC | Window (analytic) functions |
| OTHER | Other function types |
Create and publish a function
On the Resource Management page, click the create icon to open the Create Resource or Function dialog box. Set the Type, Path, and Name of the function.
Click Confirm, then configure the function parameters:
| Parameter | Description |
|---|---|
| Function Type | The function category. See the function types table above. |
| Data Source | The data source where the function will be registered. |
| EMR Database | The EMR database where the function will be registered. |
| Resource Group | A Serverless resource group that can connect to the EMR data source. |
| Class Name | The fully qualified class name of the UDF, in the format Java package name.Actual class name. This must match the class name in the JAR package. For example, if the Java package is com.aliyun.emr.examples.udf and the class is UDAFExample, set this to com.aliyun.emr.examples.udf.UDAFExample. To find the class name, use Copy Reference in IntelliJ IDEA. |
| Resource List | Required. Select a resource from the current workspace. |
In the toolbar, click Save and then Publish. Only published functions are available in Data Studio.
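Because the Class Name value must match a class inside the JAR package, it can help to verify this locally before registering the function. The following is a rough sketch using only the Python standard library. The helper names class_entry and jar_contains_class are hypothetical, and the check covers top-level classes only.

```python
import zipfile


def class_entry(class_name):
    """Convert a fully qualified class name to its entry path inside a JAR."""
    # com.aliyun.emr.examples.udf.UDAFExample
    #   -> com/aliyun/emr/examples/udf/UDAFExample.class
    return class_name.replace(".", "/") + ".class"


def jar_contains_class(jar_path, class_name):
    """Return True if the JAR at jar_path contains the compiled class."""
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry(class_name) in jar.namelist()
```

For example, jar_contains_class("example.jar", "com.aliyun.emr.examples.udf.UDAFExample") returns False if the package or class name is misspelled, which would show up as a mismatch with the Class Name you enter.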
Use a function
After a function is published, use it in a data development node or an SQL query.
In a data development node:
In the left navigation pane, click Resource Management.
Find the function, right-click it, and select Insert Function.
The function name is automatically inserted into the editor, for example, example_function().
In an SQL query:
Call the function directly by name:
SELECT example_function(column_name) FROM table_name;
Manage resources and functions
From the Resource Management page, you can manage existing resources and functions.
View version history: Click the version icon on the right side of the resource or function editing page to view and compare saved or submitted versions. Select at least two versions to compare.
Delete a resource or function: Right-click the target item and select Delete. The item is removed from the production environment only after you publish the deletion.