DataWorks provides the image management feature that you can use to create and manage task running environments. If a specific environment is required for task running, you can use the image management feature to create a custom image that integrates required development packages and dependencies. For example, you can use a custom image to install third-party dependencies that are required to run PyODPS tasks. This topic describes how to use the image management feature to create a custom image.
Background information
By default, DataWorks uses the Default standard image to run tasks. DataWorks provides an appropriate image based on the tasks that you want to run. The official images serve as pre-configured base images to provide a standardized runtime environment for tasks of specific types. Custom images provide enhanced functionality and flexibility on the basis of the official images. You can expand the base images based on your actual application scenarios to achieve on-demand feature customization. This helps improve the execution efficiency and flexibility of data processing tasks.
Methods to create a custom image:
Based on a DataWorks official image
By referencing an image in Alibaba Cloud Container Registry
Based on a personal development environment
Node types that support custom images and image creation methods:
Node type
Creation by referencing an image in Container Registry
Creation based on a DataWorks official image
PyODPS 2
PyODPS 3
EMR Spark
EMR Spark SQL
EMR Shell
Shell
Python
Notebook
Usage notes
The image management feature can be used together with only a serverless resource group.
Note: If a third-party package is required when you use an old-version exclusive resource group for scheduling to run PyODPS nodes, you can use the O&M Assistant feature to install the third-party package. For more information, see Use an exclusive resource group for scheduling to configure a third-party open source package (not recommended).
The maximum number of custom images that can be created varies based on the DataWorks edition.
DataWorks Basic Edition and Standard Edition: 10
DataWorks Professional Edition: 50
DataWorks Enterprise Edition: 100
Only DataWorks Professional Edition or a more advanced edition supports the image building feature.
A maximum of two images can be built at the same time in each region.
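The limits above can be written down as a small lookup, which may help when you script checks before creating or building images. This is only an illustrative sketch: the constant and function names are hypothetical and are not part of any DataWorks API; only the numbers come from the usage notes.

```python
# Limits copied from the usage notes in this topic.
# All names below are hypothetical helpers, not DataWorks APIs.
MAX_CUSTOM_IMAGES = {
    "Basic": 10,
    "Standard": 10,
    "Professional": 50,
    "Enterprise": 100,
}
BUILD_CAPABLE_EDITIONS = {"Professional", "Enterprise"}
MAX_CONCURRENT_BUILDS_PER_REGION = 2


def can_create_image(edition: str, existing_images: int) -> bool:
    """Return True if one more custom image fits within the edition quota."""
    return existing_images < MAX_CUSTOM_IMAGES[edition]


def can_start_build(edition: str, builds_in_progress: int) -> bool:
    """Return True if the edition supports image building and fewer than
    two builds are currently running in the region."""
    return (edition in BUILD_CAPABLE_EDITIONS
            and builds_in_progress < MAX_CONCURRENT_BUILDS_PER_REGION)
```

For example, `can_start_build("Standard", 0)` returns `False` because Standard Edition does not support image building.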
Prerequisites
A serverless resource group is created. The image management feature must be used together with a serverless resource group. For more information about serverless resource groups, see Create and use a serverless resource group.
Optional. The virtual private cloud (VPC) with which the serverless resource group is associated has access to the Internet. This prerequisite is required if the environment in which you want to run tasks depends on a third-party package that is deployed over the Internet. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
The AliyunDataWorksFullAccess policy or a policy that contains the ModifyResourceGroup permission is attached to the account that you want to use. For more information, see Manage permissions on the DataWorks services and the entities in the DataWorks console by using RAM policies.
Container Registry is activated. This prerequisite is required if you want to create a custom image by referencing an image in Container Registry. For more information about Container Registry, see What is Container Registry?
Step 1: Go to the Image Management page
Log on to the DataWorks console.
Go to the Image Management page.
In the left-side navigation pane, click Image Management.
Step 2: Create a custom image
When you create a custom image in the DataWorks console, you can set the Reference Type parameter to DataWorks Official Image or Alibaba Cloud Container Registry Image. The parameters that are configured to create a custom image vary based on the reference type that you select.
Method 1: Create a custom image based on a DataWorks official image
Configure parameters that are described in the following table.
Parameter
Description
Image Name
The name of the custom image.
Image Description
The description of the custom image.
Reference Type
Select DataWorks Official Image.
Image Namespace
The value of this parameter is fixed to DataWorks Default.
Image Repository
The value of this parameter is fixed to DataWorks Default.
Image Name/ID
Select a DataWorks official image based on which you want to create a custom image.
Visible Scope
The scope in which the custom image is visible. Valid values: Visible Only to Creator and Visible to all.
Module
The service to which the custom image can be applied. This parameter can only be set to DataStudio.
Supported Task Type
dataworks_shell_task_pod: available for Shell tasks
dataworks_pyodps_task_pod: available for PyODPS 2 and PyODPS 3 tasks
dataworks_emr_datalake_5.15.1_task_pod: available for E-MapReduce (EMR) Spark, EMR Spark SQL, and EMR Shell tasks
Installation Package
The third-party package that you want to use. You can use one of the following methods to install a third-party package:
Quick installation: Select Python2, Python3, or Yum from the Installation Package drop-down list and then select a desired environment or resource.
Manual input: Select Script from the Installation Package drop-down list. Then, write commands in the command box to install a desired third-party package. You can run one of the following commands to install a third-party package:
pip install xx for Python 2
/home/tops/bin/pip3 install 'urllib3<2.0' for Python 3
yum install -y git
wget git
Click OK.
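After the image is created, you may want to confirm from inside a task that the packages you listed under Installation Package are actually importable. The following is a minimal sketch that uses only the Python standard library; the function name is hypothetical, and the module names in the usage section are examples taken from this topic.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found by the
    interpreter that runs this script."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "urllib3" and "jieba" are example names used elsewhere in this topic.
    missing = missing_packages(["urllib3", "jieba"])
    print("missing:", missing if missing else "none")
```

Running this in a node that uses the custom image should report no missing modules if the installation succeeded.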
Method 2: Create a custom image by referencing an image in Container Registry
Limits
DataWorks allows you to reference only an image in Container Registry Enterprise Edition.
DataWorks allows you to access a Container Registry instance used to build an image only over a VPC.
A Container Registry instance that you can use in DataWorks cannot exceed 5 GB in size.
Configure parameters that are described in the following table.
Parameter
Description
Image Name
The name of the custom image.
Image Description
The description of the custom image.
Reference Type
Select Alibaba Cloud Container Registry Image.
Image Instance ID
Select a Container Registry Enterprise Edition instance that is created in Alibaba Cloud Container Registry by instance ID. For information about how to create an instance, see Create a Container Registry Enterprise Edition instance.
Image Namespace
Select a namespace based on the selected instance. For information about how to create a namespace, see Create a namespace.
Image Repository
Select an image repository based on the selected instance. For information about how to create an image repository, see Create an image repository.
Image Version
Select a version for the custom image that you want to create based on the selected image repository.
VPC to Associate
Select a VPC with which the instance needs to be associated. For information about how to configure a VPC, see Configure a VPC ACL.
Visible Scope
The scope in which the custom image is visible. Valid values: Visible Only to Creator and Visible to all.
Module
The service to which the custom image can be applied. This parameter can only be set to DataStudio.
Supported Task Type
A Container Registry image is started by using startup commands and a task code file path. The following information describes supported task types and default startup commands:
Shell
Python
If you want to apply a custom image that is created based on an Alibaba Cloud Container Registry image to a Python task, you must make sure that the desired Container Registry instance contains a Python environment.
Notebook
If you want to apply a custom image that is created based on an Alibaba Cloud Container Registry image to a notebook task, use the following basic notebook image provided by DataWorks as the base image of the Container Registry image to provide a runtime environment for the notebook task:
dataworks-public-registry.cn-shanghai.cr.aliyuncs.com/public/dataworks-notebook:py3.11-ubuntu22.04-20241202
Make sure that the environment that you use to create an image has access to the Internet. This way, you can obtain the basic notebook image provided by DataWorks as expected.
Click OK.
Method 3: Create a custom image based on a personal development environment instance
New-version Data Studio allows you to create an image for a personal development environment. For more information, see Create an image of a personal development environment instance.
Step 3: Publish the custom image
On the Custom Images tab, find the created custom image.
Click Publish in the Actions column.
In the Publish Image panel, configure the Test Resource Group parameter and click Test to the right of Test Result.
Note: Select a serverless resource group for Test Resource Group.
After the test succeeds, click Publish.
Only images that pass the test can be published.
If the custom image installs a third-party package that is deployed over the Internet and the image does not pass the test after a long period of time, check whether the VPC with which the selected test resource group is associated can access the Internet. If the VPC cannot access the Internet, enable Internet access for the VPC. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
If an image fails the test, you can modify the image configurations: Find the desired custom image, move the pointer over the icon in the Actions column, and then select Modify.
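When a test hangs because a package cannot be downloaded, one way to narrow down the cause is to check outbound connectivity from a task that runs on the same resource group. The sketch below only opens a TCP connection. The function name is hypothetical, and the host that you probe (for example, the package index that you install from) is your own choice, not something that DataWorks prescribes.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) can be opened
    within `timeout` seconds, and False on any network error."""
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
    conn.close()
    return True
```

For example, calling `can_reach("pypi.org")` from a task and getting `False` suggests that the VPC has no Internet access and that the SNAT configuration should be checked. The `connect` parameter exists only so that the function can be exercised without a network.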
Step 4: Associate the custom image with a workspace
On the Custom Images tab, find the custom image that is published.
Move the pointer over the icon in the Actions column and select Change Workspace to associate the custom image with a workspace.
Step 5: Build a permanent image
After you complete the operations in Step 3, you can use the custom image as expected in your business. However, each time you run a node that uses the custom image, DataWorks redeploys the image environment and downloads a third-party package. As a result, the node running duration is extended and more computing fees may be generated. In this case, DataWorks allows you to create custom images as permanent images. This way, the same image environment can be used each time you run a node, which frees you from repeatedly deploying an image environment. This ensures the consistency of the runtime environment and reduces task running duration, computing costs, and traffic costs.
Only custom images that are created based on official images can be built as permanent images.
Perform the following steps to build a permanent image:
Log on to the DataWorks console. In the top navigation bar, select a desired region. Then, click Image Management in the left-side navigation pane.
On the Image Management page, click the Custom Images tab. On the Custom Images tab, find the custom image that is published.
Move the pointer over the icon in the Actions column and select Create.
In the Resource Group for Which You Want to Create Image dialog box, select a resource group that you want to use from the drop-down list and click Continue.
Note: It takes approximately 5 to 10 minutes to complete image building. The actual time that is required varies based on the size of the image that you want to build.
You are charged computing fees when you build an image. The computing fees are calculated by using the following formula: 0.5 CUs × Duration for image building. For more information, see Billing of data computing.
An image may fail to be built due to network exceptions. To prevent this issue, make sure that the resource group that you selected in the Resource Group for Which You Want to Create Image dialog box is the test resource group that you selected in Step 3: Publish the custom image in this topic.
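To get a feel for the billing formula in this step, the following sketch converts a build duration into consumed compute units. It assumes that the duration is metered in hours, so the result is in CU-hours; this topic does not state the unit, so check Billing of data computing for the actual metering rules and unit price. The function name is hypothetical.

```python
CUS_PER_BUILD = 0.5  # taken from the formula: 0.5 CUs x duration


def build_cost_cu_hours(duration_minutes: float) -> float:
    """Return the CU-hours consumed by one image build.

    Assumption: duration is billed in hours, so minutes are divided by 60.
    """
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    return CUS_PER_BUILD * duration_minutes / 60.0
```

Under that assumption, a build in the typical 5 to 10 minute range consumes well under 0.1 CU-hours.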
What to do next: Use the custom image
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Find a desired node on the DataStudio page and double-click the node name to go to the configuration tab of the node. Click Properties in the right-side navigation pane and configure parameters in the Resource Group section.
Resource Group: Select a serverless resource group.
Note: Make sure that the Resource Group parameter is set to the test resource group that you selected when you published the custom image to ensure smooth running of the node.
If the desired resource group is not displayed in the Resource Group drop-down list, check whether the resource group is associated with the current workspace. If the resource group is not associated with the current workspace, you can perform the following operations to complete the association: Go to the Resource Groups page. Find the desired resource group and click Associate Workspace in the Actions column.
Image: Select a custom image that is published.
Save and commit the node.
Note: The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.
Example: Use an image in a PyODPS node to segment data in Chinese
Suppose that you want to segment a column of Chinese text in a MaxCompute table on a node and store the segmentation result in another table for a descendant node to use. In this case, you can pre-install the segmentation tool Jieba in a custom image and run a PyODPS task that uses the custom image to segment the Chinese text in the MaxCompute table and store the result in another table. This way, the descendant node can read the segmented data directly when it is scheduled.
Create test data.
Add a MaxCompute data source to DataWorks, and associate the MaxCompute data source with DataStudio. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.
In DataStudio, create an ODPS node, create a test table, and then add test data to the table.
Note: In the following example, a scheduling parameter is used. On the Properties tab in the right-side navigation pane of the configuration tab of the node, add a parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.
Save and deploy the node.
Create a custom image.
Follow the instructions that are described in Step 2: Create a custom image in this topic to create a custom image. Settings of key parameters:
Image Name/ID: Select dataworks_pyodps_task_pod.
Supported Task Type: Select PyODPS 2 and PyODPS 3.
Installation Package: Select Python3 and jieba.
Publish the custom image and associate the custom image with a workspace. For more information, see the Step 3: Publish the custom image and Step 4: Associate the custom image with a workspace sections in this topic.
Use the custom image in a scheduling task.
In DataStudio, create and configure a PyODPS 3 node.
On the Properties tab in the right-side navigation pane of the configuration tab of the node, configure the following key settings:
Add a scheduling parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.
Select a serverless resource group, which is the test resource group that you used when you published the custom image, as a resource group for scheduling.
Select the custom image that is published and associated with the current workspace.
Save and run the node with parameters configured.
Optional. Create an ad hoc query and execute the following SQL statement to check whether the output table contains data:
SELECT * FROM participle_tb WHERE ds=<Partition date>;
Deploy the PyODPS node to the production environment.
Note: The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.
Create the custom image as a permanent image. For more information, see the Step 5: Build a permanent image section in this topic.
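The PyODPS 3 node in this example could contain code along the following lines. This is a sketch under stated assumptions, not the exact code of the original example: the source table name `custom_img_test_tb` and the column names `id` and `content` are hypothetical, and the output table `participle_tb` is assumed to already exist with the same two columns and a `ds` partition column. `o` and `args` are the MaxCompute entry object and the scheduling parameter dictionary that DataWorks provides in PyODPS nodes.

```python
def segment_rows(rows, cut):
    """Yield (row_id, segmented_text) pairs.

    `cut` is any callable that splits a string into tokens, for example
    jieba.cut. Whitespace-only tokens are dropped and the remaining tokens
    are joined with single spaces.
    """
    for row_id, text in rows:
        tokens = [tok for tok in cut(text) if tok.strip()]
        yield row_id, " ".join(tokens)


def run(o, args, source_table="custom_img_test_tb", target_table="participle_tb"):
    """Segment one partition of source_table and write it to target_table.

    source_table and the column names "id" and "content" are hypothetical.
    """
    import jieba  # importable only if installed in the custom image

    partition = "ds=%s" % args["bday"]  # bday is the scheduling parameter
    with o.get_table(source_table).open_reader(partition=partition) as reader:
        rows = [(rec["id"], rec["content"]) for rec in reader]

    records = [list(pair) for pair in segment_rows(rows, jieba.cut)]
    o.write_table(target_table, records,
                  partition=partition, create_partition=True)


# In a PyODPS 3 node that uses the custom image, call: run(o, args)
```

Keeping the tokenizing step in `segment_rows` separate from the table I/O makes it possible to check the segmentation logic locally without a MaxCompute connection.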
Appendix: View official images
The following table describes the official images supported by DataWorks. You can also go to the Image Management page to view the official images.
Image name | Supported task type | Description |
dataworks_pyodps_py311_task_pod | | This image is suitable for only PyODPS 3 tasks. |
dataworks_pairec_task_pod | | This image is suitable for PyODPS 3 tasks to run an algorithm process generated by the PAI-Rec engine. |
dataworks_pyodps_task_pod | | - |
dataworks_emr_datalake_5.15.1_task_pod | | You can use this image to commit tasks in EMR DataLake clusters of V5.15.1. |
dataworks_shell_task_pod | | - |
dataworks_python_task_pod | | - |
dataworks_cdh_custom_task_pod | | |