Image management in DataWorks allows users to create and manage custom runtime environments for task execution. This interface enables the creation of custom images that incorporate necessary development packages and dependencies tailored to specific execution environments. For instance, custom images can be used to install third-party dependencies essential for running PyODPS tasks. This topic describes the process for creating custom images using image management features.
Background information
By default, DataWorks utilizes the Default standard image when executing tasks, selecting the most suitable image based on the task type. Official images act as pre-configured base images, providing a standardized runtime environment for various task types. Custom images build upon these base images, offering enhanced functionality and flexibility. Users can tailor these images to their specific application needs, optimizing the execution efficiency and adaptability of data processing tasks. DataWorks supports three primary methods for customizing images:
Create custom images directly from DataWorks official images.
Create custom images that reference images in Alibaba Cloud Container Registry (ACR).
Create custom images from personal development environments.
Instructions
The image management feature is only available in conjunction with a serverless resource group.
Note: If you run PyODPS task nodes on a legacy exclusive resource group for scheduling and the nodes depend on third-party packages, you can use the maintenance assistant instead. For more information, see how to configure third-party packages for exclusive resource groups for scheduling (not recommended).
The maximum number of custom images that can be created depends on the DataWorks edition:
Basic and Standard Editions: 10.
Professional Edition: 50.
Enterprise Edition: 100.
Only the Professional Edition and higher support the image building feature.
A maximum of two images can be built simultaneously in each region.
If EMR tasks that use the Default standard image wait a long time to start, the image for an older EMR cluster version may not have been initialized. In this case, submit a ticket.
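The quotas above can be captured in a small lookup for planning purposes. The following is a minimal sketch: the numbers are copied from this list, while the dictionary and function names are illustrative and are not part of any DataWorks API. It models only the editions named above.

```python
# Limits copied from the list above; all names here are illustrative.
MAX_CUSTOM_IMAGES = {
    "Basic": 10,
    "Standard": 10,
    "Professional": 50,
    "Enterprise": 100,
}
# "Professional Edition and higher" is read as Professional and Enterprise.
BUILD_EDITIONS = {"Professional", "Enterprise"}
MAX_CONCURRENT_BUILDS_PER_REGION = 2


def can_create_image(edition: str, existing_images: int) -> bool:
    """Return True if one more custom image fits within the edition quota."""
    return existing_images < MAX_CUSTOM_IMAGES[edition]


def can_start_build(edition: str, running_builds_in_region: int) -> bool:
    """Return True if the edition supports building and a build slot is free."""
    return (
        edition in BUILD_EDITIONS
        and running_builds_in_region < MAX_CONCURRENT_BUILDS_PER_REGION
    )
```

For example, `can_start_build("Standard", 0)` is False because image building requires the Professional Edition or higher.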
Prerequisites
A serverless resource group has been created. Image management works only with serverless resource groups. For more information, see Create and use serverless resource groups.
(Optional) If the task runtime environment depends on third-party packages downloaded from the public network, the VPC associated with the serverless resource group must have public network access. For configuration details, see Access the Internet using the SNAT feature of the public NAT Gateway.
Your account has been granted either the AliyunDataWorksFullAccess or the ModifyResourceGroup policy. For more information about authorization, see Product and console permission control details: RAM Policy.
Before you create a custom image from an ACR image, make sure that Container Registry is enabled. For more information, see Container Registry ACR.
Step 1: Access image management
Log on to the DataWorks console.
Access the image management page.
In the left-side navigation pane, click Image Management to access the image management page.
Step 2: Create a custom image
DataWorks supports creating custom images from DataWorks official images, from Alibaba Cloud ACR images, or from a personal development environment. The following sections describe the configuration for each method:
Method 1: Create directly based on DataWorks official images
Configure the custom image parameters:
Parameter: Description
Image Name: The name of the custom image.
Image Description: The description of the custom image.
Reference Data Type: Select DataWorks Official Images.
Image Namespace: Fixed as DataWorks Default.
Image Repository: Fixed as DataWorks Default.
Image Name/ID: The DataWorks official image to use as the base. Supported options:
  dataworks_shell_task_pod
  dataworks_pyodps_task_pod
  dataworks_emr_datalake_5.15.1_task_pod
  dataworks_pyodps_py311_task_pod
  dataworks_python_task_pod
  dataworks_pairec_task_pod
Visibility: The visibility of the custom image. Valid values: Visible To Creator Only and Visible To All.
Sub-product Usage: Custom images currently support only Data Development.
Supported Task Types: The task types supported by each official image:
  DataWorks Shell node official image: Supports the Shell task type.
  DataWorks PyODPS node official image: Supports the PyODPS 2 and PyODPS 3 task types.
  DataWorks EMR datalake 5.15.1 version official image: Supports the EMR Spark, EMR Spark SQL, and EMR SHELL task types.
Installation Package: Add the required third-party packages as needed. The following methods are supported:
  Quick installation: In the Installation Package drop-down list, select Python2, Python3, or Yum. You can then directly select the environment and resources to install.
  Manual input: In the Installation Package drop-down list, select Script, and then enter installation commands in the Script command box. The following example commands download third-party packages:
    pip example command (Python 2): pip install xx
    pip3 example command (Python 3): /home/tops/bin/pip3 install 'urllib3<2.0'
    yum example command: yum install -y git
    wget example command: wget git
Click OK.
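When many packages are installed through the Script option, the commands can be generated rather than typed by hand. The following is a minimal sketch that reuses the pip3 path and the yum form from the example commands above. The helper name is hypothetical, and it assumes that the Script command box accepts one command per line.

```python
import shlex

# Interpreter path taken from the pip3 example command above; it may differ in other base images.
PIP3 = "/home/tops/bin/pip3"


def build_install_script(pip_packages=(), yum_packages=()):
    """Return newline-separated install commands for the Script command box."""
    lines = []
    if yum_packages:
        lines.append("yum install -y " + " ".join(shlex.quote(p) for p in yum_packages))
    for requirement in pip_packages:
        # Quoting keeps the shell from treating "<" or ">" in a version
        # constraint as a redirection.
        lines.append(f"{PIP3} install {shlex.quote(requirement)}")
    return "\n".join(lines)


print(build_install_script(pip_packages=["urllib3<2.0", "jieba"], yum_packages=["git"]))
```

The version constraint `urllib3<2.0` is emitted in single quotes, matching the quoting used in the pip3 example command.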
Method 2: Create based on Alibaba Cloud ACR images
Conditions
Only images from Alibaba Cloud ACR Enterprise Edition instances can be used to create custom images.
Only one VPC can be selected for accessing an Alibaba Cloud ACR image instance.
Alibaba Cloud ACR images of up to 5 GB in size are supported.
Configure the custom image parameters:
Parameter: Description
Image Name: The name of the custom image.
Image Description: The description of the custom image.
Reference Data Type: Select Alibaba Cloud ACR Images.
Image Instance ID: Select the instance ID of an Enterprise Edition instance created in Alibaba Cloud Container Registry. For more information about creating instances, see Create an Enterprise Edition instance.
Image Namespace: Select a namespace under the selected image instance. For more information about creating namespaces, see Create a namespace.
Image Repository: Select an image repository under the selected image instance. For more information about creating image repositories, see Create an image repository.
Image Version: Select the image version in the selected image repository from which to create the custom image.
Associated VPC: Select the VPC bound to the image instance. For more information about configuring VPC networks, see Configure access control for virtual private clouds.
Visibility: The visibility of the custom image. Valid values: Visible To Creator Only and Visible To All.
Sub-product Usage: Custom images currently support only Data Development.
Supported Task Types:
  Shell
  Python
  Notebook: When running Notebook tasks in DataWorks using ACR images, use the Notebook base image provided by DataWorks as the base image for your ACR image to provide a runtime environment for Notebook tasks. DataWorks provides the Notebook base image: dataworks-notebook:py3.11-ubuntu22.04:py3.11-ubuntu22.04-20241202
Note: To use custom images created from Alibaba Cloud ACR images for Python tasks, confirm that your ACR image contains a Python environment. Otherwise, Python tasks are not supported.
To use custom images created from Alibaba Cloud ACR images for Notebook tasks, make sure that the environment used to build the image has public network access so that it can pull the Notebook base image provided by DataWorks.
Click OK.
Method 3: Create based on personal development environment instances
Data Studio's new data development feature allows you to create a new image from a personal development environment. For more information, see Create an image from a personal development environment.
Step 3: Publish a custom image
On the Custom Images tab, locate the custom image you created.
Click Publish in the Actions column.
Select the Test Resource Group and click Test next to Test Results.
Note: Choose a serverless resource group as the test resource group.
Once the test is successful, click Publish.
Only images that pass the test can be published.
If your custom image retrieves third-party packages from the public network and consistently fails tests, verify that the VPC associated with the Test Resource Group has the capability to access the public network. For details on configuring public network access for VPCs, see Access the Internet using the SNAT feature of the public NAT Gateway.
If the test fails, use the Operation column of the target custom image to modify the image configuration.
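Before retesting an image that downloads packages from the public network, it can help to confirm outbound connectivity from a task that runs on the same resource group. The following is a minimal sketch using only the Python standard library. The target host `pypi.org` is an assumption chosen for illustration; probe whichever package index or mirror your installation commands actually use.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # pypi.org is only an example target (assumption, not from the source).
    print("public network reachable:", can_reach("pypi.org"))
```

A result of False suggests checking the NAT Gateway and SNAT configuration of the VPC before changing the image itself.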
Step 4: Modify the image ownership space
On the Custom Images tab, locate the published custom image.
In the Operation column for the desired image, select the option that attaches the custom image to the associated workspace.
Step 5: Build a permanent image
After completing Step 3, custom images are typically ready for use in business scenarios. However, each time a task node runs, DataWorks redeploys the image environment and downloads the third-party packages again, which can increase node runtime and incur additional compute and traffic costs. To address this, DataWorks can convert a custom image into a permanent image, which provides a consistent runtime environment for each task node execution, saves time, and reduces costs.
Building permanent images is only supported for custom images created from official images.
Follow these steps:
Log on to the DataWorks console, switch to the appropriate region, and click Image Management in the left-side navigation pane.
On the Custom Images tab, locate the published custom image.
In the Operation column for the image, select the option that initiates the creation of a permanent image.
In the Select The Resource Group For Building The Image dialog box, select the resource group for image building, then click Continue.
Note: Image building typically takes about 5 to 10 minutes, but the exact time may vary depending on the image size.
Building an image incurs computing charges calculated as 0.5 CU × the duration of the build. For more information, see the description of data computing billing.
To prevent build failures due to network issues or other reasons, ensure that the Resource Group For Building The Image is the same as the Test Resource Group selected in Step 3: Publish a custom image.
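As a worked example of the build charge, the sketch below applies the 0.5 CU rate to the typical 5 to 10 minute build time. It assumes that the duration is metered in hours, so the result is in CU-hours. That unit is an assumption and is not stated above; check the billing description for the actual metering unit and price.

```python
BUILD_CU = 0.5  # CU consumed while an image is building, from the note above


def build_cu_hours(build_minutes: float) -> float:
    """Return CU-hours for one build, ASSUMING the duration is metered in hours."""
    return BUILD_CU * build_minutes / 60


for minutes in (5, 10):
    print(f"{minutes} min build -> {build_cu_hours(minutes):.4f} CU-hours")
# 5 min build -> 0.0417 CU-hours
# 10 min build -> 0.0833 CU-hours
```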
What to do next: Use the image
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Within the data development feature, locate the task node for the custom image, click Scheduling Configuration on the right, and set the resource properties:
Scheduling Resource Group: Choose a serverless resource group.
Note: To ensure smooth task node operation, the Scheduling Resource Group should match the Test Resource Group used when the image was published.
If the desired resource group is not listed, check if it's associated with the current workspace. Visit the Resource Group List page, locate the resource group, and click Bind Workspace in the Actions column to bind it.
Image: Select the published image.
Save and submit the changes.
Note: Modifications made to the image in data development are not automatically synchronized to the production environment. You must publish the task to apply the changes in production.
Example: Use images to perform Chinese word segmentation through PyODPS nodes
If you need to segment Chinese text within a column of a MaxCompute table and store the results in another table for downstream scheduling nodes, you can install the jieba segmentation toolkit in a custom image. Then, use this image to process the segmentation of the Chinese text via PyODPS tasks and save the outcomes in a new table, ensuring smooth integration with the downstream scheduling flow.
- Create test data.
  - Create a MaxCompute data source and bind it within DataWorks data development. For more information, see Create a MaxCompute data source.
  - Create an ODPS node in data development, establish a test table, and insert test data.
    Note: The example below uses scheduling parameters. In the Scheduling Configuration on the right, set the parameter name to bday and the value to $[yyyymmdd].
  - Save and publish.
- Create a custom image.
  Refer to Step 2: Create a custom image. Key parameters include the following:
  - Image name/ID: Choose dataworks_pyodps_task_pod, the official DataWorks PyODPS node image.
  - Supported task types: Select PyODPS 3.
  - Installation package: Choose Python3 and jieba.
- Publish the custom image and change its ownership workspace. For more information, see Step 3: Publish a custom image and Step 4: Modify the image ownership space.
- Use the custom image in a scheduling task.
  - Create and configure a PyODPS 3 node in data development with the following details:
    - On the Properties tab, configure the following key settings:
      - Scheduling parameters: Name bday, value $[yyyymmdd].
      - Scheduling Resource Group: Choose a serverless resource group, the same as the Test Resource Group used when publishing the image.
      - Image: Select the published custom image bound to the current workspace.
  - Save, configure parameters, and run the node.
  - (Optional) Execute the following SQL statement in an ad hoc query to verify that the output table contains data.
    SELECT * FROM participle_tb WHERE ds=<partition date>;
  - Deploy the PyODPS node to the production environment.
    Note: The image updated in data development is not synchronized to the production environment. You must publish the task to apply the changes in production.
- Build the custom image as a permanent image. For more information, see Step 5: Build a permanent image.
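The PyODPS 3 node in this example could look roughly like the following. This is a minimal sketch, not the original node code. The input table name `custom_img_test_tb`, its column `sentence`, and the single-column layout of `participle_tb` are hypothetical, so adjust them to your test table. The `o` and `args` objects are the globals that DataWorks provides inside PyODPS nodes, and `jieba` is available only because it was installed in the custom image.

```python
def segment_rows(rows, cut):
    """Segment each text with cut() and join the tokens with single spaces."""
    return [" ".join(token for token in cut(text) if token.strip()) for text in rows]


def run(o, args):
    """Read one partition, segment its text column, and write the result."""
    import jieba  # importable only when the node runs on the custom image

    partition = "ds=" + args["bday"]  # bday is the scheduling parameter set above
    source = o.get_table("custom_img_test_tb")  # hypothetical input table
    with source.open_reader(partition=partition) as reader:
        texts = [record["sentence"] for record in reader]  # hypothetical column

    segmented = segment_rows(texts, jieba.cut)
    # Assumption: participle_tb has one STRING column plus the ds partition.
    o.write_table(
        "participle_tb",
        [[line] for line in segmented],
        partition=partition,
        create_partition=True,
    )


# DataWorks PyODPS nodes predefine `o` (the MaxCompute entry object) and `args`
# (scheduling parameters); outside such a node this call is skipped.
if "o" in globals() and "args" in globals():
    run(o, args)
```

Keeping the segmentation in `segment_rows` separates it from table I/O, so it can be checked locally with any tokenizer function before the node is scheduled.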
Appendix: View official images
- Log on to the DataWorks console, switch to the region where your DataWorks workspace is located, and click Image Management in the left-side navigation pane.
- View the official images available for DataWorks. The following official images are supported:
  - DataWorks Shell node official image: Supports the Shell task type.
  - DataWorks PyODPS node official image: Supports the PyODPS 2 and PyODPS 3 task types.
  - DataWorks EMR datalake 5.15.1 version official image: Supports the EMR Spark, EMR Spark SQL, and EMR SHELL task types.
    Note: You can use this image to submit tasks to EMR DataLake clusters of version 5.15.1.
  - DataWorks CDH node official image: Supports the CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala task types.
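When choosing a base image for a given node type, the list above can be expressed as a lookup. The following is a minimal sketch: the task types are copied from this appendix and the image identifiers come from the Image Name/ID options in Step 2. Pairing the Shell and EMR identifiers with their descriptions is inferred from the names. The CDH image is omitted because no identifier for it appears in this topic.

```python
# Mapping assembled from this topic; the Shell and EMR pairings are inferred
# from the image names, so verify them in the console before relying on them.
SUPPORTED_TASK_TYPES = {
    "dataworks_shell_task_pod": ["Shell"],
    "dataworks_pyodps_task_pod": ["PyODPS 2", "PyODPS 3"],
    "dataworks_emr_datalake_5.15.1_task_pod": ["EMR Spark", "EMR Spark SQL", "EMR SHELL"],
}


def images_for(task_type: str) -> list:
    """Return the official images whose supported task types include task_type."""
    return [image for image, types in SUPPORTED_TASK_TYPES.items() if task_type in types]


print(images_for("PyODPS 3"))  # ['dataworks_pyodps_task_pod']
```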
References
- When using a custom image, you must select a serverless resource group for scheduling. For more information about serverless resource groups, see Create and use serverless resource groups.
- To create a custom CDH task runtime environment when developing a custom image, see Develop tasks based on a self-built Hadoop cluster.
- For additional details on PyODPS, see PyODPS.