Configure custom images for data development nodes - DataWorks

Image management in DataWorks allows users to create and manage custom runtime environments for task execution. This interface enables the creation of custom images that incorporate necessary development packages and dependencies tailored to specific execution environments. For instance, custom images can be used to install third-party dependencies essential for running PyODPS tasks. This topic describes the process for creating custom images using image management features.

Background information

By default, DataWorks utilizes the Default standard image when executing tasks, selecting the most suitable image based on the task type. Official images act as pre-configured base images, providing a standardized runtime environment for various task types. Custom images build upon these base images, offering enhanced functionality and flexibility. Users can tailor these images to their specific application needs, optimizing the execution efficiency and adaptability of data processing tasks. DataWorks supports three primary methods for customizing images:

Direct creation of custom images from DataWorks official images.
Reference images from Alibaba Cloud Container Registry (ACR).
Create custom images from your personal development environments.

Usage notes

The image management feature is only available in conjunction with a serverless resource group.

Note
If you are operating PyODPS task nodes with a legacy exclusive resource group for scheduling and depend on third-party packages, you can utilize the maintenance assistant. For more information, see how to configure third-party packages for exclusive resource groups for scheduling (not recommended).
The maximum number of custom images that can be created depends on the DataWorks edition:
- Basic and Standard Editions: 10.
- Professional Edition: 50.
- Enterprise Edition: 100.
Only the Professional Edition and higher support the image building feature.
A maximum of two images can be built simultaneously in each region.
If selecting the Default standard image for EMR-type tasks results in long wait times, it may be due to older EMR cluster version images not being initialized. To address this, submit a ticket.
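
The edition limits above can be encoded as a small lookup for planning purposes. The following is a minimal sketch, not part of any DataWorks API: the dictionary keys and function names are illustrative assumptions, and only the editions named in this topic are covered.

```python
# Quotas as listed in this topic. Keys and function names are illustrative,
# not DataWorks identifiers.
MAX_CUSTOM_IMAGES = {
    "basic": 10,
    "standard": 10,
    "professional": 50,
    "enterprise": 100,
}

# Editions that this topic says support image building.
BUILD_CAPABLE_EDITIONS = {"professional", "enterprise"}

# At most two images can be built at the same time in each region.
MAX_CONCURRENT_BUILDS_PER_REGION = 2


def can_create_image(edition: str, existing_images: int) -> bool:
    """Return True if one more custom image fits within the edition quota."""
    return existing_images < MAX_CUSTOM_IMAGES[edition.lower()]


def can_start_build(edition: str, running_builds_in_region: int) -> bool:
    """Return True if the edition supports builds and a build slot is free."""
    return (
        edition.lower() in BUILD_CAPABLE_EDITIONS
        and running_builds_in_region < MAX_CONCURRENT_BUILDS_PER_REGION
    )
```

For example, `can_create_image("Standard", 10)` is false because the Standard Edition quota of 10 is already used, and `can_start_build("Standard", 0)` is false because building requires the Professional Edition or higher.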

Prerequisites

A serverless resource group has been created. This feature must be utilized in conjunction with a serverless resource group. For more information on serverless resource groups, see Create and use serverless resource groups.
(Optional) If the task's operating environment requires dependencies on third-party packages from the public network, the VPC associated with the serverless resource group must be capable of accessing the public network. For specific configuration details, see how to access the Internet using the SNAT feature of the public NAT Gateway.
You have either the AliyunDataWorksFullAccess or ModifyResourceGroup policy. For more information about authorization, see Product and console permission control details: RAM Policy.
Before creating a custom image from an ACR image, ensure that the Container Registry is enabled. For more information about the Container Registry ACR, see Container Registry ACR.

Step 1: Access image management

Log on to the DataWorks console.
Access the image management page.

In the left-side navigation pane, click Image Management to access the image management page.

Step 2: Create a custom image

DataWorks supports creating custom images from DataWorks official images, from Alibaba Cloud ACR images, or from a personal development environment. The following sections describe each method:

Method 1: Create directly based on DataWorks official images

Configure the custom image parameters:

Parameter	Description
Image Name	The name of the custom image.
Image Description	The description of the custom image.
Reference Data Type	Select DataWorks Official Images.
Image Namespace	Fixed as DataWorks Default.
Image Repository	Fixed as DataWorks Default.
Image Name/ID	The DataWorks official image to use as the base. Supported options: dataworks_shell_task_pod, dataworks_pyodps_task_pod, dataworks_emr_datalake_5.15.1_task_pod, dataworks_pyodps_py311_task_pod, dataworks_python_task_pod, dataworks_pairec_task_pod.
Visibility	Controls who can see the custom image: Visible To Creator Only or Visible To All.
Sub-product Usage	Custom images currently support only Data Development.
Supported Task Types	DataWorks Shell node official image: Supports `Shell` task type. DataWorks PyODPS node official image: Supports `PyODPS 2` and `PyODPS 3` task types. DataWorks EMR datalake 5.15.1 version official image: Supports `EMR Spark`, `EMR Spark SQL`, and `EMR SHELL` task types.
Installation Package	Add the third-party packages that the image requires. Two methods are supported. Quick installation: In the Installation Package drop-down list, select `Python2`, `Python3`, or `Yum`, and then select the packages to install. Manual input: In the Installation Package drop-down list, select `Script`, and then enter installation commands in the Script command box. Example commands for downloading third-party packages: pip (Python 2): `pip install xx`. pip3 (Python 3): `/home/tops/bin/pip3 install 'urllib3<2.0'`. yum: `yum install -y git`. wget: `wget git`.

Click OK.
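
After the packages are installed, it can help to confirm from a Python-based node that they are actually importable in the image. The following is a minimal sketch that uses only the Python standard library. The function name `missing_packages` and the package list are illustrative assumptions, and the check expects top-level import names (for example `urllib3`), not pip distribution names.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: replace with the packages you installed in the image.
    required = ["urllib3"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Running this in a node that uses the custom image gives a quick signal before you wire the image into scheduled tasks.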

Method 2: Create based on Alibaba Cloud ACR images

Limits

Custom images can be created only from Alibaba Cloud ACR Enterprise Edition instances.
Only one VPC can be selected for accessing the Alibaba Cloud ACR instance.
Alibaba Cloud ACR images of up to 5 GB in size are supported.

Configure the custom image parameters:

Parameter	Description
Image Name	The name of the custom image.
Image Description	The description of the custom image.
Reference Data Type	Select Alibaba Cloud ACR Images.
Image Instance ID	Select, by instance ID, an Enterprise Edition instance created in Alibaba Cloud Container Registry. For more information about creating instances, see Create an Enterprise Edition instance.
Image Namespace	Select a namespace in the selected image instance. For more information about creating namespaces, see Create a namespace.
Image Repository	Select an image repository in the selected image instance. For more information about creating image repositories, see Create an image repository.
Image Version	Select the image version in the selected image repository from which to create the custom image.
Associated VPC	Select the VPC bound to the image instance. For more information about configuring VPCs, see Configure access control for virtual private clouds.
Visibility	Controls who can see the custom image: Visible To Creator Only or Visible To All.
Sub-product Usage	Custom images currently support only Data Development.
Supported Task Types	`Shell`, `Python`, and `Notebook`. To run Notebook tasks in DataWorks with an ACR image, build your ACR image on the Notebook base image provided by DataWorks so that it supplies a runtime environment for Notebook tasks. DataWorks provides the Notebook base image: `dataworks-notebook:py3.11-ubuntu22.04:py3.11-ubuntu22.04-20241202` Note: To use a custom image created from an Alibaba Cloud ACR image for Python tasks, confirm that your ACR image contains a Python environment. Otherwise, Python tasks cannot run. To use such an image for Notebook tasks, ensure that the environment used to build the image has public network access so that it can pull the Notebook base image provided by DataWorks.

Click OK.

Method 3: Create based on personal development environment instances

Data Studio's new data development feature allows you to create a new image from a personal development environment. For more information, see Create an image from a personal development environment.

Step 3: Publish a custom image

On the Custom Images tab, locate the custom image you created.
Click Publish in the Actions column.
Select the Test Resource Group and click Test next to Test Results.
Note
Choose a serverless resource group as the test resource group.
Once the test is successful, click Publish.

Note

Only images that pass the test can be published.
If your custom image retrieves third-party packages from the public network and consistently fails tests, verify that the VPC associated with the Test Resource Group has the capability to access the public network. For details on configuring public network access for VPCs, see Access the Internet using the SNAT feature of the public NAT Gateway.
If the test fails, select Edit in the Operation column of the target custom image to modify the image configuration.
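
When a test keeps failing because packages cannot be downloaded, a basic reachability probe run from a node on the same resource group can help separate network problems from packaging problems. The following is a minimal sketch using the standard `socket` module. The function name `can_reach` is illustrative, and the host in the comment is only an assumed example of a public package index, not something DataWorks requires.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example (hypothetical host): can_reach("pypi.org") returning False suggests that
# the VPC of the resource group has no route to the public network.
```

A successful TCP connection does not guarantee that a package download will succeed, but a failed one points to the VPC configuration rather than the image definition.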

Step 4: Modify the image ownership space

On the Custom Images tab, locate the published custom image.
In the Operation column for the desired image, select Modify Associated Workspace to associate the custom image with a workspace.

Step 5: Build a permanent image

After you complete Step 3, the custom image is typically ready for use in business scenarios. However, each time a task node runs, DataWorks redeploys the image environment and downloads the third-party packages again, which can increase node runtime and incur additional compute and traffic costs. To avoid this, you can build the custom image into a permanent image. A permanent image provides the same runtime environment for every task node execution, which saves time and reduces costs.

Note

Building permanent images is only supported for custom images created from official images.

Follow these steps:

Log on to the DataWorks console, switch to the appropriate region, and click Image Management in the left-side navigation pane.
On the Custom Images tab, locate the published custom image.
In the Operation column for the image, select Build to start building the permanent image.
In the Select The Resource Group For Building The Image dialog box, select the resource group for image building, then click Continue.
Note
- Image building typically takes about 5 to 10 minutes, but the exact time may vary depending on the image size.
- Building an image will result in computing charges calculated at 0.5 CU × the duration of the build. For more information, see the description of data computing billing.
- To prevent build failures due to network issues or other reasons, ensure that the Resource Group For Building The Image is the same as the Test Resource Group selected in Step 3: Publish a Custom Image.
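
The build charge described above can be estimated with simple arithmetic. The following is a rough sketch that assumes the build duration is metered in hours (so the result is in CU-hours) and takes the unit price as an input. The function names and the price parameter are illustrative; refer to the billing documentation for actual prices.

```python
BUILD_CU = 0.5  # CU consumed while an image is being built, per this topic


def build_cu_hours(build_minutes: float) -> float:
    """CU-hours consumed by a build, assuming the duration is metered in hours."""
    return BUILD_CU * build_minutes / 60.0


def build_cost(build_minutes: float, price_per_cu_hour: float) -> float:
    """Estimated charge for one build; price_per_cu_hour is a hypothetical input."""
    return build_cu_hours(build_minutes) * price_per_cu_hour
```

Under these assumptions, a build that takes 10 minutes consumes 0.5 × 10 / 60 CU-hours, which is about 0.083 CU-hours.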

What to do next: Use the image

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Within the data development feature, locate the task node for the custom image, click Scheduling Configuration on the right, and set the resource properties:
- Scheduling Resource Group: Choose a serverless resource group.
  Note
  - To ensure smooth task node operation, the Scheduling Resource Group should match the Test Resource Group used during Image Publishing.
  - If the desired resource group is not listed, check if it's associated with the current workspace. Visit the Resource Group List page, locate the resource group, and click Bind Workspace in the Actions column to bind it.
- Image: Select the published image.
Save and submit the changes.

Note
Modifications made to the image in data development will not automatically synchronize to the production environment. You must publish the task to apply the changes in production.

Example: Use images to perform Chinese word segmentation through PyODPS nodes

If you need to segment Chinese text within a column of a MaxCompute table and store the results in another table for downstream scheduling nodes, you can install the jieba segmentation toolkit in a custom image. Then, use this image to process the segmentation of the Chinese text via PyODPS tasks and save the outcomes in a new table, ensuring smooth integration with the downstream scheduling flow.

Create test data.

Create a MaxCompute data source and bind it within DataWorks data development. For more information, see Create a MaxCompute data source.

Create an ODPS node in data development, establish a test table, and insert test data.

Note

The example below utilizes scheduling parameters. Set the parameter name to bday and the value to $[yyyymmdd] in the Scheduling Configuration on the right.
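
To make the parameter concrete, the following sketch shows the kind of value that `bday` is expected to carry and how it becomes a partition specification. It assumes that `$[yyyymmdd]` resolves to the scheduled time of the instance formatted as an eight-digit date. The function names are illustrative and are not part of DataWorks.

```python
from datetime import datetime


def resolve_yyyymmdd(scheduled_time: datetime) -> str:
    """Format a scheduled time the way $[yyyymmdd] is assumed to be resolved."""
    return scheduled_time.strftime("%Y%m%d")


def partition_spec(bday: str) -> str:
    """Build the partition specification used to read and write the ds partition."""
    return f"ds={bday}"
```

For an instance scheduled at 00:30 on January 31, 2024, `resolve_yyyymmdd` returns `20240131`, and the tables in this example are read and written in the partition `ds=20240131`.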

Create a test table.

-- Create a test table
CREATE TABLE IF NOT EXISTS custom_img_test_tb
(
    c_customer_id BIGINT NOT NULL,
    c_customer_text STRING NOT NULL,
    PRIMARY KEY (c_customer_id)
)
COMMENT 'TABLE COMMENT'
PARTITIONED BY (ds STRING COMMENT 'Partition')
LIFECYCLE 90;

-- Insert test data into the test table
INSERT INTO custom_img_test_tb PARTITION (ds='${bday}') (c_customer_id, c_customer_text) VALUES
(1, '晚来天欲雪，能饮一杯无？'),
(2, '月落乌啼霜满天，江枫渔火对愁眠。'),
(3, '山重水复疑无路，柳暗花明又一村。'),
(4, '春眠不觉晓，处处闻啼鸟。'),
(5, '静夜思，床前明月光，疑是地上霜。'),
(6, '海上生明月，天涯共此时。'),
(7, '旧时王谢堂前燕，飞入寻常百姓家。'),
(8, '一行白鹭上青天，窗含西岭千秋雪。'),
(9, '人生得意须尽欢，莫使金樽空对月。'),
(10, '天生我材必有用，千金散尽还复来。');

Save and publish.

Create a custom image.

Refer to Step 2: Create a custom image. Key parameters include the following:
- Image name/ID: Choose dataworks_pyodps_task_pod, the official DataWorks PyODPS node image.
- Supported task types: Select PyODPS 3.
- Installation package: Choose Python3 and jieba.
Publish the custom image and associate it with your workspace. For more information, see Step 3: Publish a custom image and Step 4: Modify the image ownership space.

Use the custom image in a scheduling task.

Create and configure a PyODPS 3 node in data development with the following details:

Use the custom image.

import jieba
from odps.models import TableSchema as Schema, Column

# In a DataWorks PyODPS node, `o` (the MaxCompute entry object) and `args`
# (the scheduling parameters) are provided by the runtime as globals.

# Read data from the test table
table = o.get_table('custom_img_test_tb')
partition_spec = f"ds={args['bday']}"
with table.open_reader(partition=partition_spec) as reader:
    records = list(reader)

# Segment the extracted text
participles = [' | '.join(jieba.cut(record['c_customer_text'])) for record in records]

# Create the destination table if it does not exist
if not o.exist_table("participle_tb"):
    schema = Schema(
        columns=[Column(name='word_segment', type='string', comment='Segmentation result')],
        partitions=[Column(name='ds', type='string', comment='Partition field')],
    )
    o.create_table("participle_tb", schema)

# Define the output table and the output partition
output_partition = f"ds={args['bday']}"
output_table = o.get_table("participle_tb")

# If the partition does not exist, create it first
if not output_table.exist_partition(output_partition):
    output_table.create_partition(output_partition)

# Write one record per segmentation result to the output table
with output_table.open_writer(partition=output_partition, create_partition=True) as writer:
    for participle in participles:
        record = output_table.new_record()
        record['word_segment'] = participle
        writer.write(record)
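
The segmentation step in the node code can be tried outside DataWorks before you publish the image. The following is a minimal sketch: the helper `segment_texts` is an illustrative name, and the `cut` parameter is an assumption added here so that the joining logic can be exercised with any tokenizer. When `cut` is omitted, the helper falls back to `jieba.cut`, which must be installed.

```python
def segment_texts(texts, cut=None, sep=" | "):
    """Segment each text with `cut` and join the resulting tokens with `sep`."""
    if cut is None:
        # Assumes jieba is installed, as it is in the custom image of this example.
        import jieba
        cut = jieba.cut
    return [sep.join(cut(text)) for text in texts]


# With a whitespace tokenizer standing in for jieba:
# segment_texts(["a b c"], cut=str.split) returns ["a | b | c"]
```

Each element of the returned list corresponds to one value written to the `word_segment` column.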

On the Properties tab, configure the following key settings:
- Scheduling parameters: Name bday, value $[yyyymmdd].
- Scheduling Resource Group: Choose the same serverless resource group that you selected as the Test Resource Group when you published the image.
- Image: Select the published custom image bound to the current workspace.
Save, configure parameters, and run the node.
(Optional) Execute the following SQL statement in an ad hoc query to verify the output table contains data.
```
SELECT * FROM participle_tb WHERE ds='<partition date>';
```
Deploy the PyODPS node to the production environment.

Note
The image updated in data development won't sync to the production environment. You must publish the task to apply changes in production.

Build the custom image into a permanent image. For more information, see Step 5: Build a permanent image.

Appendix: View official images

Log on to the DataWorks console, switch to the region where your DataWorks workspace is located, and click Image Management in the left-side navigation pane.
View the official images available for DataWorks. The following official images are supported:
- DataWorks Shell node official image: Supports Shell task types.
- DataWorks PyODPS node official image: Supports PyODPS 2 and PyODPS 3 task types.
- DataWorks EMR datalake 5.15.1 version official image: Supports EMR Spark, EMR Spark SQL, and EMR SHELL task types.
  
  Note
  You can use this image to submit tasks to EMR DataLake clusters of version 5.15.1.
- DataWorks CDH node official image: Supports CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala task types.

References

When using a custom image, you must select a serverless resource group for scheduling. For more information about serverless resource groups, see Create and use serverless resource groups.
To create a custom CDH task runtime environment when developing a custom image, see Develop tasks based on a self-built Hadoop cluster.
For additional details on PyODPS, see PyODPS.