
DataWorks: Manage images

Last Updated: Apr 15, 2025

DataWorks provides the image management feature that you can use to create and manage task running environments. If a specific environment is required for task running, you can use the image management feature to create a custom image that integrates required development packages and dependencies. For example, you can use a custom image to install third-party dependencies that are required to run PyODPS tasks. This topic describes how to use the image management feature to create a custom image.

Background information

By default, DataWorks runs tasks on the default standard image and selects an appropriate image based on the type of task that you want to run. Official images are pre-configured base images that provide a standardized runtime environment for tasks of specific types. Custom images extend the official images: you can add the packages and dependencies that your scenario requires, which makes the runtime environment of data processing tasks more flexible and can improve execution efficiency.

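  • Methods to create a custom image: create a custom image based on a DataWorks official image, create a custom image by referencing an image in Container Registry, or create a custom image based on a personal development environment instance. For more information, see Step 2: Create a custom image in this topic.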

  • Node types that support custom images and image creation methods:

    Node type | Creation by referencing an image in Container Registry | Creation based on a DataWorks official image
    PyODPS 2 | [icon] | [icon]
    PyODPS 3 | [icon] | [icon]
    EMR Spark | [icon] | [icon]
    EMR Spark SQL | [icon] | [icon]
    EMR Shell | [icon] | [icon]
    Shell | [icon] | [icon]
    Python | [icon] | [icon]
    Notebook | [icon] | [icon]

Usage notes

  • The image management feature can be used together with only a serverless resource group.

    Note

    If a third-party package is required when you use an old-version exclusive resource group for scheduling to run PyODPS nodes, you can use the O&M Assistant feature to install the third-party package. For more information, see Use an exclusive resource group for scheduling to configure a third-party open source package (not recommended).

  • The maximum number of custom images that can be created varies based on the DataWorks edition.

    • DataWorks Basic Edition and Standard Edition: 10

    • DataWorks Professional Edition: 50

    • DataWorks Enterprise Edition: 100

  • Only DataWorks Professional Edition or a more advanced edition supports the image building feature.

  • A maximum of two images can be built at the same time in each region.
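The limits in the preceding list can be encoded as a small lookup, for example in a script that checks whether a new image or build request is likely to be accepted. This is a minimal sketch: the function and constant names are hypothetical, and it assumes that Professional Edition and Enterprise Edition are the only editions in the list that support image building.

```python
# Limits transcribed from the usage notes; all names here are illustrative only.
MAX_CUSTOM_IMAGES = {
    "Basic": 10,
    "Standard": 10,
    "Professional": 50,
    "Enterprise": 100,
}

# Assumption: of the editions listed above, only these two support image building.
BUILD_CAPABLE_EDITIONS = {"Professional", "Enterprise"}

# At most two images can be built at the same time in each region.
MAX_CONCURRENT_BUILDS_PER_REGION = 2


def can_create_image(edition: str, existing_images: int) -> bool:
    """Return True if one more custom image fits within the edition quota."""
    return existing_images < MAX_CUSTOM_IMAGES[edition]


def can_start_build(edition: str, builds_in_progress: int) -> bool:
    """Return True if the edition supports building and a build slot is free."""
    return (
        edition in BUILD_CAPABLE_EDITIONS
        and builds_in_progress < MAX_CONCURRENT_BUILDS_PER_REGION
    )
```

For example, `can_create_image("Standard", 10)` returns `False` because the Standard Edition quota of 10 images is already used.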

Prerequisites

Step 1: Go to the Image Management page

  1. Log on to the DataWorks console.

  2. Go to the Image Management page.

    In the left-side navigation pane, click Image Management.


Step 2: Create a custom image

When you create a custom image in the DataWorks console, you can set the Reference Type parameter to DataWorks Official Image or Alibaba Cloud Container Registry Image. The parameters that are configured to create a custom image vary based on the reference type that you select.

Method 1: Create a custom image based on a DataWorks official image

  1. Configure parameters that are described in the following table.

    Parameter | Description
    Image Name | The name of the custom image.
    Image Description | The description of the custom image.
    Reference Type | Select DataWorks Official Image.
    Image Namespace | The value of this parameter is fixed to DataWorks Default.
    Image Repository | The value of this parameter is fixed to DataWorks Default.
    Image Name/ID | Select a DataWorks official image based on which you want to create a custom image.
    Visible Scope | The scope in which the custom image is visible. Valid values: Visible Only to Creator and Visible to all.
    Module | The service to which the custom image can be applied. This parameter can only be set to DataStudio.
    Supported Task Type |
      • dataworks_shell_task_pod: available for Shell tasks
      • dataworks_pyodps_task_pod: available for PyODPS 2 and PyODPS 3 tasks
      • dataworks_emr_datalake_5.15.1_task_pod: available for E-MapReduce (EMR) Spark, EMR Spark SQL, and EMR Shell tasks
    Installation Package | The third-party package that you want to use. You can use one of the following methods to install a third-party package:
      • Quick installation: Select Python2, Python3, or Yum from the Installation Package drop-down list and then select a desired environment or resource.
      • Manual input: Select Script from the Installation Package drop-down list. Then, write commands in the command box to install a desired third-party package. You can run one of the following commands to install a third-party package:
        • pip install xx for Python 2
        • /home/tops/bin/pip3 install 'urllib3<2.0' for Python 3
        • yum install -y git
        • wget git

  2. Click OK.
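After the image is created, it can help to confirm that the packages installed through the Installation Package parameter are importable. The following is a minimal sketch that you could run in a Python 3 task that uses the image. The helper name `missing_modules` is hypothetical, and `urllib3` is used only because it appears in the pip3 example above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found by the current interpreter."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Replace the list with the top-level modules that your tasks import.
    missing = missing_modules(["urllib3"])
    if missing:
        print(f"Modules missing from the image: {missing}")
    else:
        print("All required modules are importable.")
```

The check uses `importlib.util.find_spec`, which locates a module without importing it, so it does not trigger side effects of the packages being checked. Pass top-level module names only.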

Method 2: Create a custom image by referencing an image in Container Registry

Limits

  • DataWorks allows you to reference only an image in Container Registry Enterprise Edition.

  • DataWorks allows you to access a Container Registry instance used to build an image only over a VPC.

  • A Container Registry image that you reference in DataWorks cannot exceed 5 GB in size.

  1. Configure parameters that are described in the following table.

    Parameter | Description
    Image Name | The name of the custom image.
    Image Description | The description of the custom image.
    Reference Type | Select Alibaba Cloud Container Registry Image.
    Image Instance ID | Select the ID of a Container Registry Enterprise Edition instance that is created in Alibaba Cloud Container Registry. For information about how to create an instance, see Create a Container Registry Enterprise Edition instance.
    Image Namespace | Select a namespace based on the selected instance. For information about how to create a namespace, see Create a namespace.
    Image Repository | Select an image repository based on the selected instance. For information about how to create an image repository, see Create an image repository.
    Image Version | Select a version for the custom image that you want to create based on the selected image repository.
    VPC to Associate | Select a VPC with which the instance needs to be associated. For information about how to configure a VPC, see Configure a VPC ACL.
    Visible Scope | The scope in which the custom image is visible. Valid values: Visible Only to Creator and Visible to all.
    Module | The service to which the custom image can be applied. This parameter can only be set to DataStudio.
    Supported Task Type | A Container Registry image is started by using startup commands and a task code file path. The following information describes supported task types and default startup commands:
      • Shell
      • Python: If you want to apply a custom image that is created based on an Alibaba Cloud Container Registry image to a Python task, you must make sure that the Container Registry image contains a Python environment.
      • Notebook:
        • If you want to apply a custom image that is created based on an Alibaba Cloud Container Registry image to a notebook task, use the following basic notebook image provided by DataWorks as the base image of the Container Registry image to provide a runtime environment for the notebook task: dataworks-public-registry.cn-shanghai.cr.aliyuncs.com/public/dataworks-notebook:py3.11-ubuntu22.04-20241202
        • Make sure that the environment that you use to create an image has access to the Internet. This way, you can obtain the basic notebook image provided by DataWorks as expected.

  2. Click OK.
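An image address such as the basic notebook image above follows the common registry/namespace/repository:version layout, which lines up with the Image Namespace, Image Repository, and Image Version parameters. The following sketch splits an address into those parts. The `ImageRef` type and `parse_image_ref` function are hypothetical helpers, and the mapping to the console parameters is an assumption based on the parameter names.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str    # registry endpoint of the Container Registry instance
    namespace: str   # assumed to correspond to Image Namespace
    repository: str  # assumed to correspond to Image Repository
    version: str     # assumed to correspond to Image Version (the tag)


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'registry/namespace/repository:version' into its four parts."""
    path, sep, version = ref.rpartition(":")
    if not sep or not version or "/" in version:
        raise ValueError(f"Image address has no version tag: {ref!r}")
    parts = path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected registry/namespace/repository: {ref!r}")
    return ImageRef(parts[0], parts[1], parts[2], version)


notebook_base = parse_image_ref(
    "dataworks-public-registry.cn-shanghai.cr.aliyuncs.com"
    "/public/dataworks-notebook:py3.11-ubuntu22.04-20241202"
)
print(notebook_base.namespace, notebook_base.repository, notebook_base.version)
```

The parser rejects addresses without a version tag instead of guessing a default, because the Image Version parameter requires an explicit selection.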

Method 3: Create a custom image based on a personal development environment instance

New-version Data Studio allows you to create an image for a personal development environment. For more information, see Create an image of a personal development environment instance.

Step 3: Publish the custom image

  1. On the Custom Images tab, find the created custom image.

  2. Click Publish in the Actions column.

  3. In the Publish Image panel, configure the Test Resource Group parameter and click Test to the right of Test Result.

    Note

    Select a serverless resource group for Test Resource Group.

  4. After the test succeeds, click Publish.

Note
  • Only images that pass the test can be published.

  • If the custom image installs a third-party package from the Internet and the image does not pass the test after a long period of time, check whether the VPC with which the selected test resource group is associated can access the Internet. If the VPC cannot access the Internet, enable Internet access for the VPC. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.

  • If images fail to pass the test, you can perform the following operations to modify image configurations: Find a desired custom image, move the pointer over the image icon in the Actions column, and then select Modify.
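When a test hangs because of missing Internet access, a quick connectivity probe run in a task on the same resource group can help confirm the cause. The following is a minimal sketch. The `can_reach` helper is hypothetical, and which host to probe depends on where your packages are downloaded from.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_reach("pypi.org", 443)` probes the public Python package index, which is an assumption about the package source. A result of `False` suggests that the VPC associated with the resource group has no route to the Internet.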

Step 4: Associate the custom image with a workspace

  1. On the Custom Images tab, find the custom image that is published.

  2. Move the pointer over the image icon in the Actions column and select Change Workspace to associate the custom image with a workspace.

Step 5: Build a permanent image

After you publish the custom image in Step 3, you can use it in your business. However, each time a node that uses the custom image runs, DataWorks redeploys the image environment and downloads the third-party packages again. This lengthens the node running duration and may generate additional computing fees. To avoid this, you can build the custom image as a permanent image. The same image environment is then reused each time a node runs, which keeps the runtime environment consistent and reduces task running duration, computing costs, and traffic costs.

Note

Only custom images that are created based on official images can be built as permanent images.

Perform the following steps to build a permanent image:

  1. Log on to the DataWorks console. In the top navigation bar, select a desired region. Then, click Image Management in the left-side navigation pane.

  2. On the Image Management page, click the Custom Images tab. On the Custom Images tab, find the custom image that is published.

  3. Move the pointer over the image icon in the Actions column and select Create.

  4. In the Resource Group for Which You Want to Create Image dialog box, select a resource group that you want to use from the drop-down list and click Continue.

    Note
    • It takes approximately 5 to 10 minutes to complete image building. The actual time that is required varies based on the size of the image that you want to build.

    • You are charged computing fees when you build an image. The computing fees are calculated by using the following formula: 0.5 CUs × Duration for image building. For more information, see Billing of data computing.

    • An image may fail to be built due to network exceptions. To prevent this issue, make sure that the resource group that you select in the Resource Group for Which You Want to Create Image dialog box is the test resource group that you selected in Step 3: Publish the custom image in this topic.
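The billing formula in the preceding note can be worked through for the typical build durations. This is a sketch under one assumption: the duration is measured in hours, so the result is in CU-hours. The unit price is not stated in this topic and is not modeled. The function name is hypothetical.

```python
BUILD_CUS = 0.5  # CUs consumed while an image is being built (from the note above)


def build_cu_hours(duration_minutes: float) -> float:
    """Return the CU-hours consumed by one image build (assumes hourly metering)."""
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    return BUILD_CUS * duration_minutes / 60.0


# Typical builds take approximately 5 to 10 minutes.
for minutes in (5, 10):
    print(f"{minutes} min build: {build_cu_hours(minutes):.4f} CU-hours")
```

Under this assumption, a 5-minute build consumes about 0.0417 CU-hours and a 10-minute build about 0.0833 CU-hours. See Billing of data computing for the actual metering rules.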

What to do next: Use the custom image

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Find a desired node on the DataStudio page and double-click the node name to go to the configuration tab of the node. Click Properties in the right-side navigation pane and configure parameters in the Resource Group section.

    • Resource Group: Select a serverless resource group.

      Note
      • Make sure that the Resource Group parameter is set to the test resource group that you selected when you published the custom image to ensure smooth running of the node.

      • If the desired resource group is not displayed in the Resource Group drop-down list, check whether the resource group is associated with the current workspace. If it is not, go to the Resource Groups page, find the desired resource group, and then click Associate Workspace in the Actions column.

    • Image: Select a custom image that is published.


  3. Save and commit the node.

    Note

    The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.

Example: Use an image in a PyODPS node to segment data in Chinese

Suppose that you want to segment a column of Chinese text in a MaxCompute table and store the segmentation result in another table for a descendant node to use. To do so, pre-install the segmentation tool Jieba in a custom image, and then run a PyODPS task that uses the image to segment the text and write the result to the destination table. The descendant node can then read the segmented data directly.

  1. Create test data.

    1. Add a MaxCompute data source to DataWorks, and associate the MaxCompute data source with DataStudio. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.

    2. In DataStudio, create an ODPS node, create a test table, and then add test data to the table.

      Note

      In the following example, a scheduling parameter is used. On the Properties tab in the right-side navigation pane of the configuration tab of the node, add a parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.

      Create a test table.

      -- Create a test table.
      CREATE TABLE IF NOT EXISTS custom_img_test_tb
      (
          c_customer_id BIGINT NOT NULL,
          c_customer_text STRING NOT NULL,
          PRIMARY KEY (c_customer_id)
      )
      COMMENT 'TABLE COMMENT'
      PARTITIONED BY (ds STRING COMMENT 'Partition')
      LIFECYCLE 90;
      
      -- Insert test data into the test table.
      INSERT INTO custom_img_test_tb PARTITION (ds='${bday}') (c_customer_id, c_customer_text) VALUES
      (1, '晚来天欲雪,能饮一杯无? '),
      (2, '月落乌啼霜满天,江枫渔火对愁眠。 '),
      (3, '山重水复疑无路,柳暗花明又一村。 '),
      (4, '春眠不觉晓,处处闻啼鸟。 '),
      (5, '静夜思,床前明月光,疑是地上霜。 '),
      (6, '海上生明月,天涯共此时。 '),
      (7, '旧时王谢堂前燕,飞入寻常百姓家。 '),
      (8, '一行白鹭上青天,窗含西岭千秋雪。 '),
      (9, '人生得意须尽欢,莫使金樽空对月。 '),
      (10, '天生我材必有用,千金散尽还复来。 ');
    3. Save and deploy the node.

  2. Create a custom image.

    Follow the instructions that are described in Step 2: Create a custom image in this topic to create a custom image. Settings of key parameters:

    • Image Name/ID: Select dataworks_pyodps_task_pod.

    • Supported Task Type: Select PyODPS 2 and PyODPS 3.

    • Installation Package: Select Python3 and jieba.

  3. Publish the custom image and associate the custom image with a workspace. For more information, see the Step 3: Publish the custom image and Step 4: Associate the custom image with a workspace sections in this topic.

  4. Use the custom image in a scheduling task.

    1. In DataStudio, create and configure a PyODPS 3 node.

      Use the custom image

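      import jieba
      from odps.models import TableSchema as Schema, Column
      
      # In a DataWorks PyODPS node, `o` (the MaxCompute entry object) and `args`
      # (the scheduling parameters) are predefined globals and need no import.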
      
      # Read data from the test table.
      table = o.get_table('custom_img_test_tb')
      partition_spec = f"ds={args['bday']}"
      with table.open_reader(partition=partition_spec) as reader:
          records = [record for record in reader]
      
      # Segment the extracted text.
      participles = [' | '.join(jieba.cut(record['c_customer_text'])) for record in records]
      
      # Create a destination table.
      if not o.exist_table("participle_tb"):
          schema = Schema(columns=[Column(name='word_segment', type='string', comment='Segmentation result')], partitions=[Column(name='ds', type='string', comment='Partition field')])
          o.create_table("participle_tb", schema)
      
      # Write the segmentation result to the destination table.
      # Define an output partition and an output table.
      output_partition = f"ds={args['bday']}"
      output_table = o.get_table("participle_tb")
      
      # If the partition does not exist, create a partition first.
      if not output_table.exist_partition(output_partition):
          output_table.create_partition(output_partition)
      
      # Write the segmentation result to the output table, one new record per row.
      with output_table.open_writer(partition=output_partition, create_partition=True) as writer:
          for participle in participles:
              record = output_table.new_record()
              record['word_segment'] = participle
              writer.write(record)
    2. On the Properties tab in the right-side navigation pane of the configuration tab of the node, configure the following key settings:

      • Add a scheduling parameter whose name is bday and value is $[yyyymmdd] in the Scheduling Parameter section.

      • Select a serverless resource group, which is the test resource group that you used when you published the custom image, as a resource group for scheduling.

      • Select the custom image that is published and associated with the current workspace.

    3. Save and run the node with parameters configured.

    4. Optional. Create an ad hoc query and execute the following SQL statement to check whether the output table contains data:

      SELECT * FROM participle_tb WHERE ds='<Partition date>';


    5. Deploy the PyODPS node to the production environment.

      Note

      The image that is selected in DataStudio cannot be synchronized to the production environment. You must follow the instructions that are described in Deploy nodes to deploy the node to allow the image to take effect in the production environment.

  5. Create the custom image as a permanent image. For more information, see the Step 5: Build a permanent image section in this topic.
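The example reads and writes the partition `ds=<bday>`, where bday is the scheduling parameter set to $[yyyymmdd]. The following sketch shows how the partition specification is expected to be derived, assuming that $[yyyymmdd] resolves to the scheduled run time of the instance formatted as year, month, and day. The helper names are hypothetical and only mirror the f-string used in the PyODPS code.

```python
from datetime import datetime


def resolve_bday(scheduled_time: datetime) -> str:
    """Format a scheduled run time the way $[yyyymmdd] is assumed to resolve."""
    return scheduled_time.strftime("%Y%m%d")


def partition_spec(scheduled_time: datetime) -> str:
    """Build the partition specification used by the reader and the writer."""
    return f"ds={resolve_bday(scheduled_time)}"


# An instance scheduled for April 15, 2025 works on the partition ds=20250415.
print(partition_spec(datetime(2025, 4, 15, 0, 30)))
```

Because the ODPS node that inserts the test data and the PyODPS node both use the same bday parameter, they work on the same partition only when their instances are scheduled for the same date.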

Appendix: View official images

The following table describes the official images supported by DataWorks. You can also go to the Image Management page to view the official images.

Image name | Supported task type | Description
dataworks_pyodps_py311_task_pod | PyODPS 3 | This image is suitable for only PyODPS 3 tasks.
dataworks_pairec_task_pod | PyODPS 3 | This image is suitable for PyODPS 3 tasks to run an algorithm process generated by the PAI-Rec engine.
dataworks_pyodps_task_pod | PyODPS 2 and PyODPS 3 | -
dataworks_emr_datalake_5.15.1_task_pod | EMR Spark, EMR Spark SQL, and EMR Shell | You can use this image to commit tasks in EMR DataLake clusters of V5.15.1.
dataworks_shell_task_pod | Shell | -
dataworks_python_task_pod | Python | -
dataworks_cdh_custom_task_pod | CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala | This image is available in the China (Beijing) and China (Zhangjiakou) regions. You must install the CDH Parcel directory for this image.
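To find which official image fits a node, the table above can be turned into a reverse lookup from task type to image names. This is a minimal sketch: the dictionary is transcribed from the table, and the `images_for` helper is hypothetical. The Image Management page remains the authoritative list.

```python
# Official images and their supported task types, transcribed from the table above.
OFFICIAL_IMAGES = {
    "dataworks_pyodps_py311_task_pod": ["PyODPS 3"],
    "dataworks_pairec_task_pod": ["PyODPS 3"],
    "dataworks_pyodps_task_pod": ["PyODPS 2", "PyODPS 3"],
    "dataworks_emr_datalake_5.15.1_task_pod": [
        "EMR Spark", "EMR Spark SQL", "EMR Shell",
    ],
    "dataworks_shell_task_pod": ["Shell"],
    "dataworks_python_task_pod": ["Python"],
    "dataworks_cdh_custom_task_pod": [
        "CDH Hive", "CDH Spark", "CDH Spark SQL",
        "CDH MR", "CDH Presto", "CDH Impala",
    ],
}


def images_for(task_type: str) -> list:
    """Return the official images that list the given task type, sorted by name."""
    return sorted(
        name for name, task_types in OFFICIAL_IMAGES.items()
        if task_type in task_types
    )


print(images_for("PyODPS 3"))
```

A task type that appears under several images, such as PyODPS 3, returns all of them, so the choice between a general-purpose and a specialized image stays with you.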
