
DataWorks:Custom images

Last Updated:Oct 11, 2025

Create a custom image when the default DataWorks runtime lacks required dependencies for your PyODPS or Shell tasks (e.g., Python libraries like pandas or jieba). Custom images package all dependencies into a reusable, standardized runtime environment, ensuring consistency and significantly improving development efficiency.
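Before creating an image, it can help to confirm which packages the default runtime is actually missing. The sketch below is generic Python (not a DataWorks API); the package names are illustrative:

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if the package can be imported in the current runtime."""
    return importlib.util.find_spec(name) is not None

# Packages your task needs; any that are missing must go into the custom image.
needed = ["pandas", "jieba"]
missing = [pkg for pkg in needed if not has_package(pkg)]
print(missing)
```

Run this in a temporary node on the target resource group; whatever it prints is the list to put in the image's installation packages.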

Usage notes

  • Edition requirements: Available image quotas vary by DataWorks edition. See Quotas and limits below.

  • Resource group support: Custom images work only with serverless resource groups.

    Legacy resource groups: Use O&M Assistant to install external dependencies.
  • Permissions required: You need one of the following policies: AliyunDataWorksFullAccess or ModifyResourceGroup.

    For more information, see Product and console access control: RAM Policy.

Quotas and limits

  • Image count limits:

    • Basic Edition and Standard Edition: 10.

    • Professional Edition: 50.

    • Enterprise Edition: 100.

  • Build concurrency: Maximum 2 concurrent builds per region.

  • ACR image requirements:

    • Instance edition: Enterprise edition.

    • Architecture: AMD64.

    • Image size: Maximum 5 GB per image.

  • Persistent builds: Only supported for images built from DataWorks official images (not ACR images).

  • Supported node types:

    | Node type | Build from official images | Build from ACR images |
    | --- | --- | --- |
    | PyODPS2 | Supported | Not supported |
    | PyODPS3 | Supported | Not supported |
    | EMR Spark | Supported | Not supported |
    | EMR Spark SQL | Supported | Not supported |
    | EMR SHELL | Supported | Not supported |
    | Shell | Supported | Supported |
    | Python | Supported | Supported |
    | Notebook | Not supported | Supported |
    | CDH | Supported | Not supported |
    | Assignment node | Supported | Not supported |

Procedure

Create a custom image

Choose one of three methods to create your custom image:

Option 1: Build from DataWorks official images

  1. Log in to the DataWorks console and click Image Management in the left navigation pane.

  2. On the DataWorks Official Images tab, select your base image and click Create Custom Image in the Actions column.

  3. Configure the following parameters:

    • Image Name/ID: The selected official image. You can switch to a different official image if needed.

    • Visible Scope: Visible Only to Creator or Visible to all.

    • Module: Currently limited to DataStudio.

    • Supported Task Type: Select the node types that can use this image.

    • Installation Package: Add third-party packages using one of these methods:

      • Quick install: Select Python2, Python3, or Yum from the dropdown and choose packages.

      • Script mode: Select Script and manually enter installation commands.

      Important

      To install a third-party package from the Internet, or to use a package with Internet-hosted dependencies, the virtual private cloud (VPC) attached to the serverless resource group must have Internet access.

  4. Click OK.

Option 2: Build from ACR Images

To create custom images from ACR, enable Container Registry first.

  1. Log in to the DataWorks console and click Image Management in the left navigation pane.

  2. On the Custom Images tab, click Create Image and configure:

    • Reference Type: Select Alibaba Cloud Container Registry Image.

    • Image Instance ID: Select your ACR Enterprise Edition instance.

    • Image Namespace: Select a namespace from the instance.

    • Image Repository: Select an image repository.

    • Image Version: Select the image version to use.

    • VPC to Associate: Select the VPC bound to your ACR instance. For more information, see Configure VPC access.

      Important

      DataWorks supports selecting only one VPC per ACR instance.

    • Synchronize to MaxCompute: Defaults to No. Available only for ACR instances running Standard Edition or higher.

    • Visible Scope: Visible Only to Creator or Visible to all.

    • Module: Currently limited to DataStudio.

    • Supported Task Type: ACR images use an entrypoint of the form: startup command + task_script_path.

      • Shell: The default command.

      • Python: Ensure your ACR base image includes a Python runtime.

      • Notebook:

        • Use the DataWorks Notebook base image: dataworks-public-registry.cn-shanghai.cr.aliyuncs.com/public/dataworks-notebook:py3.11-ubuntu22.04-20241202.

        • Ensure your build environment has Internet access to pull this base image.

  3. Click OK.
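Conceptually, the entrypoint described above launches the task as the startup command followed by the script path. The sketch below is an illustration only (the real DataWorks runtime is more involved); the local interpreter stands in for the image's Python runtime:

```python
import os
import subprocess
import sys
import tempfile

def launch(startup_command: str, task_script_path: str) -> int:
    """Sketch of the ACR entrypoint: <startup command> <task_script_path>."""
    return subprocess.run([startup_command, task_script_path]).returncode

# Simulate with a throwaway task script.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('task ran')")
    script = f.name

exit_code = launch(sys.executable, script)  # effectively: python /path/to/task.py
os.unlink(script)
print(exit_code)  # 0 on success
```

This is why a Python task type requires the image to ship a Python runtime: the startup command must exist inside the image.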

Option 3: Build from personal development environment

Data Studio supports creating images from personal development environments. For more information, see Create a DataWorks image from a personal development environment.

Test and publish the image

  1. In the DataWorks console, go to Image Management > Custom Images.

  2. Locate your image and click Publish in the Actions column.

  3. If the test fails, you can click image > Modify to update the image configuration.

Notes:

  • Resource group: Select a serverless resource group.

  • VPC consistency: For ACR or personal environment images, ensure the Serverless resource group and ACR instance use the same VPC.

  • Internet access: If the test times out while fetching packages, verify your test resource group's VPC has Internet access.

Assign the image to workspaces

After publishing, assign the image to workspaces:

  1. On the Image Management > Custom Images tab, find your published image.

  2. Click image > Change Workspace in the Actions column.

Use the image in tasks

New version of Data Studio:

  1. Go to Data Studio: On the DataWorks Workspaces page, switch to your target region, find your workspace, and click Shortcuts > DataStudio.

  2. Configure the image: In your task node, click Scheduling in the right pane.

    • Resource Group: Select a serverless resource group.

      If the target resource group is not displayed, go to the Resource Group page and click Associate Workspace.
      Important

      Ensure this resource group matches the test resource group used when publishing the image.

    • Image: Select your published custom image.

      Changes to the image require republishing the node to take effect in production.


  3. Debug the node: In the Debugging Configurations pane, configure Computing Resource, Resource Group, CUs For Computing, Image, and Script Parameters, then click Running Duration in the toolbar.

  4. Publish the node: Click Publish in the toolbar to deploy to production.

Legacy version of DataStudio

  1. Go to DataStudio: Log on to the DataWorks console, switch to your region, click Data Development and O&M > Data Development, select your workspace, and click Go to Data Development.

  2. Configure the image: In your task node, click Properties in the right pane.

    • Resource Group: Select a serverless resource group.

      If the target resource group is not displayed, go to the Resource Group page and click Associate Workspace.
      Important

      Ensure this resource group matches the test resource group used when publishing the image.

    • Image: Select your published custom image.

      Changes to the image require republishing the node to take effect in production.


  3. Debug the node: Click Run with Parameters (image), configure Resource Group Name, CUs for Running, and Image, then click Run.

  4. Publish the node: Click Save and Submit to deploy to production.

Build a persistent image

Important

We strongly recommend building persistent images after publishing and testing. This prevents runtime failures caused by upstream dependency changes or unspecified versions.

Regular custom images redeploy on every run, increasing execution time and compute costs. Persistent images build once and reuse indefinitely, improving efficiency and reducing costs.
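As a back-of-the-envelope illustration (all numbers below are assumptions, not DataWorks measurements), the per-run reinstall overhead that a persistent image avoids adds up quickly:

```python
# Assumed dependency-reinstall time for a regular (non-persistent) image, per run.
install_seconds_per_run = 120
runs_per_day = 24

# Daily startup overhead a persistent image would eliminate, in minutes.
daily_overhead_minutes = install_seconds_per_run * runs_per_day / 60
print(daily_overhead_minutes)  # 48.0
```

At an assumed two minutes of setup per run, an hourly task spends about 48 minutes per day just reinstalling packages, all of which is billed compute time.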

  1. Go to Image Management > Custom Images and locate your published image.

  2. Click image > Build in the Actions column.

  3. In the Resource Group for Which You Want to Create Image dialog, select a resource group and click Continue.

    Important

    To prevent network-related failures, ensure this resource group matches the test resource group used when publishing.

  4. Building takes 5-10 minutes depending on image size. Upon success, the status changes to Published (Build Succeeded).

Billing

Image builds are charged as: CU count × Build duration. The system allocates 0.5 CUs by default. For more information about billing, see Billing for Serverless resource groups.
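The formula above can be applied directly; for example (the build duration here is an assumption, and actual prices are in the linked billing topic):

```python
DEFAULT_BUILD_CUS = 0.5  # CUs the system allocates to a build by default

def build_cu_hours(duration_minutes: float, cus: float = DEFAULT_BUILD_CUS) -> float:
    """Image build usage: CU count x build duration, expressed in CU-hours."""
    return cus * duration_minutes / 60

# A 10-minute build at the default 0.5 CUs:
print(round(build_cu_hours(10), 4))  # 0.0833
```

Multiply the CU-hour figure by your region's serverless resource group unit price to estimate the cost of a build.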

Production best practices

Follow these recommendations for stable, efficient, cost-effective use in production:

  • Persistent image: Build persistent images from published configurations with stable dependencies. This eliminates reinstallation on every run, reducing startup time, compute costs, and improving stability.

  • Environment consistency: Ensure VPCs and network configurations match across test, build, and production serverless resource groups, especially when accessing private ACR repositories or the internet.

  • Version locking: When installing dependencies via Script mode, always specify versions such as pip install pandas==1.5.3. This prevents unexpected behavior from upstream library updates.

  • Rollback plan: If production tasks fail after updating an image:

    • Roll back via task publishing history.

    • Revert to a previous stable version in scheduling configuration.
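A lightweight way to enforce the version-locking recommendation is to lint your Script-mode package list for unpinned entries before building. This naive check only looks for `==` (real requirement specifiers also allow `~=`, `>=`, extras, and environment markers):

```python
def is_pinned(requirement: str) -> bool:
    """True if the requirement pins an exact version, e.g. pandas==1.5.3."""
    return "==" in requirement

# Package list as it would appear in Script-mode install commands.
packages = ["pandas==1.5.3", "jieba"]
unpinned = [p for p in packages if not is_pinned(p)]
print(unpinned)  # ['jieba']
```

Anything the check reports should be given an explicit version before you build a persistent image, so rebuilds remain reproducible.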

Example

This example uses a custom image with PyODPS to perform word segmentation on a text column in a MaxCompute table and write the results to a second table for downstream scheduling.

  1. Create test data.

    1. Create a DataWorks workspace with attached MaxCompute resources. For more information, see Create a workspace, Add a data source or register a cluster to a workspace, and Associate a computing resource.

    2. In Data Studio, create an ODPS node (legacy) or MaxCompute SQL node (new version):

      Note

      This example uses scheduling parameters. In the Scheduling pane, set the parameter name to bday and the value to $[yyyymmdd].

      Create a test table.

      -- Create the test table
      CREATE TABLE IF NOT EXISTS custom_img_test_tb
      (
          c_customer_id BIGINT NOT NULL,
          c_customer_text STRING NOT NULL,
          PRIMARY KEY (c_customer_id)
      )
      COMMENT 'Test table for custom image demo'
      PARTITIONED BY (ds STRING COMMENT 'Partition')
      LIFECYCLE 90;

      -- Insert test data
      INSERT INTO custom_img_test_tb PARTITION (ds='${bday}') (c_customer_id, c_customer_text) VALUES
      (1, 'The sky is getting dark and it looks like it will snow. Would you like a cup of wine?'),
      (2, 'The moon sets, crows caw, and frost fills the sky. I lie awake, facing the river maples and fishing lights.'),
      (3, 'Mountains and rivers seem to block the way. But among shady willows and bright flowers, another village appears.'),
      (4, 'I sleep in spring, unaware of the dawn. Everywhere I hear the birds sing.'),
      (5, 'Thoughts on a quiet night. Moonlight shines before my bed. I mistake it for frost on the ground.'),
      (6, 'The bright moon rises over the sea. We share this moment, though we are far apart.'),
      (7, 'The swallows that once graced the halls of nobles now fly into the homes of common people.'),
      (8, 'A line of egrets flies up to the blue sky. My window frames the ancient snow on the Western Hills.'),
      (9, 'When life is good, enjoy it to the fullest. Do not let the golden goblet face the moon empty.'),
      (10, 'Heaven gave me talent, so it must be used. A thousand pieces of gold, once spent, can be earned again.');
    3. Save and publish the node.

  2. Create a custom image.

    Create a custom image with these key parameters:

    • Image Name/ID: Select dataworks_pyodps_task_pod (the DataWorks official PyODPS image).

    • Supported Task Type: Select PyODPS2 and PyODPS3.

    • Installation Package: Select Python 3 and add jieba.

  3. Publish and assign images.

    Publish the image and assign it to your workspace. For more information, see Test and publish the image and Assign the image to workspaces.

  4. Create a PyODPS task.

  5. In Data Studio, create a PyODPS3 node with the following code, which uses the custom image:

    import jieba
    from odps.models import TableSchema as Schema, Column

    # Read data from the source table. The MaxCompute entry object `o` and the
    # scheduling parameters `args` are provided by the PyODPS node runtime.
    table = o.get_table('custom_img_test_tb')
    partition_spec = f"ds={args['bday']}"
    with table.open_reader(partition=partition_spec) as reader:
        records = [record for record in reader]

    # Perform word segmentation
    participles = [' | '.join(jieba.cut(record['c_customer_text'])) for record in records]

    # Create the destination table if it does not exist
    if not o.exist_table("participle_tb"):
        schema = Schema(
            columns=[Column(name='word_segment', type='string', comment='Segmentation result')],
            partitions=[Column(name='ds', type='string', comment='Partition field')]
        )
        o.create_table("participle_tb", schema)

    # Write results to the destination table
    output_partition = f"ds={args['bday']}"
    output_table = o.get_table("participle_tb")

    # Create the partition if it does not exist
    if not output_table.exist_partition(output_partition):
        output_table.create_partition(output_partition)

    # Write segmentation results, one record per segmented row
    with output_table.open_writer(partition=output_partition) as writer:
        for participle in participles:
            record = output_table.new_record()
            record['word_segment'] = participle
            writer.write(record)
  6. Configure scheduling parameters in the right pane:

    • Scheduling Parameters: bday = $[yyyymmdd].

    • Resource Group: Same Serverless group used for image testing.

    • Image: Your published custom image.

  7. Debug the node.

    • Legacy version: Click Run with Parameters (image), configure settings, and click Run.

    • New version: Configure settings in the Debugging Configurations pane and click Running Duration in the toolbar.

  8. (Optional) Verify results with a SQL query:

    -- Replace <partition_date> with the actual partition date
    SELECT * FROM participle_tb WHERE ds = '<partition_date>';
  9. Publish the PyODPS node to production.

    Note

    Image changes in Data Studio don't sync to production automatically. You need to publish the task for changes to take effect. For more information, see Deploy nodes or Node or workflow deployment.

  10. Build a persistent image.

    Build your custom image as a persistent image. For more information, see Build a persistent image.

References

Appendix: Installation command reference

When using Script mode to configure installation commands:

  • For PyODPS 2 Dependencies:

    pip install <package_name>
    Note

    If prompted to upgrade pip, run: pip install --upgrade pip.

  • For PyODPS 3 dependencies:

    /home/tops/bin/pip3 install <package_name>
    Note
    • If prompted to upgrade pip, run: /home/tops/bin/pip3 install --upgrade pip.

    • If you encounter error /home/admin/usertools/tools/cmd-0.sh: line 3: /home/tops/bin/python3: No such file or directory, submit a ticket to request permissions.

  • Python mirror sources

    Switch to one of these public mirrors as needed:

    | Organization | Mirror URL |
    | --- | --- |
    | Alibaba Cloud (Aliyun) | https://mirrors.aliyun.com/pypi/simple/ |
    | Tsinghua University | https://pypi.tuna.tsinghua.edu.cn/simple |
    | USTC | https://pypi.mirrors.ustc.edu.cn/simple/ |

    Important

    No Internet access is required when using the Alibaba Cloud mirror.
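To use one of these mirrors in your Script-mode installation commands, pass it to pip with the -i (--index-url) flag; the package name here is illustrative:

```shell
# Install from the Alibaba Cloud PyPI mirror instead of the default index
pip install -i https://mirrors.aliyun.com/pypi/simple/ jieba

# For PyODPS 3 nodes, use the bundled interpreter's pip
/home/tops/bin/pip3 install -i https://mirrors.aliyun.com/pypi/simple/ jieba
```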