DataWorks: Use datasets

Last Updated: Jun 21, 2026

You can use datasets in nodes, such as Shell, Python, and Notebook nodes, to read from and write to Object Storage Service (OSS) or Apsara File Storage NAS (NAS) during data development. You can also mount a dataset as storage when you create an instance of a personal development environment.

Important

We recommend reading Manage datasets to learn how to create a dataset.

Overview

Datasets let you read and write data stored in OSS and NAS from DataWorks. You can create multiple dataset versions, track changes, and revert to a previous version if needed.

Limitations

  • Datasets are supported only in the new version of DataStudio.

  • Resource group: You can access datasets from data development nodes only through a Serverless resource group.

  • Supported objects: Datasets are supported only in Shell nodes, Python nodes, Notebooks for basic development, and personal development environments. You can mount a maximum of 5 datasets to each object.

  • Storage type: Datasets support Object Storage Service (OSS) and Apsara File Storage NAS (NAS) using the NFS protocol.

  • Permissions: If a dataset mount target is set to read-only, you cannot modify or delete the folders or files within it. Any attempt to do so results in a permission error.

Use datasets in nodes

This section explains how to mount an OSS dataset to a node. In this example, you create a DataWorks dataset that is backed by OSS, mount the OSS path oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/ to the dataset mount path /mnt/data/dataset01, and then read from and write to data in the node code.

Prerequisites: create a dataset

  1. Create a bucket or create a file system.

    This example uses an OSS dataset. Create a bucket named datasets-oss in the China (Shanghai) region, and then create the /dataset01/v1 directory.

  2. Create a dataset.

    In this example, create an OSS dataset named datasets-oss and mount the OSS path oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/ to /mnt/data/dataset01.

1. Configure the dataset for a node

In the Debug Configuration of a Shell or Python node, configure the datasets-oss dataset.

Important
  • Before you publish the node, you must also add the dataset in the Scheduling Settings section.

  • To use a dataset, you must allocate at least 0.5 computing units (CUs) to the node.

Parameter

Description

Datasets

Specifies the dataset that can be accessed by the code of the current node.

  • If you use an OSS-based dataset, the first time you read data from the dataset you must grant the DataWorks resource group permission to access the OSS bucket that is configured for the dataset.

  • If you use a NAS-based dataset, ensure the VPC of the DataWorks resource group is connected to the VPC of the NAS mount target. For configuration details, see Network connectivity solutions.

In this example, select the OSS dataset datasets-oss created in DataWorks and select version V1.

Mount Path

The path that the node's code uses to access the dataset. This field is automatically populated with the Default Mount Path configured when the dataset is defined.

Important

If you mount multiple datasets to the same node, their mount paths cannot conflict.

Advanced Settings

This parameter is optional. You can specify the tools and parameters for accessing OSS data or the configurations for accessing NAS file systems in the JSON format.

  • If the dataset that you configure is OSS-based, DataWorks uses ossfs 2.0 to access the OSS data in your dataset path by default. You can use Advanced Settings to specify other tools to access OSS data. For details on available tools, see Advanced configuration examples. The following code shows the default configuration:

    {"mountOssType":"ossfs", "upload_concurrency":64}

  • If you add a NAS-based dataset, you can specify the relevant configurations (the nasOptions parameter) for accessing an Apsara File Storage NAS (NAS) file system using the NFS protocol. The following code shows the default configuration. To customize parameter values, see Manually mount an NFS file system.

    Important
    • Only NAS file systems that use the NFS protocol can be mounted.

    • For NAS, only one advanced configuration parameter is supported: nasOptions. To customize NAS mount parameters, set Advanced Settings to {"nasOptions":"<ParameterName1=ParameterValue>,<ParameterName2=ParameterValue>,..."}.

    {"nasOptions":"vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"}

Read Only

By default, you can read from and write to the dataset in the current node. If the dataset is set to read-only for this node, you cannot write data to its mount directory from the node code. Any write attempt will result in a permission error.
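The mount constraints described above (at most 5 datasets per object, and no conflicting mount paths on the same node) can be checked before you save a configuration. The following is a minimal sketch, not part of DataWorks: the function name find_mount_problems is hypothetical, and it assumes that two mount paths "conflict" when they are identical or when one is nested inside the other.

```python
from pathlib import PurePosixPath

MAX_DATASETS_PER_NODE = 5  # limit stated in the Limitations section


def find_mount_problems(mount_paths):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(mount_paths) > MAX_DATASETS_PER_NODE:
        problems.append(
            f"{len(mount_paths)} datasets configured, maximum is {MAX_DATASETS_PER_NODE}"
        )
    paths = [PurePosixPath(p) for p in mount_paths]
    for i, first in enumerate(paths):
        for second in paths[i + 1:]:
            # Assumption: paths conflict when they are equal or nested.
            if first == second or first in second.parents or second in first.parents:
                problems.append(f"mount paths conflict: {first} and {second}")
    return problems


print(find_mount_problems(["/mnt/data/dataset01", "/mnt/data/dataset02"]))
print(find_mount_problems(["/mnt/data", "/mnt/data/dataset01"]))
```

The first call reports no problems; the second reports one conflict because /mnt/data/dataset01 sits inside /mnt/data.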

2. Use the dataset in a node

This example uses a Shell node. After you attach an OSS dataset to the Shell node, you can manage OSS data in the Shell node code as if it were stored in local files. The following example uses the default ossfs 2.0 tool to write the file01.txt file to the mount path /mnt/data/dataset01 of the OSS dataset datasets-oss.

Sample code:

echo "Hello World" > /mnt/data/dataset01/file01.txt
ls -tl /mnt/data/dataset01

In the Debug Configurations panel on the right, open the Dataset tab and select the custom dataset datasets-oss/V1. Confirm that the message indicating successful resource group authorization appears, and ensure that the Read-only switch is off. After you run the node, the log shows that file01.txt is created (12 bytes, permissions: rwxrwxrwx), the exit code is 0, the elapsed time is 0.741s, and the status is FINISH.
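Because datasets are also supported in Python nodes, the same write-and-list step can be sketched in Python. This is an illustrative equivalent of the Shell sample, not code from DataWorks: the helper name write_and_list is hypothetical, and a temporary directory stands in for the mount path so that the sketch also runs outside a node.

```python
import os
import tempfile


def write_and_list(mount_path, file_name="file01.txt", text="Hello World\n"):
    """Write a text file under mount_path and return the directory listing."""
    target = os.path.join(mount_path, file_name)
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(text)
    return sorted(os.listdir(mount_path))


# In a node with the dataset attached, pass the configured mount path,
# for example "/mnt/data/dataset01". A temporary directory is used here
# only so that the sketch is runnable anywhere.
with tempfile.TemporaryDirectory() as demo_mount:
    print(write_and_list(demo_mount))
```

In a real Python node, call write_and_list("/mnt/data/dataset01") with the dataset configured in Debug Configuration.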

Note

If the run fails with the error message "Job Submit Failed! submit job failed directly! Caused by: execute task failed, exception: [103:ILLEGAL_TASK]:Task with dataset need 0.5cu at least!", the task has insufficient CUs. Increase the CU allocation for the resource group to at least 0.5.

3. Verify the data in OSS

After the code in step 2 (Use the dataset in a node) runs, the file is automatically written to the OSS storage path that corresponds to the dataset's mount path, where you can view it. In this example, the mount path /mnt/data/dataset01 of the OSS dataset datasets-oss maps to oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/.

In this path, you can find the written file file01.txt (0.012 KB, Standard storage).
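To work out where a file written under the mount path should appear in OSS, you can translate the local path by hand. The following sketch does the same with plain string handling; the function name to_oss_uri is hypothetical, and the constants are the values used in this example.

```python
from pathlib import PurePosixPath

# Values from this example; replace them with your own dataset settings.
MOUNT_PATH = "/mnt/data/dataset01"
OSS_PREFIX = "oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/"


def to_oss_uri(local_path, mount_path=MOUNT_PATH, oss_prefix=OSS_PREFIX):
    """Translate a file path under the mount path into its OSS URI."""
    # relative_to raises ValueError if local_path is outside the mount path.
    relative = PurePosixPath(local_path).relative_to(mount_path)
    return oss_prefix.rstrip("/") + "/" + relative.as_posix()


print(to_oss_uri("/mnt/data/dataset01/file01.txt"))
# prints oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/file01.txt
```

The sketch only computes the expected location. It does not contact OSS, so you still confirm the file in the OSS console or with an OSS client.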

Use datasets in a personal development environment

Once a dataset is defined, you can mount it to a personal development environment instance when you create or modify the instance. You can then access the dataset's data directly in the terminal or a Notebook within your personal directory.

Prerequisites: create a dataset

  1. Create a bucket or create a file system.

  2. Create a dataset.

    This example uses a NAS-based dataset. Create a NAS dataset named datasets-nas in the China (Shanghai) region, and mount the NAS path nas://****.cn-shanghai.nas.aliyuncs.com/mnt/dataset02/v1/ to /mnt/data/dataset02.

1. Configure a dataset for the personal environment

Create an instance of a personal development environment and select the existing NAS dataset datasets-nas.

On the configuration page, select datasets-nas from the Dataset drop-down list and specify its corresponding mount path.

Parameter

Description

Datasets

Specifies the dataset that the instance's code can access. Ensure the VPC selected for the personal development environment instance can connect to the NAS mount target.

In this example, select the NAS dataset datasets-nas created in DataWorks and select version V1.

Mount Path

The path that the instance's code uses to access the dataset.

In this example, mount the NAS dataset path nas://****.cn-shanghai.nas.aliyuncs.com/mnt/dataset02/v1/ to /mnt/data/dataset02.

Important

If you mount multiple datasets to the same instance of a personal development environment, their mount paths must not conflict.

Advanced Settings

This parameter is optional. You can use JSON to specify the configurations (the nasOptions parameter) for accessing an Apsara File Storage NAS (NAS) file system that uses the NFS protocol. The following code shows the default configuration. You can also refer to Manually mount an NFS file system to customize parameter values.

Important
  • Only NAS file systems that use the NFS protocol can be mounted.

  • For NAS, only one advanced configuration parameter is supported: nasOptions. To customize NAS mount parameters, set Advanced Settings to {"nasOptions":"<ParameterName1=ParameterValue>,<ParameterName2=ParameterValue>,..."}.

{"nasOptions":"vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"}

Read Only

By default, you can read from and write to the dataset in the instance. If the dataset is set to read-only for the instance, you cannot write data to its mount directory from code within the instance. Any write attempt will result in a permission error.
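Code that may run against a read-only mount can handle the failed write instead of crashing. The following is a minimal sketch under stated assumptions: the helper name try_write is hypothetical, and the exact error code raised on a read-only dataset (EACCES, EPERM, or EROFS) is an assumption that may vary with the mount tool.

```python
import errno
import os
import tempfile


def try_write(file_path, text):
    """Write text to file_path; return (ok, reason) instead of raising."""
    try:
        with open(file_path, "w", encoding="utf-8") as handle:
            handle.write(text)
    except OSError as error:
        # Assumption: a read-only mount surfaces as one of these codes.
        read_only = error.errno in (errno.EACCES, errno.EPERM, errno.EROFS)
        reason = "read-only or no permission" if read_only else "other I/O error"
        return False, f"{reason}: {error.strerror}"
    return True, "written"


# A temporary directory stands in for /mnt/data/dataset02 in this demo.
with tempfile.TemporaryDirectory() as demo_mount:
    print(try_write(os.path.join(demo_mount, "file02.txt"), "Hello World!"))
```

Returning a reason string keeps the caller free to log the failure or fall back to a writable location.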

2. Use the dataset in a Notebook

  1. At the top of the DataStudio page, switch to the personal development environment instance, and then create a Notebook for basic development.

  2. In the Notebook, add the following content.

    1. Write data to the specified path in the dataset.

      import os
      # Define the destination path and file name
      file_path = "/mnt/data/dataset02/file02.txt"
      # Make sure that the directory exists. If the directory does not exist, create it.
      os.makedirs(os.path.dirname(file_path),exist_ok=True)
      # Write the content
      content = "Hello World!"
      try:
          with open(file_path, "w", encoding="utf-8") as file:
              file.write(content)
          print(f"The file is successfully written to {file_path}")
      except Exception as e:
          print(f"Failed to write the file: {str(e)}")
    2. Read data from the specified path in the dataset.

      file_path = "/mnt/data/dataset02/file02.txt"
      with open(file_path, "r") as file:
          content = file.read()
      content
  3. Run the two Python code snippets separately.

    Note

    Before running the code, ensure you have selected the correct Python kernel. This example uses Python 3.11.9.

    First code snippet: Write data to the specified path in the dataset.

    import os
    # Define the destination path and file name
    file_path = "/mnt/data/dataset02/file02.txt"
    # Make sure that the directory exists. If the directory does not exist, create it.
    os.makedirs(os.path.dirname(file_path),exist_ok=True)
    # Write the content
    content = "Hello World!"
    try:
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(content)
        print(f"The file is successfully written to {file_path}")
    except Exception as e:
        print(f"Failed to write the file: {str(e)}")

    If the output is The file is successfully written to /mnt/data/dataset02/file02.txt, the write operation is successful.

    Second code snippet: Read data from the specified path in the dataset.

    file_path = "/mnt/data/dataset02/file02.txt"
    with open(file_path, "r") as file:
        content = file.read()
    content

    If the output is 'Hello World!', the read operation is successful.

3. Configure scheduling

On the right side of the Notebook node, click Scheduling Settings and add the dataset options. Use the same parameters as those configured for the personal development environment instance.

Advanced configuration examples

When you configure a dataset, you can use the advanced settings to customize the relevant parameters in JSON format. The following sections show examples for each access tool.

Mount OSS with ossfs 2.0

ossfs 2.0 is a client that mounts OSS for high-performance access. It delivers high sequential read/write throughput and fully utilizes OSS bandwidth. It is suitable for compute-intensive applications that require high sequential I/O performance, such as AI training and big data processing. These workload scenarios mainly involve sequential and random reads and sequential (append-only) writes, and do not require full POSIX semantics.

In the Advanced Settings field of Datasets, you can set advanced parameters. Separate multiple options with a comma (,). For instructions on how to use advanced parameters and for more configuration options, see ossfs 2.0 mount options. The following examples show several common scenarios:

  • Immutable data source: If none of the files that you read are modified during a task, you can set a long cache time to reduce the number of metadata requests. A typical scenario is reading a batch of existing files and generating a new batch of files after processing.

    {"mountOssType":"ossfs", "attr_timeout": "7200"}
  • Fast read and write operations: Use a short metadata cache time to balance cache efficiency and data timeliness.

    {"mountOssType":"ossfs", "attr_timeout": "3", "negative_timeout":"0"}
  • Consistent view for distributed tasks: By default, ossfs updates file data based on the metadata cache. Use the following configuration to achieve a synchronized view across multiple nodes.

    { "mountOssType":"ossfs","negative_timeout": "0", "close_to_open":"false"}
  • OOM caused by opening too many files: High task concurrency with a large number of simultaneously open files may cause out-of-memory (OOM) issues. Use the following configuration to alleviate memory pressure.

    {"mountOssType":"ossfs","readdirplus": "false", "inode_cache_eviction_threshold":"300000"}
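Writing these JSON strings by hand is error-prone, so you may prefer to generate them. The following sketch builds an Advanced Settings value for ossfs; the helper name ossfs_advanced_settings is hypothetical, and writing every option value as a string is an assumption based on the scenario examples in this section.

```python
import json


def ossfs_advanced_settings(**options):
    """Return an Advanced Settings JSON string for an ossfs mount."""
    settings = {"mountOssType": "ossfs"}
    for name, value in options.items():
        # Booleans become "true"/"false"; everything else is stringified.
        if isinstance(value, bool):
            settings[name] = str(value).lower()
        else:
            settings[name] = str(value)
    return json.dumps(settings)


print(ossfs_advanced_settings(attr_timeout=7200))
# prints {"mountOssType": "ossfs", "attr_timeout": "7200"}
```

Only pass option names that are listed in ossfs 2.0 mount options; the helper does not validate them.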

Mount OSS with ossfs 1.0

ossfs 1.0 mounts an OSS bucket as a local file system on a Linux system. Compared with ossfs 2.0, ossfs 1.0 provides more comprehensive support for file operations. If you encounter file operation incompatibilities with ossfs 2.0, try using ossfs 1.0 instead.

For more information about the parameters required for mounting with ossfs 1.0, see ossfs 1.0 mount options.

Mount OSS with JindoFuse

You can use the JindoFuse component to mount an OSS dataset to a specified path in a container. This tool is suitable for the following scenarios:

  • You want to read OSS data as if it were a local dataset, or the dataset is small enough to benefit from the local cache acceleration of JindoFuse.

  • You need to write data to OSS.

In the Advanced Settings field of Datasets, you can set advanced parameters. Separate multiple options with a comma (,). The following code is only an example. For parameter descriptions and more configuration options, see JindoFuse User Guide and Use JindoFuse to mount and access data.

Note

Currently, DataWorks supports only parameters in the key=value format.

{ 
  "mountOssType":"jindofuse",
  "fs.oss.download.thread.concurrency": "2 × number of CPU cores",
  "fs.oss.upload.thread.concurrency": "2 × number of CPU cores",
  "attr_timeout": 3,
  "entry_timeout": 0,
  "negative_timeout": 0
}
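The value "2 × number of CPU cores" in the example above is a placeholder that you replace with a concrete number. The following sketch computes it and emits the JSON; the helper name jindofuse_advanced_settings is hypothetical, and writing the concurrency values as strings is an assumption that mirrors the example. Note that os.cpu_count() reports the cores of the machine that runs the script, which may differ from the node's runtime, so pass cpu_cores explicitly when you know the target size.

```python
import json
import os


def jindofuse_advanced_settings(cpu_cores=None):
    """Return a JindoFuse Advanced Settings JSON string."""
    # Fall back to the local core count only when no value is given.
    cores = cpu_cores or os.cpu_count() or 1
    concurrency = str(2 * cores)
    return json.dumps(
        {
            "mountOssType": "jindofuse",
            "fs.oss.download.thread.concurrency": concurrency,
            "fs.oss.upload.thread.concurrency": concurrency,
            "attr_timeout": 3,
            "entry_timeout": 0,
            "negative_timeout": 0,
        },
        indent=2,
    )


print(jindofuse_advanced_settings(cpu_cores=4))
```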

Use a NAS dataset

For NAS-based datasets, you can specify configurations (using the nasOptions parameter) for accessing an NFS-based Apsara File Storage NAS (NAS) file system. The following code shows the default configuration. To customize parameter values, see Manually mount an NFS file system.

Important
  • Only NAS file systems that use the NFS protocol can be mounted.

  • For NAS, only one advanced configuration parameter is supported: nasOptions. To customize NAS mount parameters, set Advanced Settings to {"nasOptions":"<ParameterName1=ParameterValue>,<ParameterName2=ParameterValue>,..."}.

{"nasOptions":"vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"}
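If you need to change one or two NFS mount options, it can help to start from the default string and override individual values. The following sketch does that; the helper name nas_advanced_settings is hypothetical, and the defaults are copied from the configuration shown above. Check Manually mount an NFS file system before changing any value.

```python
import json

# Default options copied from the default configuration in this document.
DEFAULT_NAS_OPTIONS = {
    "vers": "3",
    "nolock": None,  # None marks a flag that takes no value
    "proto": "tcp",
    "rsize": "1048576",
    "wsize": "1048576",
    "hard": None,
    "timeo": "600",
    "retrans": "2",
    "noresvport": None,
}


def nas_advanced_settings(overrides=None):
    """Return the nasOptions Advanced Settings JSON string."""
    options = {**DEFAULT_NAS_OPTIONS, **(overrides or {})}
    parts = [
        name if value is None else f"{name}={value}"
        for name, value in options.items()
    ]
    return json.dumps({"nasOptions": ",".join(parts)}, separators=(",", ":"))


print(nas_advanced_settings())
```

Called without arguments, the helper reproduces the default configuration; nas_advanced_settings({"timeo": "300"}) changes only the timeo value.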
