Datasets let you mount Object Storage Service (OSS) or Apsara File Storage NAS (NAS) storage to DataWorks nodes and personal development environments as local file system paths — so you can read and write data using standard file I/O, without handling storage credentials or mount scripts in your code.
Use cases
- AI training and data processing: Mount large OSS datasets to Shell or Python nodes and read them at high sequential throughput using ossfs 2.0.
- Notebook-based data exploration: Access NAS or OSS data directly from a Notebook in your personal development environment, using standard Python file APIs.
- Version-controlled data pipelines: Use dataset versioning to track data across pipeline runs, switch to an earlier version when a new version causes issues, and ensure reproducible results.
- Shared raw data landing zones: Mount a read-only dataset to prevent accidental overwrites when multiple nodes consume the same source data.
Limitations
- Datasets are supported only in the new version of DataStudio.
- Resource group: Data development nodes can access datasets only through Serverless resource groups.
- Supported objects: Datasets work with Shell nodes, Python nodes, Notebooks, and personal development environments. You can mount up to 5 datasets per object.
- Storage: Datasets support OSS buckets and Apsara File Storage NAS file systems that use the NFS protocol.
- Permissions: If a dataset mount target is read-only, any attempt to modify or delete files in the dataset returns a permission error.
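Node code can handle the read-only case explicitly by catching the OSError raised on a rejected write. A minimal sketch, assuming a hypothetical mount path; the exact errno depends on how the mount reports the restriction:

```python
import errno


def try_write(path: str, content: str) -> bool:
    """Write content to path; return False if the mount rejects writes."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
        return True
    except OSError as e:
        # Read-only mounts typically surface EROFS, EACCES, or EPERM.
        if e.errno in (errno.EROFS, errno.EACCES, errno.EPERM):
            print(f"Dataset mount is read-only: {e}")
            return False
        raise


# Hypothetical call against a dataset mount path:
# try_write("/mnt/data/dataset01/file01.txt", "Hello World")
```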
Use datasets in nodes
This section walks through mounting an OSS dataset to a Shell node. The OSS path oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/ is mounted to /mnt/data/dataset01, and data is written to the mount path from node code.
Prerequisites
Before you begin, make sure that you have:
- An OSS bucket or NAS file system. Create a bucket or create a file system.
- A dataset created in DataWorks. Create a dataset. This example uses an OSS dataset named datasets-oss, created in the China (Shanghai) region, with the bucket path oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/ mounted to /mnt/data/dataset01.
Step 1: Configure the dataset for a node
Open a Shell or Python node and add the datasets-oss dataset under Debug Configurations.
- Before you publish, add and sync the dataset under Scheduling Configuration.
- Allocate at least 0.5 computing units (CUs) to the node to use a dataset.
| Parameter | Description |
|---|---|
| Dataset | The dataset to access from the node's code. For an OSS dataset, grant the DataWorks Serverless resource group permission to access the OSS bucket the first time you read data. For a NAS dataset, make sure the virtual private cloud (VPC) of the resource group is connected to the VPC of the NAS mount target. See Overview of network connection solutions. In this example, select the datasets-oss dataset and the V1 version. |
| Mount Path | The path the node code uses to access the dataset. Pre-filled from the Default Mount Path in the dataset definition. If multiple datasets are mounted to the same node, each must have a unique mount path. |
| Advanced Configuration | Optional. Specify the OSS access tool and parameters, or NAS mount parameters, in JSON format. For OSS datasets, ossfs 2.0 is used by default: {"mountOssType":"ossfs", "upload_concurrency":64}. For NAS datasets, the default is: {"nasOptions":"vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"}. See Advanced configuration examples. |
| Read-only | By default, the node can read and write the dataset. Enable Read-only to prevent write operations; any write attempt returns a permission error. |
Step 2: Write data from the node
After attaching the dataset, interact with OSS data the same way you interact with local files. This example uses the default ossfs 2.0 tool to write file01.txt to the mount path /mnt/data/dataset01.
```shell
echo "Hello World" > /mnt/data/dataset01/file01.txt
ls -tl /mnt/data/dataset01
```
If the error `Job Submit Failed! submit job failed directly! Caused by: execute task failed, exception: [103:ILLEGAL_TASK]:Task with dataset need 0.5cu at least!` appears, the node has insufficient CUs. Increase the resource group CU allocation to at least 0.5 CUs.
Step 3: Verify data in OSS
After the node runs, the file is automatically written to the OSS path that corresponds to the dataset mount path. In this example, /mnt/data/dataset01 maps to oss://datasets-oss.oss-cn-shanghai.aliyuncs.com/dataset01/v1/. Navigate to that OSS path to confirm the file was written.
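Because the mount path mirrors the OSS path, you can also verify the write from node code by reading the file back through the mount. A minimal sketch; the path and expected content come from this example:

```python
def verify_file(path: str, expected: str) -> bool:
    """Read a file back through the dataset mount and compare its content."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read().strip() == expected


# In this example, the shell step wrote "Hello World" to the mount:
# verify_file("/mnt/data/dataset01/file01.txt", "Hello World")
```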
Use datasets in a personal development environment
Mount a dataset to a personal development environment instance to access NAS or OSS data directly from the terminal or a Notebook in your personal folder.
Prerequisites
Before you begin, make sure that you have:
- An OSS bucket or NAS file system. Create a bucket or create a file system.
- A dataset created in DataWorks. Create a dataset. This example uses a NAS dataset named datasets-nas, created in the China (Shanghai) region, with the path nas://****.cn-shanghai.nas.aliyuncs.com/mnt/dataset02/v1/ mounted to /mnt/data/dataset02.
Step 1: Configure the dataset for the personal development environment
Create a personal development environment instance and select the datasets-nas NAS dataset.
| Parameter | Description |
|---|---|
| Dataset | The dataset to access from this environment. Make sure the VPC selected for the instance can connect to the NAS mount target. In this example, select the datasets-nas NAS dataset and the V1 version. |
| Mount Path | The path to access the dataset in code. In this example, nas://****.cn-shanghai.nas.aliyuncs.com/mnt/dataset02/v1/ is mounted to /mnt/data/dataset02. If multiple datasets are mounted to the same instance, each must have a unique mount path. |
| Advanced Configuration | Optional. Specify NAS file system access parameters (nasOptions) in JSON format. Default: {"nasOptions":"vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"}. To customize, set {"nasOptions":"<ParameterName1=ParameterValue>,<ParameterName2=ParameterValue>,..."}. See Manually mount an NFS file system for parameter details. Only NAS file systems using the NFS protocol can be mounted, and nasOptions is the only supported advanced configuration parameter for NAS. |
| Read-only | By default, the instance can read and write the dataset. Enable Read-only to prevent write operations; any write attempt returns a permission error. |
Step 2: Read and write the dataset in a Notebook
- At the top of the Data Development page, switch to your personal development environment instance and create a Notebook.
- Write data to the dataset:

  ```python
  import os

  # Define the destination path and file name.
  file_path = "/mnt/data/dataset02/file02.txt"

  # Create the folder if it doesn't exist.
  os.makedirs(os.path.dirname(file_path), exist_ok=True)

  # Write the content.
  content = "Hello World!"
  try:
      with open(file_path, "w", encoding="utf-8") as file:
          file.write(content)
      print(f"The file has been successfully written to: {file_path}")
  except Exception as e:
      print(f"Write failed: {str(e)}")
  ```

- Read data from the dataset:

  ```python
  file_path = "/mnt/data/dataset02/file02.txt"
  with open(file_path, "r") as file:
      content = file.read()
  content
  ```

- Run the two code blocks separately.

Note: Confirm the Python kernel when prompted. This example uses Python 3.11.9.

Step 3: Configure scheduling
On the right side of the Notebook node, click Scheduling Settings and add the dataset options. The parameters must match the dataset settings you configured when creating the personal development environment instance.
Advanced configuration examples
Both node datasets and personal development environment datasets support advanced configurations in JSON format:
- Node datasets: Specify the OSS access tool and parameters, or NAS mount parameters.
- Personal development environment datasets: Specify NAS mount parameters only.
Use ossfs 2.0 (default for OSS)
ossfs 2.0 provides high-performance sequential read and write access to OSS by mounting. It is the default tool for OSS datasets and suits AI training, data processing, and other workloads with sequential I/O patterns.
Set advanced parameters in the Advanced Configuration field, in JSON format. Separate multiple options with commas. For the full parameter list, see ossfs 2.0 mount options.
Stable data source — If the files being read are not modified during the task, set a long cache time to reduce metadata requests. Typical use case: read a batch of existing files and produce new output files.
{"mountOssType":"ossfs", "attr_timeout": "7200"}
Fast read and write — Use a short metadata cache time to balance cache efficiency with data freshness.
{"mountOssType":"ossfs", "attr_timeout": "3", "negative_timeout":"0"}
Consistent view across distributed nodes — By default, ossfs updates file data based on the metadata cache. Use this configuration to get a synchronized view across multiple nodes.
{"mountOssType":"ossfs", "negative_timeout": "0", "close_to_open":"false"}
High-concurrency tasks with OOM risk — If many files are opened concurrently and cause out-of-memory (OOM) errors, reduce memory pressure with this configuration.
{"mountOssType":"ossfs", "readdirplus": "false", "inode_cache_eviction_threshold":"300000"}
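Because ossfs 2.0 is optimized for sequential I/O, node code should stream large files in sizable chunks rather than issuing many small random reads. A minimal sketch under that assumption; the 8 MiB chunk size and the mount path in the comment are illustrative:

```python
def stream_file(path: str, chunk_size: int = 8 * 1024 * 1024):
    """Yield a file's content in large sequential chunks (ossfs-friendly I/O)."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Illustrative use against a dataset mount path:
# total_bytes = sum(len(c) for c in stream_file("/mnt/data/dataset01/train.bin"))
```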
Use ossfs 1.0
ossfs 1.0 mounts an OSS bucket as a local file system on Linux and provides broader file operation support than ossfs 2.0. If you encounter file operation compatibility issues with ossfs 2.0, switch to ossfs 1.0.
For mount parameters, see ossfs 1.0 mount options.
Use JindoFuse
JindoFuse mounts an OSS dataset to a specified path in a container and is suited for:
-
Reading OSS data as if it were a local dataset, especially when the dataset is small enough to benefit from JindoFuse's local cache.
-
Writing data to OSS.
Set advanced parameters in the Advanced Configuration field, in JSON format. Separate multiple configurations with commas. DataWorks supports only parameters in key=value format.
```json
{
    "mountOssType":"jindofuse",
    "fs.oss.download.thread.concurrency": "2 × number of CPU cores",
    "fs.oss.upload.thread.concurrency": "2 × number of CPU cores",
    "attr_timeout": 3,
    "entry_timeout": 0,
    "negative_timeout": 0
}
```
For parameter descriptions and additional options, see JindoFuse user guide and Using JindoFuse to mount and access data.
Use a NAS dataset
For NAS datasets, specify the NFS mount parameters using the nasOptions parameter.
- Only NAS file systems using the NFS protocol can be mounted.
- nasOptions is the only supported advanced configuration parameter for NAS. To customize, set {"nasOptions":"<ParameterName1=ParameterValue>,<ParameterName2=ParameterValue>,..."}.
Default configuration:
{"nasOptions":"vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"}
To customize parameter values, see Manually mount an NFS file system.
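For example, a customized configuration that halves the NFS read and write buffer sizes might look as follows. The 524288 values are illustrative, not recommendations; check the mount parameter reference before changing them:

```json
{"nasOptions":"vers=3,nolock,proto=tcp,rsize=524288,wsize=524288,hard,timeo=600,retrans=2,noresvport"}
```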