DataWorks: Manage datasets

Last Updated: Feb 28, 2026

DataWorks datasets let you register and version unstructured data (images, documents, audio, and other files) stored in OSS or NAS. Once registered, the data is mounted into your DataWorks development environment and is accessible from Shell nodes, Python nodes, notebooks, and your personal development environment.

Use cases

  • ML training data: Register image or document collections stored in OSS, version them as training sets evolve, and mount them directly into notebooks for model development.

  • ETL landing zones: Point a dataset at a NAS folder where upstream systems deposit raw files, then process those files in Shell or Python nodes.

  • Unstructured data pipelines: Access audio, video, or PDF files through a consistent mount path across multiple DataWorks tasks.

  • Reproducible experiments: Create dataset versions to capture point-in-time snapshots. If a new version introduces issues, revert to a previous version without rebuilding the data pipeline.
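The ETL landing zone use case can be sketched as a small helper that picks up only the files that arrived since the last run. This is a minimal sketch, not a DataWorks API: the function name `new_files_since` and the idea of a caller-stored checkpoint timestamp are assumptions for illustration, and it assumes the mounted storage reports meaningful modification times.

```python
import os
import tempfile
from pathlib import Path


def new_files_since(root, since_ts):
    """Return files under root modified after since_ts (epoch seconds), oldest first."""
    candidates = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            mtime = path.stat().st_mtime
            if mtime > since_ts:
                candidates.append((mtime, str(path), path))
    return [path for _, _, path in sorted(candidates)]


if __name__ == "__main__":
    # A temporary directory stands in for the mounted landing folder.
    with tempfile.TemporaryDirectory() as landing:
        old_file = Path(landing) / "old.csv"
        new_file = Path(landing) / "new.csv"
        old_file.write_text("id\n1\n")
        new_file.write_text("id\n2\n")
        os.utime(old_file, (1_000, 1_000))  # pretend it arrived earlier
        os.utime(new_file, (2_000, 2_000))
        for path in new_files_since(landing, since_ts=1_500):
            print(path.name)
```

In a real node, `root` would be the dataset's mount path (for example the default /mnt/data/), and the checkpoint would come from wherever the pipeline keeps its state.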

Prerequisites

Before you begin, make sure you have:

  • A DataWorks workspace

  • An OSS bucket or NAS file system in the same region as your workspace

  • (OSS) The required OSS bucket permissions

  • (NAS) A mount target configured with VPC connectivity to your DataWorks resource group

Storage type comparison

Datasets support two storage backends. Choose based on your access pattern and existing infrastructure.

| Dimension | OSS | NAS |
| --- | --- | --- |
| Storage type | Object storage (flat namespace) | POSIX-compliant file system |
| Best for | Large collections of immutable files (images, models, archives) | Workloads that require random read/write or shared file access |
| File system options | N/A | General-purpose NAS or Extreme NAS |
| Default mount path | /mnt/data/ | /mnt/data/ |
| Network requirements | OSS bucket permissions | VPC connectivity between NAS mount target and resource group |

Note: The console also supports Data Lake Formation (DLF) as a storage type. For more information, see the DataWorks console.

Create a dataset

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region.

  2. In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.

  3. In the left-side navigation pane of the Data Map page, click Data Catalog. In the Directory List, click DataSet.

  4. Find the workspace in which to create the dataset and click its name. The dataset list for the workspace appears.

  5. Click Create Dataset and configure the settings for your chosen storage type.

  6. Click Save to create the dataset.

OSS dataset configuration

Dataset configuration:

| Setting | Description |
| --- | --- |
| Storage type | OSS |
| Content type | (Optional) The type of data being registered. Defaults to Common. |

Import configuration:

| Setting | Description |
| --- | --- |
| OSS path | The path of the OSS folder to mount. Make sure you have the required OSS bucket permissions. |
| Default mount path | The path used to access the data in DataWorks. Defaults to /mnt/data/. Change this value if needed. |

NAS dataset configuration

Dataset configuration:

| Setting | Description |
| --- | --- |
| Storage type | General-purpose NAS or Extreme NAS |
| Content type | (Optional) The type of data being registered. Defaults to Common. |

Import configuration:

| Setting | Description |
| --- | --- |
| File system | Select the NAS file system created in the current region under your Alibaba Cloud account. |
| File system mount target | Select a mount target to access the NAS file system. The VPC of the mount target must be connected to the VPC of the resource group. Use the same VPC for both to ensure connectivity, or see Network connectivity solutions for cross-VPC scenarios. |
| File system path | The path of the NAS folder to mount. Defaults to the root directory /. This path must exist in the NAS file system; otherwise, the dataset fails when used. |
| Default mount path | The path used to access NAS data from DataWorks. Defaults to /mnt/data/. Change this value if needed. |
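Because a dataset fails at use time when the configured path is missing, a task may want to verify the mount before doing any work. The following is a minimal sketch of such a preflight check using only the Python standard library. The helper name `check_mount` is hypothetical, and the sketch assumes the dataset appears as an ordinary directory at the mount path inside the node.

```python
import os
from pathlib import Path

# Default mount path from the dataset settings; change it if you configured another.
DEFAULT_MOUNT_PATH = "/mnt/data/"


def check_mount(path=DEFAULT_MOUNT_PATH):
    """Return a list of problems found with the mount path (empty if none)."""
    target = Path(path)
    problems = []
    if not target.exists():
        problems.append(f"{target} does not exist")
    elif not target.is_dir():
        problems.append(f"{target} is not a directory")
    elif not os.access(target, os.R_OK | os.X_OK):
        problems.append(f"{target} is not readable")
    return problems


if __name__ == "__main__":
    issues = check_mount()
    if issues:
        for issue in issues:
            print("mount check failed:", issue)
    else:
        print("mount path looks usable")
```

Running a check like this at the start of a task makes a misconfigured path or missing VPC connectivity show up as a clear message rather than as a file-not-found error later in the job.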

Manage datasets

To manage an existing dataset, navigate to Data Catalog > DataSet, select the workspace, and click Details in the Actions column for the target dataset.

The dataset details page shows the Attribute Information and Dataset Version sections.

Create a version

Click Create Version in the upper-right corner. When creating a new version, customize the OSS Path or NAS file system configuration and set the Default Mount Path.

Versioning captures a point-in-time snapshot of your dataset configuration. This supports:

  • Reproducibility: Pin a specific version to a training job so results stay consistent.

  • Rollback: Revert to a previous version if the current one introduces data quality issues.

  • Auditability: Track which version was used in each pipeline run.
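For auditability, a pipeline can also record which version it ran against together with a fingerprint of the files it actually saw. The sketch below is an assumption-laden illustration, not a DataWorks feature: `dataset_fingerprint` and `run_record` are hypothetical helpers, and the version label is whatever string the caller chooses to pass in.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root):
    """Hash relative paths and file contents under root into one SHA-256 hex digest."""
    root = Path(root)
    digest = hashlib.sha256()
    relative_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    for rel in relative_paths:
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        with (root / rel).open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()


def run_record(version_label, root):
    """Build a small audit record for one pipeline run (label is caller-supplied)."""
    return {
        "dataset_version": version_label,
        "fingerprint": dataset_fingerprint(root),
    }
```

Hashing every file reads the whole dataset, so for large collections it may be more practical to fingerprint only file names and sizes.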

View dataset data (OSS only)

Click the View Data tab, then click View in OSS to open the storage path for the selected version in the OSS console.

Delete a version

In the Dataset Version section, select the version from the drop-down menu, then click Delete.

Delete a dataset

Click Delete in the upper-right corner of the dataset details page.

Important

Deleting a dataset or a dataset version does not delete the original files in OSS or NAS. However, the deleted dataset or version cannot be recovered from the DataWorks dataset feature. Proceed with caution.

Use a dataset

After creating a dataset, access it through the configured mount path (default: /mnt/data/) from Shell nodes, Python nodes, notebooks, and your personal development environment.

For detailed instructions, see Use a dataset.
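As a quick illustration of reading through the mount path from Python, the sketch below summarizes what a mounted dataset contains by file extension. It assumes the dataset is visible as a regular directory tree at the mount path. The function name `count_by_suffix` is hypothetical, and nothing here is specific to a DataWorks API.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(root="/mnt/data/"):
    """Count files under root by lowercased extension; '(none)' for no extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return dict(counts)


if __name__ == "__main__":
    # Prints an empty summary if the mount path is missing or empty.
    for suffix, total in sorted(count_by_suffix().items()):
        print(f"{suffix}: {total}")
```

A summary like this is a cheap sanity check that the expected version is mounted before a training or processing step starts.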

Limitations

The dataset feature is currently in beta. Its features and stability may change before the final release.

Billing

The dataset feature itself is free of charge. However, the underlying storage incurs fees: