Register OSS and NAS Datasets for ML Pipelines - DataWorks

DataWorks datasets let you register and version unstructured data -- images, documents, audio, and other files -- stored in OSS or NAS. Once registered, the data is mounted into your DataWorks development environment and accessible from Shell nodes, Python nodes, notebooks, and your personal development environment.

Use cases

ML training data: Register image or document collections stored in OSS, version them as training sets evolve, and mount them directly into notebooks for model development.
ETL landing zones: Point a dataset at a NAS folder where upstream systems deposit raw files, then process those files in Shell or Python nodes.
Unstructured data pipelines: Access audio, video, or PDF files through a consistent mount path across multiple DataWorks tasks.
Reproducible experiments: Create dataset versions to capture point-in-time snapshots of the dataset configuration. If a new version introduces issues, revert to a previous version without rebuilding the data pipeline.
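
As a rough illustration of the ETL landing zone use case, the following Python sketch scans a mounted folder for raw files that have not been handled yet. The function name `pending_files`, the suffix filter, and the `processed` set of relative paths are hypothetical and not part of DataWorks. The only assumption about DataWorks is that the dataset is reachable as an ordinary directory at its mount path.

```python
from pathlib import Path
from typing import Iterable, Iterator, Set


def pending_files(mount_path: str, suffixes: Iterable[str], processed: Set[str]) -> Iterator[Path]:
    """Yield files under mount_path whose suffix matches and that are not yet processed.

    `processed` holds POSIX-style paths relative to mount_path (hypothetical bookkeeping
    that you would persist yourself, for example in a table or a state file).
    """
    root = Path(mount_path)
    wanted = {s.lower() for s in suffixes}
    # Sort for a deterministic processing order across runs.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            rel = path.relative_to(root).as_posix()
            if rel not in processed:
                yield path
```

In a Python node, a call such as `pending_files("/mnt/data/", [".csv"], already_done)` would return only the files that still need processing, where `already_done` is whatever record of finished files your pipeline keeps.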

Prerequisites

Before you begin, make sure you have:

A DataWorks workspace
An OSS bucket or NAS file system in the same region as your workspace
(OSS) The required OSS bucket permissions
(NAS) A mount target configured with VPC connectivity to your DataWorks resource group

Storage type comparison

Datasets support two storage backends. Choose based on your access pattern and existing infrastructure.

Dimension	OSS	NAS
Storage type	Object storage (flat namespace)	POSIX-compliant file system
Best for	Large collections of immutable files (images, models, archives)	Workloads that require random read/write or shared file access
File system options	N/A	General-purpose NAS or Extreme NAS
Default mount path	`/mnt/data/`	`/mnt/data/`
Access requirements	OSS bucket permissions	VPC connectivity between NAS mount target and resource group

Note: The console also supports Data Lake Formation (DLF) as a storage type. For more information, see the DataWorks console.

Create a dataset

Log on to the DataWorks console. In the top navigation bar, select the desired region.
In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.
In the left-side navigation pane of the Data Map page, click Data Catalog. In the Directory List, click DataSet.
Find the workspace in which to create the dataset and click its name. The dataset list for the workspace appears.
Click Create Dataset and configure the settings for your chosen storage type.
Click Save to create the dataset.

OSS dataset configuration

Dataset configuration:

Setting	Description
Storage type	OSS
Content type	(Optional) The type of data being registered. Defaults to Common.

Import configuration:

Setting	Description
OSS path	The path of the OSS folder to mount. Make sure you have the required OSS bucket permissions.
Default mount path	The path used to access the data in DataWorks. Defaults to `/mnt/data/`. Change this value if needed.
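
If you keep OSS locations in scripts or configuration files, it can help to validate them before entering them in the console. The sketch below assumes the `oss://bucket/prefix` URI form used by OSS command-line tooling. The exact format the console expects for the OSS path setting is not stated here, so treat both the format and the helper name `split_oss_path` as assumptions.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_oss_path(oss_path: str) -> Tuple[str, str]:
    """Split an oss:// URI into (bucket, folder prefix).

    Assumes the form oss://<bucket>/<folder>/; this format is an assumption,
    not something confirmed by the DataWorks console.
    """
    parsed = urlparse(oss_path)
    if parsed.scheme != "oss" or not parsed.netloc:
        raise ValueError(f"Expected oss://<bucket>/<folder>/, got: {oss_path!r}")
    prefix = parsed.path.lstrip("/")
    # Normalize folder prefixes so they always end with a slash.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, prefix
```

For example, `split_oss_path("oss://my-bucket/train/images")` returns `("my-bucket", "train/images/")`, where `my-bucket` is a placeholder bucket name.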

NAS dataset configuration

Dataset configuration:

Setting	Description
Storage type	General-purpose NAS or Extreme NAS
Content type	(Optional) The type of data being registered. Defaults to Common.

Import configuration:

Setting	Description
File system	Select the NAS file system created in the current region under your Alibaba Cloud account.
File system mount target	Select a mount target to access the NAS file system. The VPC of the mount target must be connected to the VPC of the resource group. Use the same VPC for both to ensure connectivity, or see Network connectivity solutions for cross-VPC scenarios.
File system path	The path of the NAS folder to mount. Defaults to the root directory `/`. This path must exist in the NAS file system; otherwise, the dataset fails when used.
Default mount path	The path used to access NAS data from DataWorks. Defaults to `/mnt/data/`. Change this value if needed.
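
Because a dataset fails at use time when the configured path does not exist, a task can check the mount before doing any work. The following is a minimal sketch, assuming the dataset appears as a regular directory at the mount path. The helper name `check_mount` is hypothetical.

```python
import os
from pathlib import Path


def check_mount(mount_path: str = "/mnt/data/") -> Path:
    """Return the mount path as a Path, or raise if it is missing or unreadable."""
    root = Path(mount_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Mount path {root} is missing or not a directory")
    # Listing a directory needs both read and execute (search) permission.
    if not os.access(root, os.R_OK | os.X_OK):
        raise PermissionError(f"Mount path {root} is not readable")
    return root
```

Calling `check_mount()` at the start of a Shell-invoked script or Python node surfaces a clear error early instead of a failure partway through processing.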

Manage datasets

To manage an existing dataset, navigate to Data Catalog > DataSet, select the workspace, and click Details in the Actions column for the target dataset.

The dataset details page shows the Attribute Information and Dataset Version sections.

Create a version

Click Create Version in the upper-right corner. When creating a new version, customize the OSS Path or NAS file system configuration and set the Default Mount Path.

Versioning captures a point-in-time snapshot of your dataset configuration. This supports:

Reproducibility: Pin a specific version to a training job so results stay consistent.
Rollback: Revert to a previous version if the current one introduces data quality issues.
Auditability: Track which version was used in each pipeline run.
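
One way to support the auditability point above is to write a small manifest alongside each run that records which dataset version was mounted and a checksum of every file seen. This is a sketch only: the dataset name and version label are values you pass in yourself, since no API for querying them is described here, and the function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_name: str, version: str, mount_path: str) -> Dict[str, object]:
    """Collect the dataset name, version label, and per-file checksums under mount_path."""
    root = Path(mount_path)
    files = {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
    return {"dataset": dataset_name, "version": version, "mount_path": mount_path, "files": files}


def write_manifest(manifest: Dict[str, object], out_file: str) -> None:
    """Write the manifest as stable, human-readable JSON."""
    Path(out_file).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Comparing the manifests of two runs shows whether they read the same version and the same file contents. For very large datasets, hashing every file may be too slow, in which case recording only names and sizes is a lighter alternative.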

View dataset data (OSS only)

Click the View Data tab, then click View in OSS to open the storage path for the selected version in the OSS console.

Delete a version

In the Dataset Version section, select the version from the drop-down menu, then click Delete.

Delete a dataset

Click Delete in the upper-right corner of the dataset details page.

Important

Deleting a dataset or a dataset version does not delete the original files in OSS or NAS. However, the deleted dataset or version cannot be recovered from the DataWorks dataset feature. Proceed with caution.

Use a dataset

After creating a dataset, access it from Shell nodes, Python nodes, notebooks, or your personal development environment through the configured mount path (default: `/mnt/data/`).

For detailed instructions, see Use a dataset.
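
As a quick check that a mounted dataset contains what you expect, the following sketch counts files by extension under the mount path. It assumes the dataset is visible as a normal directory tree. The function name `count_by_extension` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(mount_path: str = "/mnt/data/") -> Counter:
    """Count regular files under mount_path, grouped by lowercase file extension."""
    counts: Counter = Counter()
    for path in Path(mount_path).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "<none>".
            counts[path.suffix.lower() or "<none>"] += 1
    return counts
```

Running `count_by_extension()` in a notebook cell gives a compact inventory of the dataset, for example how many image or audio files are present, before you start training or processing.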

Limitations

The dataset feature is currently in beta, so its functionality and stability may change.

Billing

The dataset feature itself is free of charge. However, the underlying storage incurs fees:

OSS: Storage and network access fees. See OSS billing.
NAS: Storage and network access fees. See NAS billing.