DataWorks datasets let you register and version unstructured data -- images, documents, audio, and other files -- stored in OSS or NAS. Once registered, the data is mounted into your DataWorks development environment and accessible from Shell nodes, Python nodes, notebooks, and your personal development environment.
Use cases
ML training data: Register image or document collections stored in OSS, version them as training sets evolve, and mount them directly into notebooks for model development.
ETL landing zones: Point a dataset at a NAS folder where upstream systems deposit raw files, then process those files in Shell or Python nodes.
Unstructured data pipelines: Access audio, video, or PDF files through a consistent mount path across multiple DataWorks tasks.
Reproducible experiments: Create dataset versions to capture point-in-time snapshots. If a new version introduces issues, revert to a previous version without rebuilding the data pipeline.
Prerequisites
Before you begin, make sure you have:
A DataWorks workspace
An OSS bucket or NAS file system in the same region as your workspace
(OSS) The required OSS bucket permissions
(NAS) A mount target configured with VPC connectivity to your DataWorks resource group
Storage type comparison
Datasets support two storage backends. Choose based on your access pattern and existing infrastructure.
| Dimension | OSS | NAS |
|---|---|---|
| Storage type | Object storage (flat namespace) | POSIX-compliant file system |
| Best for | Large collections of immutable files (images, models, archives) | Workloads that require random read/write or shared file access |
| File system options | N/A | General-purpose NAS or Extreme NAS |
| Default mount path | /mnt/data/ | /mnt/data/ |
| Access requirements | OSS bucket permissions | VPC connectivity between NAS mount target and resource group |
Note: The console also supports Data Lake Formation (DLF) as a storage type. For more information, see the DataWorks console.
Create a dataset
Log on to the DataWorks console. In the top navigation bar, select the desired region.
In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.
In the left navigation pane of the Data Map page, click Data Catalog. In the Directory List, click DataSet.
Find the workspace in which to create the dataset and click its name. The dataset list for the workspace appears.
Click Create Dataset and configure the settings for your chosen storage type.
Click Save to create the dataset.
OSS dataset configuration
Dataset configuration:
| Setting | Description |
|---|---|
| Storage type | OSS |
| Content type | (Optional) The type of data being registered. Defaults to Common. |
Import configuration:
| Setting | Description |
|---|---|
| OSS path | The path of the OSS folder to mount. Make sure you have the required OSS bucket permissions. |
| Default mount path | The path used to access the data in DataWorks. Defaults to /mnt/data/. Change this value if needed. |
NAS dataset configuration
Dataset configuration:
| Setting | Description |
|---|---|
| Storage type | General-purpose NAS or Extreme NAS |
| Content type | (Optional) The type of data being registered. Defaults to Common. |
Import configuration:
| Setting | Description |
|---|---|
| File system | Select the NAS file system created in the current region under your Alibaba Cloud account. |
| File system mount target | Select a mount target to access the NAS file system. The VPC of the mount target must be connected to the VPC of the resource group. Use the same VPC for both to ensure connectivity, or see Network connectivity solutions for cross-VPC scenarios. |
| File system path | The path of the NAS folder to mount. Defaults to the root directory /. This path must exist in the NAS file system; otherwise, the dataset fails when used. |
| Default mount path | The path used to access NAS data from DataWorks. Defaults to /mnt/data/. Change this value if needed. |
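Because a missing NAS folder only surfaces when the dataset is used, a task can check the mount before doing any real work. The following is a minimal sketch using only the Python standard library. The function name `check_mount` is hypothetical and not part of DataWorks, and the sketch assumes the dataset is mounted at the default path unless you pass a different one.

```python
import os
from typing import List


def check_mount(mount_path: str = "/mnt/data/") -> List[str]:
    """Return a list of problems found with the mount path (empty if none).

    Hypothetical helper: it only inspects the local file system view and
    does not call any DataWorks, OSS, or NAS API.
    """
    problems: List[str] = []

    # The mount path must exist and be a directory.
    if not os.path.isdir(mount_path):
        problems.append(f"{mount_path} does not exist or is not a directory")
        return problems

    # The current user needs read and traverse permission on the directory.
    if not os.access(mount_path, os.R_OK | os.X_OK):
        problems.append(f"{mount_path} is not readable by the current user")
        return problems

    # An empty directory often means the configured source folder is wrong.
    with os.scandir(mount_path) as entries:
        if next(entries, None) is None:
            problems.append(f"{mount_path} is empty; check the configured source path")

    return problems
```

Calling `check_mount()` at the start of a Python node and stopping when the returned list is not empty gives a clearer error than a failure deep inside the processing code.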
Manage datasets
To manage an existing dataset, navigate to Data Catalog > DataSet, select the workspace, and click Details in the Actions column for the target dataset.
The dataset details page shows the Attribute Information and Dataset Version sections.
Create a version
Click Create Version in the upper-right corner. When creating a new version, customize the OSS Path or NAS file system configuration and set the Default Mount Path.
Versioning captures a point-in-time snapshot of your dataset configuration. This supports:
Reproducibility: Pin a specific version to a training job so results stay consistent.
Rollback: Revert to a previous version if the current one introduces data quality issues.
Auditability: Track which version was used in each pipeline run.
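To support the auditability point above from inside a task, one option is to log a small record of what the task actually saw at the mount path. The sketch below is an illustration, not a DataWorks feature: `fingerprint_directory` and `build_run_record` are hypothetical helpers, and the dataset name and version label are values you supply yourself, because this page does not describe an API for querying the mounted version. The fingerprint hashes relative paths and file sizes only, not file contents, so it is cheap but will not detect a same-size content change.

```python
import hashlib
import os
from datetime import datetime, timezone


def fingerprint_directory(mount_path: str) -> str:
    """Hash relative paths and file sizes under mount_path into one hex digest."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(mount_path):
        dirnames.sort()  # sort in place so the traversal order is deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, mount_path)
            digest.update(f"{rel}\0{os.path.getsize(full)}\n".encode("utf-8"))
    return digest.hexdigest()


def build_run_record(dataset_name: str, version_label: str,
                     mount_path: str = "/mnt/data/") -> dict:
    """Build a JSON-serializable record to store alongside a pipeline run.

    dataset_name and version_label are caller-supplied labels (assumptions);
    they are not read from DataWorks.
    """
    return {
        "dataset": dataset_name,
        "version": version_label,
        "mount_path": mount_path,
        "fingerprint": fingerprint_directory(mount_path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Writing the result of `build_run_record(...)` to the task log with `json.dumps` makes it possible to compare two runs later and see whether they read the same set of files.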
View dataset data (OSS only)
Click the View Data tab, then click View in OSS to open the storage path for the selected version in the OSS console.
Delete a version
In the Dataset Version section, select the version from the drop-down menu, then click Delete.
Delete a dataset
Click Delete in the upper-right corner of the dataset details page.
Deleting a dataset or a dataset version does not delete the original files in OSS or NAS. However, the deleted dataset or version cannot be recovered from the DataWorks dataset feature. Proceed with caution.
Use a dataset
After creating a dataset, access it through the configured mount path (default: /mnt/data/) from Shell nodes, Python nodes, notebooks, and your personal development environment.
For detailed instructions, see Use a dataset.
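Because the dataset appears as an ordinary directory, code in a Python node or notebook can read it with standard file APIs. The following is a minimal sketch, assuming the dataset is mounted at the default path. The helper name `iter_dataset_files` and the extension filter are illustrative and not part of DataWorks.

```python
from pathlib import Path
from typing import Iterable, Iterator, Optional

# Default mount path from the dataset configuration; change it if you
# configured a custom mount path.
DEFAULT_MOUNT_PATH = "/mnt/data/"


def iter_dataset_files(mount_path: str = DEFAULT_MOUNT_PATH,
                       suffixes: Optional[Iterable[str]] = None) -> Iterator[Path]:
    """Yield files under the mount path in sorted order.

    suffixes is an optional collection of extensions such as [".jpg", ".png"];
    matching is case-insensitive. Hypothetical helper for illustration.
    """
    root = Path(mount_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Mount path not found or not a directory: {root}")

    wanted = {s.lower() for s in suffixes} if suffixes else None
    for path in sorted(root.rglob("*")):
        if path.is_file() and (wanted is None or path.suffix.lower() in wanted):
            yield path


# Example usage inside a Python node or notebook cell:
#   for image in iter_dataset_files(suffixes=[".jpg", ".png"]):
#       print(image, image.stat().st_size)
```

Keeping the mount path in one constant means a task only needs a single change if the dataset is later configured with a non-default mount path.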
Limitations
The dataset feature is currently in beta. Its features and stability may change.
Billing
The dataset feature itself is free of charge. However, the underlying storage incurs fees:
OSS: Storage and network access fees. See OSS billing.
NAS: Storage and network access fees. See NAS billing.