The dataset feature in DataWorks lets you manage unstructured data, such as images and documents, for use within DataWorks. This topic describes how to create and use datasets.
Background
When you develop data in DataWorks, you can use the dataset feature to read and write data stored in OSS and NAS. This feature supports the creation and management of datasets and their versions. Version management lets you track data versions and quickly revert to a previous version if a new one has issues. This helps ensure that your business operations run smoothly.
Precautions
The dataset feature is currently in beta. The final features and stability may vary.
Billing
The DataWorks dataset feature is free of charge. However, storing data in OSS or NAS incurs storage and network access fees. For more information, see OSS billing and NAS billing.
Create a dataset
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, click Go to Data Map.
In the navigation pane on the left of the Data Map page, click Data Catalog to open the Data Catalog page. In the Directory List, click Dataset Catalog.
Find the workspace in which you want to create a dataset and click its name. This action opens the dataset details page for the workspace, which displays all existing datasets. Click the Create Dataset button and follow the instructions to create a DataWorks dataset.
Storage class: OSS
Dataset configuration:
Configuration item
Description
Storage class
OSS
Content type
Select the type of data you are registering. This is optional. The default is General.
Import configuration:
Configuration item
Description
OSS path
Specify the path of the OSS folder to mount.
Note: Make sure you have the required OSS Bucket permissions.
Default mount path
Specify the default mount path for the OSS folder. You can use this path to access the data in DataWorks. The system default is /mnt/data/. You can change the mount path manually.
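Once an OSS dataset is mounted, its contents appear as ordinary files under the mount path. The following is a minimal sketch, assuming the default mount path /mnt/data/ and a Python environment in which the dataset is already mounted. The function name list_dataset_files is hypothetical and is not part of any DataWorks API; the code uses only the Python standard library.

```python
from pathlib import Path

# Assumption: the dataset uses the system default mount path.
DEFAULT_MOUNT_PATH = "/mnt/data/"


def list_dataset_files(mount_path=DEFAULT_MOUNT_PATH):
    """Return sorted (relative_path, size_in_bytes) pairs for files under mount_path."""
    root = Path(mount_path)
    if not root.is_dir():
        # The dataset is not mounted here, or the mount path was changed.
        raise FileNotFoundError(f"Mount path not found: {mount_path}")
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )
```

Calling list_dataset_files() with no argument inspects the default path. If you changed the mount path when you created the dataset, pass that path instead.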
Storage class: NAS
Dataset configuration:
Configuration item
Description
Storage class
Select File Storage (General-purpose NAS file systems) or File Storage (Extreme NAS file systems)
Content type
Select the type of data you are registering. This is optional. The default is General.
Import configuration:
Configuration item
Description
File system
Select the destination NAS file system created in the current region under your Alibaba Cloud account.
File system mount target
Configure a mount target to access the NAS file system.
Important: Make sure the VPC of the mount target is connected to the VPC of the resource group:
Use the same VPC for the NAS mount target and the resource group to ensure network connectivity.
For other scenarios, see Network connectivity solutions to connect the VPC of the NAS mount target to the VPC configured for the resource group.
File system path
Specify the path of the NAS folder to mount. The default is the root directory /. Make sure this path exists in the NAS file system. Otherwise, an error occurs when you use the dataset.
Default mount path
Specify the default mount path in the dataset for the NAS folder. You can then use this path to access the data in the NAS path from DataWorks. The system default is /mnt/data/. You can change the mount path manually.
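The network connectivity requirement for the NAS mount target can be checked with a plain TCP probe before you rely on the dataset. The following is a rough sketch, not an official diagnostic: it assumes the mount target is accessed over NFS on the standard TCP port 2049, and the function name can_reach is hypothetical. The result is meaningful only when the code runs inside the resource group whose VPC you want to verify.

```python
import socket

# Assumption: the NAS mount target is accessed over NFS (standard TCP port 2049).
NFS_PORT = 2049


def can_reach(host, port=NFS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Pass the domain name of your mount target as host. A False result suggests that the two VPCs are not connected, or that the mount target uses a different protocol or port.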
Manage datasets
In , navigate to the dataset list of the destination workspace. In the Operation column of the dataset that you want to manage, click Details. This action opens the dataset details page. On this page, you can view the Overview and Dataset Version information and perform the following operations:
Create Version: Click the Create Version button in the upper-right corner to open the version creation page. When you create a new version, you can customize the OSS Path or NAS File System Configuration and set the Default Mount Path.
Delete Dataset: Click the Delete button in the upper-right corner of the dataset details page to delete the dataset.
View Dataset Data: This operation is supported only for Object Storage Service (OSS) datasets. In the Dataset Version section, select the desired version from the drop-down menu next to the title. Then, click View In OSS. You will be redirected to the storage path for that version in the OSS console.
Delete Version: In the Dataset Version section, select the desired version from the drop-down menu next to the title. Then, click the Delete button.
Deleting a dataset or a dataset version does not delete the original files. However, the deleted dataset or version cannot be recovered from the DataWorks dataset feature. Proceed with caution.
Use a dataset
You can use the datasets that you create in Data Studio, for example in Shell nodes, Python nodes, and Notebook development, and in your personal development environment.
For more information, see Use a dataset.
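Because a dataset is exposed through a mount path, code in a Python node or notebook can work with it through normal file operations. The following is a minimal sketch, assuming the dataset is mounted with write access in the environment where the code runs. The function name save_and_reload and the file name in the usage note are hypothetical.

```python
from pathlib import Path


def save_and_reload(mount_path, relative_name, text):
    """Write text to a file under mount_path, then read it back."""
    target = Path(mount_path) / relative_name
    # Create intermediate folders if they do not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target.read_text(encoding="utf-8")
```

For example, save_and_reload("/mnt/data/", "results/summary.txt", "done") writes a small file under the default mount path and returns its content. If the mount is read-only, the write raises an OSError.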