Fluid supports sub-datasets

What is Fluid?

Running AI and big-data workloads on the cloud with a cloud-native architecture brings the benefit of elastic compute resources, but it also introduces challenges: data-access latency caused by the separation of compute and storage, and the high bandwidth cost of pulling data remotely. In GPU deep-learning training scenarios in particular, repeatedly reading large volumes of training data over the network severely reduces GPU utilization.

On the other hand, Kubernetes provides only a standard interface (CSI, Container Storage Interface) for attaching and managing heterogeneous storage services; it does not define how applications use and manage data inside a container cluster. When running training jobs, data scientists need to manage dataset versions, control access permissions, preprocess datasets, and accelerate reads from heterogeneous storage. Kubernetes offers no standard solution for these needs, which remains one of the important capabilities missing from the cloud-native container community.

Fluid abstracts the "process of using data in computing tasks", proposes the concept of an elastic Dataset, and implements it as a first-class citizen in Kubernetes. Around the elastic Dataset, Fluid builds a data orchestration and acceleration system that provides dataset management (CRUD operations), permission control, and access acceleration.

There are two core concepts in Fluid: Dataset and Runtime.

• Dataset refers to a data set: a collection of logically related data that is consumed by compute engines such as Spark, TensorFlow, and PyTorch.
• Runtime refers to the system that provides distributed caching. Fluid currently supports the JuiceFS, Alluxio, JindoFS, and GooseFS runtime types. Alluxio and JindoFS are typical distributed cache engines, while JuiceFS is a distributed file system with distributed caching capability. These cache systems use the storage resources on Kubernetes cluster nodes (such as memory and disk) as a compute-side cache for remote storage systems.

Why does Fluid need to support subdatasets (subDataset)?

In Fluid's earliest mode, a Dataset has exclusive use of a Runtime by default, which can be understood as a dedicated cache cluster accelerating a single Dataset. The cache can be tuned to the characteristics of that dataset, such as single-file size, file count, and number of clients, and because each dataset gets its own cache system, this mode delivers the best performance and stability with no interference between datasets. Its drawbacks are wasted hardware resources, since a separate cache system must be deployed for every dataset, and operational complexity, since multiple cache runtimes must be managed. This mode is essentially a single-tenant architecture, suitable for scenarios with strict requirements on data-access throughput and latency.

Of course, as adoption of Fluid has deepened, new requirements have emerged. Common requests from the community include:

1. Allow a dataset cache to be accessed across namespaces
2. Allow users to access only a specific subdirectory of a dataset

JuiceFS users in particular tend to create a Dataset pointing at the root directory of JuiceFS, then assign different subdirectories to different data-scientist teams as separate datasets, with each team's dataset invisible to the others. They also want permissions tightened on sub-datasets: for example, the root dataset is read-write while a sub-dataset is restricted to read-only.
This article uses AlluxioRuntime as an example to demonstrate how Fluid supports sub-datasets.

Example of use:

Imagine that an administrator creates a dataset spark in the Kubernetes namespace spark, containing three Spark releases. The goal is that team A can see only the dataset spark-3.1.3, while team B can see only the dataset spark-3.3.1.

This allows different teams to access different sub-datasets without visibility into each other's data.
Phase 1: The administrator creates a dataset pointing to the root directory

1. Before running this example, install Fluid by following the installation documentation and check that all Fluid components are running normally

2. Create namespace spark
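The namespace can be created directly with kubectl:

```shell
kubectl create namespace spark
```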

3. Create a Dataset and an AlluxioRuntime in the namespace spark, where the dataset is readable and writable
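A minimal sketch of the two resources, using Fluid's `data.fluid.io/v1alpha1` API. The `mountPoint` value, bucket name, and cache sizing below are placeholders for illustration; point the mount at whatever writable storage actually holds the three Spark folders:

```shell
kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
  namespace: spark
spec:
  mounts:
    # Placeholder: any under-storage containing spark-3.1.3/, spark-3.2.3/, spark-3.3.1/
    - mountPoint: s3://my-bucket/spark/
      name: spark
  accessModes:
    - ReadWriteMany
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: spark          # must match the Dataset name
  namespace: spark
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi     # illustrative cache size
        high: "0.95"
        low: "0.7"
EOF
```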

4. Check the status of the dataset
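The status can be inspected with kubectl; a dataset that has successfully bound to its runtime eventually reports the Bound phase:

```shell
# Check the Dataset phase and cache statistics
kubectl get dataset spark -n spark
# Check that the runtime workers are ready
kubectl get alluxioruntime spark -n spark
```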

5. Create a Pod in the namespace spark to access the dataset
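Fluid automatically creates a PV and a PVC named after the Dataset, so the Pod only needs to claim it. The Pod name, image, and mount path here are arbitrary choices for illustration:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: spark-demo
  namespace: spark
spec:
  containers:
    - name: demo
      image: nginx
      volumeMounts:
        - mountPath: /data   # the dataset's mounts appear under this path
          name: spark-vol
  volumes:
    - name: spark-vol
      persistentVolumeClaim:
        claimName: spark     # PVC created by Fluid, same name as the Dataset
EOF
```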

6. View the data that the application can access through the dataset: three folders are visible (spark-3.1.3, spark-3.2.3, and spark-3.3.1), and the mount permission is RWX (ReadWriteMany)
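Assuming the Pod from the previous step, the visible directories and the PVC access mode can be checked like this:

```shell
# List the dataset contents as seen by the application
kubectl exec -it spark-demo -n spark -- ls /data/spark
# Inspect the PVC; ReadWriteMany is abbreviated RWX in the ACCESS MODES column
kubectl get pvc spark -n spark
```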

Phase 2: The administrator creates a sub-dataset pointing to the directory spark-3.1.3

1. Create namespace spark-313

2. Under the spark-313 namespace, the administrator creates:
- A sub-dataset that references the dataset spark. Its mountPoint format is dataset://${namespace of the original dataset}/${name of the original dataset}/subdirectory; in this example it is dataset://spark/spark/spark-3.1.3
Note: a referencing dataset currently supports only a single mount, which must use the dataset:// form (dataset creation fails if dataset:// is mixed with other forms), and the dataset's permission remains read-write
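A sketch of the referencing Dataset, assuming a Fluid version that supports dataset:// mounts. Note that no AlluxioRuntime is declared here; the sub-dataset reuses the cache of the referenced dataset:

```shell
kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark-313
  namespace: spark-313
spec:
  mounts:
    # dataset://<namespace of original dataset>/<name of original dataset>/<subdirectory>
    - mountPoint: dataset://spark/spark/spark-3.1.3
EOF
```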

3. View dataset status
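Like any Dataset, its phase can be checked with kubectl:

```shell
kubectl get dataset spark-313 -n spark-313
```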

4. The user creates a Pod in the spark-313 namespace:
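The Pod claims the automatically created PVC, which again carries the Dataset's name. Pod name and image are illustrative:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: spark-313-demo
  namespace: spark-313
spec:
  containers:
    - name: demo
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: spark-vol
  volumes:
    - name: spark-vol
      persistentVolumeClaim:
        claimName: spark-313   # PVC created by Fluid for the sub-dataset
EOF
```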

5. Access the data in the spark-313 namespace: only the contents of the spark-3.1.3 folder are visible, and the PVC permission is RWX (ReadWriteMany)
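Verification mirrors Phase 1, assuming the Pod from the previous step:

```shell
# Only the spark-3.1.3 contents should be listed
kubectl exec -it spark-313-demo -n spark-313 -- ls /data
kubectl get pvc spark-313 -n spark-313
```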

Phase 3: The administrator creates a sub-dataset pointing to the directory spark-3.3.1

1. Create namespace spark-331

2. Under the spark-331 namespace, the administrator creates:
- A sub-dataset that references the dataset spark. Its mountPoint format is dataset://${namespace of the original dataset}/${name of the original dataset}/subdirectory; in this example it is dataset://spark/spark/spark-3.3.1
Note: a referencing dataset currently supports only a single mount, which must use the dataset:// form (dataset creation fails if dataset:// is mixed with other forms); here the access permission is tightened to read-only (ReadOnlyMany)
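This variant differs from Phase 2 only in the mount path and the accessModes field (a sketch, under the same assumptions):

```shell
kubectl apply -f - <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark-331
  namespace: spark-331
spec:
  mounts:
    - mountPoint: dataset://spark/spark/spark-3.3.1
  accessModes:
    - ReadOnlyMany    # tighten the sub-dataset to read-only
EOF
```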

3. The user creates a Pod in the spark-331 namespace:


4. Access the data in the spark-331 namespace: only the contents of the spark-3.3.1 folder are visible, and the PVC permission is ROX (ReadOnlyMany)
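Assuming a Pod in spark-331 that mounts the PVC the same way as in the earlier phases (hypothetical name spark-331-demo):

```shell
kubectl exec -it spark-331-demo -n spark-331 -- ls /data
# ReadOnlyMany is abbreviated ROX in the ACCESS MODES column
kubectl get pvc spark-331 -n spark-331
```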

Summary:

In this example, we demonstrated the sub-dataset capability: using a subdirectory of a Dataset as a Dataset in its own right. This makes it possible to share the same dataset with different data scientists under different partitioning strategies.
