Sharing data sets across Kubernetes Namespaces

What is Fluid?

Running AI, big data, and similar workloads on the cloud with a cloud-native architecture brings the benefits of elastic compute resources, but it also brings challenges: data access latency caused by the separation of compute and storage, and the high bandwidth overhead of pulling data from remote storage. In GPU deep learning training scenarios in particular, repeatedly reading large volumes of remote training data can seriously degrade GPU utilization.

On the other hand, Kubernetes only provides a standard interface (CSI, the Container Storage Interface) for accessing and managing heterogeneous storage services; it does not define how applications use and manage data inside a container cluster. When running training jobs, data scientists need to manage dataset versions, control access permissions, preprocess datasets, accelerate reads from heterogeneous storage, and so on. Kubernetes offers no standard solution for these needs, which is one of the important capabilities still missing from the cloud-native container ecosystem.

Fluid abstracts the process by which computing tasks use data, introduces the concept of the elastic Dataset, and implements it in Kubernetes as a first-class citizen. Around the elastic Dataset, Fluid builds a data orchestration and acceleration system that provides Dataset management (CRUD operations), permission control, and access acceleration.

There are two core concepts in Fluid: Dataset and Runtime.

• A Dataset is a logically related collection of data that is consumed by compute engines such as Spark, TensorFlow, and PyTorch.
• A Runtime is the system that provides distributed caching. Fluid currently supports the JuiceFS, Alluxio, JindoFS, and GooseFS runtime types. Alluxio and JindoFS are typical distributed cache engines, while JuiceFS is a distributed file system with distributed caching capability. These cache systems use storage resources on the Kubernetes cluster nodes (such as memory and disk) as a compute-side cache for remote storage systems.
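To make the pairing concrete: a Dataset binds to a Runtime that has the same name in the same namespace. A minimal sketch, assuming an AlluxioRuntime and a placeholder mountPoint (substitute your own storage endpoint and sizing):

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: https://example.com/data/   # placeholder remote storage
      name: demo
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo            # same name and namespace binds it to the Dataset above
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM # cache data in node memory
        path: /dev/shm
        quota: 2Gi      # illustrative cache capacity
        high: "0.95"
        low: "0.7"
```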

Why does Fluid need to support sharing across namespaces?

In its earliest mode, Fluid by default had each Dataset monopolize a Runtime, which can be understood as a dedicated cache cluster accelerating a single Dataset. The runtime can be customized and tuned to the characteristics of that dataset, such as single-file size, file count, and number of clients, and a separate cache system is provisioned for it. This yields the best performance and stability, with no interference between datasets. The problems are that hardware resources are wasted, since a cache system must be deployed for each dataset, and that operations are complex, since multiple cache runtimes must be managed. This mode is essentially a single-tenant architecture, suited to scenarios with demanding requirements on data access throughput and latency.

Of course, as the use of Fluid deepened, different requirements emerged. For example, users create data-intensive jobs in several different Namespaces, and these jobs all access the same dataset: multiple data scientists share one dataset, each submitting jobs from their own independent Namespace. Re-deploying the cache system and re-warming the cache for every Namespace causes data redundancy and delays job startup.

At this point, to save resources and simplify operation and maintenance (accepting somewhat relaxed performance requirements), community users began asking for access to datasets across Namespaces. The cross-namespace requirement essentially calls for a multi-tenant architecture: the cluster administrator points a Runtime at the root directory of some storage, and multiple data scientists can create Datasets in different Namespaces that share the same Runtime. Going a step further, the administrator can configure subdirectories and different read and write permissions for data scientists in different Namespaces.

There is no silver bullet among architectural choices; they are all trade-offs. This article uses AlluxioRuntime as an example to show how to share a Runtime with Fluid.

Usage example

Imagine that user A warms up the dataset spark in the Kubernetes namespace development, and user B needs to access the same dataset spark from another namespace, production. Fluid lets user B access the data already cached in development without a second warm-up, which simplifies use: warm up once, and users in different namespaces all benefit.

1. Before running this example, refer to the installation documentation to complete the installation (at the time of writing, this feature is only available on the master branch). Then check that every Fluid component is running normally:
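For example, list the pods in Fluid's control-plane namespace (fluid-system is the default namespace used by the Fluid Helm chart):

```shell
# The controller managers and the CSI plugin pods should all be Running
kubectl get pod -n fluid-system
```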

2. Create namespace development
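```shell
kubectl create ns development
```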

3. Create a Dataset and an AlluxioRuntime in the namespace development:
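The manifests below are a sketch: the mountPoint (an HTTP mirror of Apache Spark releases) and the tiered-store sizing are illustrative values; substitute your own storage endpoint and quotas.

```yaml
# dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
spec:
  mounts:
    # Illustrative remote storage; replace with your own
    - mountPoint: https://mirrors.bit.edu.cn/apache/spark/
      name: spark
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: spark
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM   # cache in node memory
        path: /dev/shm
        quota: 4Gi        # illustrative cache capacity
        high: "0.95"
        low: "0.7"
```

```shell
kubectl create -f dataset.yaml -n development
```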

4. Check the status of the dataset
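```shell
# The dataset is usable once the PHASE column shows Bound
kubectl get dataset spark -n development
```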

5. Create a Pod in the namespace development to access the dataset
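Fluid exposes the dataset through a PersistentVolumeClaim with the same name as the Dataset, so the Pod mounts it like any ordinary volume (the nginx image is used here only as a convenient shell environment):

```yaml
# app.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: spark
  volumes:
    - name: spark
      persistentVolumeClaim:
        claimName: spark   # PVC created by Fluid, named after the Dataset
```

```shell
kubectl create -f app.yaml -n development
```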

6. View the data that the application can access through the dataset, and make a copy. Copying 1.4 GiB of data (7 files) took 3 minutes and 16 seconds:
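A sketch of the check, assuming the mount layout above (the dataset is mounted at /data, with the files under the mount named spark):

```shell
kubectl exec -it nginx -n development -- bash
# inside the container: list the dataset and time an uncached copy
ls -ltr /data/spark
time cp -R /data/spark/* /tmp/
```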

7. Load a specified subdirectory of the dataset with a DataLoad:
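A DataLoad sketch; the target path (a subdirectory relative to the dataset root) is illustrative:

```yaml
# dataload.yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: spark
spec:
  dataset:
    name: spark
    namespace: development
  target:
    - path: /spark/spark-3.3.1   # illustrative subdirectory to warm up
```

```shell
kubectl create -f dataload.yaml -n development
```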

8. View the DataLoad status:
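```shell
# Wait until the DataLoad's PHASE shows Complete
kubectl get dataload -n development
```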

9. Check the cache effect; you can see that 38.4% of the data has been cached:
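```shell
# The CACHED PERCENTAGE column shows how much of the dataset is cached
kubectl get dataset spark -n development
```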

10. Copying the same 1.4 GiB of data again takes only 0.8 seconds, about 245 times faster than the uncached read:
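Re-running the same copy inside the Pod now hits the Alluxio cache:

```shell
kubectl exec -it nginx -n development -- bash -c "time cp -R /data/spark/* /tmp/"
```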

11. Create the production namespace
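```shell
kubectl create ns production
```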

12. In the namespace production, create a Dataset named spark that references the dataset in development; no separate Runtime needs to be created for it:
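In current Fluid versions, a Dataset can reference an existing dataset in another namespace by using a mountPoint of the form dataset://<namespace>/<dataset name>; Fluid binds it to the existing cache, so no new Runtime is deployed. A sketch:

```yaml
# dataset-ref.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
spec:
  mounts:
    # Reference the dataset "spark" in the namespace "development"
    - mountPoint: dataset://development/spark
```

```shell
kubectl create -f dataset-ref.yaml -n production
```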

13. View the Dataset; you can see that the spark dataset now appears in the namespace production, with its data already cached:
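```shell
# The referenced dataset should show as Bound with the cache already populated
kubectl get dataset spark -n production
```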

14. In the production namespace, create a Pod:
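The Pod manifest is the same as the one used in development; Fluid also creates a PVC named spark in production. Copying the data from this namespace should be fast immediately, with no second warm-up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: spark
  volumes:
    - name: spark
      persistentVolumeClaim:
        claimName: spark
```

```shell
kubectl create -f app.yaml -n production
kubectl exec -it nginx -n production -- bash -c "time cp -R /data/spark/* /tmp/"
```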

Summary

The example above shows how Fluid enables sharing datasets across namespaces. As a next step, Fluid will support cross-Namespace dataset access on Serverless Kubernetes, with no change to the overall user experience.
After that, we will support the Sub Dataset capability, that is, using a subdirectory of a Dataset as a Dataset in its own right, so that the same cache can serve different data scientists. Stay tuned.
