By Biran
Running AI and big data workloads on a cloud-native architecture gives you elastic computing resources. However, because computing and storage are separated, you also face data access latency and the high bandwidth cost of pulling data remotely. In GPU deep learning training in particular, repeatedly reading large volumes of training data from remote storage reduces GPU computing efficiency.
On the other hand, Kubernetes only provides a standard interface, the Container Storage Interface (CSI), for accessing and managing heterogeneous storage services. It does not define how applications use and manage data in container clusters. When running training tasks, data scientists need to manage Dataset versions, control access permissions, preprocess Datasets, accelerate reads of heterogeneous data, and more. Kubernetes has no standard solution for these needs, which is one of the important capabilities missing from the cloud-native container community.
Fluid abstracts the process of using data for computing tasks and proposes the concept of an elastic Dataset, which is implemented in Kubernetes as a first-class citizen. Fluid builds a data orchestration and acceleration system around the elastic Dataset to provide capabilities such as Dataset management (CRUD operations), permission control, and access acceleration.
Fluid has two core concepts: Dataset and Runtime.
In its earliest and default mode, Fluid binds one Dataset exclusively to one runtime, which can be understood as a Dataset accelerated by a dedicated cache cluster. The cache can be customized and optimized for the characteristics of the Dataset (such as single file size, number of files, and number of clients). Because each Dataset has its own caching system, this mode provides the best performance and stability, and Datasets do not interfere with each other. However, it wastes hardware resources because a cache system must be deployed for each Dataset, and maintaining and managing multiple cache runtimes is complex. This mode is a single-tenant architecture and suits scenarios with high requirements for data access throughput and latency.
As Fluid has been adopted more widely, different needs have emerged. For example, users create data-intensive jobs in several namespaces, and these jobs access the same Dataset. Or multiple data scientists share the same Dataset, and each data scientist submits jobs from an independent namespace. Redeploying the cache system and warming up the cache for each namespace causes data redundancy and job startup latency.
To save resources and simplify O&M, community users are willing to relax performance requirements and have begun asking for access to Datasets across namespaces. This cross-namespace requirement calls for a multi-tenant architecture: the cluster administrator points the runtime at the root directory of the storage, and multiple data scientists create Datasets in different namespaces that share the same runtime. Administrators can also configure subdirectories and different read and write permissions for data scientists in different namespaces.
No architectural choice is a silver bullet; each involves trade-offs. This article uses AlluxioRuntime as an example to explain how to use Fluid to share a runtime.
Imagine that User A preheats the Dataset spark in the Kubernetes namespace development, and User B accesses the Dataset spark from another namespace, production. Fluid lets User B access the cached data from the namespace production without preheating it a second time, which simplifies usage. The data is preheated once, and users in different namespaces all benefit.
1. Before you run the sample code, install Fluid (this feature is currently available only in the master branch) by referring to the installation documentation. Then check that the Fluid components are running properly:
NAME                                   READY   STATUS    RESTARTS   AGE
csi-nodeplugin-fluid-mwx59             2/2     Running   0          5m46s
csi-nodeplugin-fluid-tcbfd             2/2     Running   0          5m46s
csi-nodeplugin-fluid-zwm8t             2/2     Running   0          5m46s
dataset-controller-5c7557c4c5-q58bb    1/1     Running   0          5m46s
fluid-webhook-67fb7dffd6-h8ksp         1/1     Running   0          5m46s
fluidapp-controller-59b4fcfcb7-b8tx5   1/1     Running   0          5m46s
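The readiness check in step 1 can also be scripted. The following is a minimal sketch, not part of Fluid: `all_pods_ready` is a hypothetical helper that reads a `kubectl get pods` table on stdin and succeeds only if every pod is Running with all of its containers ready. The sample rows are copied from the listing in step 1.

```shell
# Hypothetical helper: reads `kubectl get pods` table output on stdin and
# succeeds only if every pod is Running with all containers ready.
all_pods_ready() {
  awk 'NR > 1 {
         split($2, r, "/")                  # READY column, e.g. "2/2"
         if ($3 != "Running" || r[1] != r[2]) bad++
         seen++
       }
       END { exit (seen == 0 || bad > 0) ? 1 : 0 }'
}

# Sample rows taken from the pod listing in step 1.
sample='NAME READY STATUS RESTARTS AGE
csi-nodeplugin-fluid-mwx59 2/2 Running 0 5m46s
dataset-controller-5c7557c4c5-q58bb 1/1 Running 0 5m46s
fluid-webhook-67fb7dffd6-h8ksp 1/1 Running 0 5m46s'

if printf '%s\n' "$sample" | all_pods_ready; then
  echo "all Fluid pods are ready"
else
  echo "some Fluid pods are not ready"
fi
```

On a live cluster, the same function could be fed by piping `kubectl get pods -n <namespace>` into it, where the namespace is whichever one Fluid was installed into (this article does not name it).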
2. Create a namespace development:
$ kubectl create ns development
3. Create a Dataset and AlluxioRuntime in the namespace development:
$ cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
  namespace: development
spec:
  mounts:
    - mountPoint: https://mirrors.bit.edu.cn/apache/spark/
      name: spark
      path: "/"
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: spark
  namespace: development
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
        high: "0.95"
        low: "0.7"
EOF
$ kubectl create -f dataset.yaml
4. View the status of a Dataset:
$ kubectl get dataset -A
NAMESPACE     NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
development   spark   3.41GiB          0.00B    4.00GiB          0.0%                Bound   2m54s
5. Create a Pod that accesses the Dataset in the namespace development:
$ cat<<EOF >app.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: development
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: spark
  volumes:
    - name: spark
      persistentVolumeClaim:
        claimName: spark
EOF
$ kubectl create -f app.yaml
6. View the data the application can access through the Dataset and copy it. Copying 1.4G of data (7 files) took 3 minutes and 16 seconds:
$ kubectl exec -it -n development nginx -- ls -ltr /data
total 2
dr--r----- 1 root root 6 Dec 4 15:39 spark-3.1.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.2.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.3.1
$ kubectl exec -it -n development nginx -- bash
root@nginx:/# time cp -R /data/spark-3.3.1/* /tmp
real 3m16.761s
user 0m0.021s
sys 0m3.520s
root@nginx:/# du -sh /tmp/
1.4G /tmp/
root@nginx:/# du -sh /tmp/*
348K /tmp/SparkR_3.3.1.tar.gz
269M /tmp/pyspark-3.3.1.tar.gz
262M /tmp/spark-3.3.1-bin-hadoop2.tgz
293M /tmp/spark-3.3.1-bin-hadoop3-scala2.13.tgz
286M /tmp/spark-3.3.1-bin-hadoop3.tgz
201M /tmp/spark-3.3.1-bin-without-hadoop.tgz
28M /tmp/spark-3.3.1.tgz
7. Load a specified Dataset subdirectory using dataload:
$ cat<<EOF >dataload.yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: spark
  namespace: development
spec:
  dataset:
    name: spark
    namespace: development
  target:
    - path: /spark-3.3.1
EOF
$ kubectl create -f dataload.yaml
8. View the dataload status:
$ kubectl get dataload -A
NAMESPACE     NAME    DATASET   PHASE      AGE     DURATION
development   spark   spark     Complete   5m47s   2m1s
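Rather than re-running the command by hand, the phase can be extracted from the table and polled. The sketch below is an illustration only: `dataload_phase` and `wait_for_dataload` are hypothetical helpers, and the column positions assume the `kubectl get dataload -A` layout shown in step 8. The polling function is defined but not invoked, because it needs kubectl and cluster access.

```shell
# Hypothetical helper: prints the PHASE of one DataLoad, given the table
# printed by `kubectl get dataload -A` on stdin.
dataload_phase() {
  # $1 = namespace, $2 = DataLoad name
  awk -v ns="$1" -v name="$2" 'NR > 1 && $1 == ns && $2 == name { print $4 }'
}

# Hypothetical polling loop (not run here): needs kubectl and a cluster.
wait_for_dataload() {
  # $1 = namespace, $2 = DataLoad name, $3 = max attempts (default 60)
  tries=0
  until [ "$(kubectl get dataload -A | dataload_phase "$1" "$2")" = "Complete" ]; do
    tries=$((tries + 1))
    [ "$tries" -ge "${3:-60}" ] && return 1
    sleep 5
  done
}

# Sample rows taken from the DataLoad listing in step 8.
sample='NAMESPACE NAME DATASET PHASE AGE DURATION
development spark spark Complete 5m47s 2m1s'

phase=$(printf '%s\n' "$sample" | dataload_phase development spark)
echo "DataLoad phase: $phase"
```

Parsing the human-readable table keeps the sketch tied to output that appears in this article; a production script would more likely query the resource status directly.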
9. Check the cache status. 38.4% of the data has been cached:
$ kubectl get dataset -n development
NAME    UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
spark   3.41GiB          1.31GiB   4.00GiB          38.4%               Bound   79m
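The CACHED PERCENTAGE column is the cached size divided by the UFS total size. As a quick sanity check, the sketch below recomputes it from the two sizes in the listing; `cached_percentage` is a hypothetical helper name.

```shell
# Hypothetical helper: cached size / UFS total size, as a percentage.
cached_percentage() {
  # $1 = cached size in GiB, $2 = UFS total size in GiB
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f%%\n", c / t * 100 }'
}

# Values from the listing in step 9: 1.31GiB cached out of 3.41GiB.
cached_percentage 1.31 3.41   # prints 38.4%
```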
10. Copy the same 1.4G of data again. This time it takes only about 0.87 seconds, roughly 225 times faster than the 3 minutes and 16 seconds measured before the data was cached:
$ kubectl exec -it -n development nginx -- bash
root@nginx:/# time cp -R /data/spark-3.3.1/* /tmp
real 0m0.872s
user 0m0.009s
sys 0m0.859s
root@nginx:/# du -sh /tmp/
1.4G /tmp/
root@nginx:/# du -sh /tmp/*
348K /tmp/SparkR_3.3.1.tar.gz
269M /tmp/pyspark-3.3.1.tar.gz
262M /tmp/spark-3.3.1-bin-hadoop2.tgz
293M /tmp/spark-3.3.1-bin-hadoop3-scala2.13.tgz
286M /tmp/spark-3.3.1-bin-hadoop3.tgz
201M /tmp/spark-3.3.1-bin-without-hadoop.tgz
28M /tmp/spark-3.3.1.tgz
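The speedup can be recomputed from the two `real` timings reported by `time` in steps 6 and 10. This is a minimal sketch; `to_seconds` and `speedup` are hypothetical helper names, and the parser only handles the `XmY.YYYs` form shown in this article.

```shell
# Converts a `time`-style duration such as "3m16.761s" to seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of the uncached duration ($1) to the cached duration ($2).
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
      'BEGIN { printf "%.1f\n", a / b }'
}

to_seconds 3m16.761s          # prints 196.761
speedup 3m16.761s 0m0.872s    # prints 225.6
```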
11. Create a production namespace:
$ kubectl create ns production
12. In the production namespace, create a Dataset named spark that references the initial Dataset. The mountPoint format is dataset://${namespace of the initial dataset}/${name of the initial dataset}. In this example, it is dataset://development/spark.
Note: A referencing Dataset currently supports only one mount, and that mount must use the dataset:// format. (Dataset creation fails if a dataset:// mount appears together with mounts in other formats.) Other fields in Spec have no effect.
$ cat<<EOF >spark-production.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: spark
  namespace: production
spec:
  mounts:
    - mountPoint: dataset://development/spark
EOF
$ kubectl create -f spark-production.yaml
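Because the mountPoint of a referencing Dataset must follow the dataset://${namespace}/${name} format, it can help to build and check the string before writing the manifest. The helpers below are hypothetical and only illustrate the format described in step 12; they are not Fluid's own validation logic.

```shell
# Hypothetical helper: builds the mountPoint for a referencing Dataset.
ref_mount_point() {
  # $1 = namespace of the initial Dataset, $2 = name of the initial Dataset
  printf 'dataset://%s/%s\n' "$1" "$2"
}

# Hypothetical check: accepts only dataset://<namespace>/<name>.
is_ref_mount_point() {
  case "$1" in
    dataset://*/*/*) return 1 ;;   # more than two path segments
    dataset://?*/?*) return 0 ;;   # non-empty namespace and name
    *)               return 1 ;;   # any other scheme or shape
  esac
}

mp=$(ref_mount_point development spark)
echo "$mp"   # prints dataset://development/spark

if is_ref_mount_point "$mp"; then
  echo "mountPoint format looks valid"
fi
```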
13. View the Dataset. The spark Dataset in the production namespace already reports the cached data:
$ kubectl get dataset -n production
NAME    UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
spark   3.41GiB          1.31GiB   4.00GiB          38.4%               Bound   14h
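To confirm that the referencing Dataset reports the same cache as the initial one, the two rows can be compared. The sketch below is illustrative: `cache_state` is a hypothetical helper, and the sample combines the per-namespace rows from steps 9 and 13 into the `kubectl get dataset -A` layout, which this article never shows as a single listing.

```shell
# Hypothetical helper: prints "CACHED CACHED-PERCENTAGE" for one Dataset,
# given the table printed by `kubectl get dataset -A` on stdin.
cache_state() {
  # $1 = namespace, $2 = Dataset name
  awk -v ns="$1" -v name="$2" \
      'NR > 1 && $1 == ns && $2 == name { print $4, $6 }'
}

# Assumed sample: rows assembled from the outputs in steps 9 and 13.
sample='NAMESPACE NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
development spark 3.41GiB 1.31GiB 4.00GiB 38.4% Bound 79m
production spark 3.41GiB 1.31GiB 4.00GiB 38.4% Bound 14h'

dev=$(printf '%s\n' "$sample" | cache_state development spark)
prod=$(printf '%s\n' "$sample" | cache_state production spark)

if [ -n "$dev" ] && [ "$dev" = "$prod" ]; then
  echo "production sees the same cache as development: $prod"
else
  echo "cache state differs: development=[$dev] production=[$prod]"
fi
```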
14. In the production namespace, create a pod:
$ cat<<EOF >app-production.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: production
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: spark
  volumes:
    - name: spark
      persistentVolumeClaim:
        claimName: spark
EOF
$ kubectl create -f app-production.yaml
15. Access the data from the production namespace. Copying the same 1.4G of data takes 0.878 seconds:
$ kubectl exec -it -n production nginx -- ls -ltr /data
total 2
dr--r----- 1 root root 6 Dec 4 15:39 spark-3.1.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.2.3
dr--r----- 1 root root 7 Dec 4 15:39 spark-3.3.1
$ kubectl exec -it -n production nginx -- bash
root@nginx:/# ls -ltr /tmp/
total 0
root@nginx:/# time cp -R /data/spark-3.3.1/* /tmp
real 0m0.878s
user 0m0.014s
sys 0m0.851s
root@nginx:/# du -sh /tmp
1.4G /tmp
root@nginx:/# du -sh /tmp/*
348K /tmp/SparkR_3.3.1.tar.gz
269M /tmp/pyspark-3.3.1.tar.gz
262M /tmp/spark-3.3.1-bin-hadoop2.tgz
293M /tmp/spark-3.3.1-bin-hadoop3-scala2.13.tgz
286M /tmp/spark-3.3.1-bin-hadoop3.tgz
201M /tmp/spark-3.3.1-bin-without-hadoop.tgz
28M /tmp/spark-3.3.1.tgz
The preceding example shows how to use Fluid to share a Dataset across namespaces. Next, Fluid will support cross-namespace Dataset access on Serverless Kubernetes, with no difference in the overall user experience.
We will also support SubDataset, which uses a subdirectory of a Dataset as a Dataset in its own right while sharing the same cache. Stay tuned.