JindoRuntime is based on C++ and supports dataset management, data caching, and data storage in OSS. Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime. This topic describes how to use Fluid to accelerate data access in scenarios in which ACS compute power is used.
Prerequisites
Object Storage Service (OSS) is activated. For more information, see Activate OSS.
ack-fluid 1.0.11-* or later is installed. For more information, see Use Helm to manage applications in ACS.
The privileged mode is enabled for ACS pods.
Note: The privileged mode is required to use Fluid to accelerate data access. To enable this mode, submit a ticket.
Procedure
Step 1: Upload data to OSS
Run the following command to download the test data:

```shell
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
```

Upload the test dataset to the OSS bucket.
Important: This example shows how to upload a test dataset to OSS from an ECS instance that runs the Alibaba Cloud Linux 3.2104 LTS 64-bit operating system. If you use another operating system, see ossutil command reference and ossutil 1.0.
Run the following command to create a bucket named examplebucket:

```shell
ossutil64 mb oss://examplebucket
```

Note: If the command returns ErrorCode=BucketAlreadyExists, a bucket with the same name already exists. OSS bucket names must be globally unique. Change the examplebucket name as needed.

Expected results:

```
0.668238(s) elapsed
```

If the preceding output is displayed, the bucket named examplebucket is created.

Upload the test dataset to examplebucket:

```shell
ossutil64 cp spark-3.0.1-bin-hadoop2.7.tgz oss://examplebucket
```

(Optional) Configure permissions to access the bucket and data. For more information, see Permission control.
Create a file named mySecret.yaml and add the following content to the file:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx
```

fs.oss.accessKeyId and fs.oss.accessKeySecret specify the AccessKey ID and AccessKey secret used to access OSS.

Run the following command to create the Secret. Kubernetes automatically encrypts Secrets to avoid disclosing sensitive data in plaintext.

```shell
kubectl create -f mySecret.yaml
```
Step 2: Create a dataset and a JindoRuntime
Create a file named resource.yaml and add the following content to the file. The file defines two resources:

- A Dataset that specifies information about the dataset in remote storage and the underlying file system (UFS).
- A JindoRuntime that launches a JindoFS cluster for data caching.

Note: Run the `kubectl get pods --field-selector=status.phase=Running -n fluid-system` command to check whether the dataset-controller and jindoruntime-controller of the ack-fluid component run as normal. In this example, CPU compute power is preferentially used. To accelerate the loading of large language models (LLMs), make sure that the zone of your cluster provides GPU resources. For more information, see Introduction to GPU compute classes.
The following table describes the parameters.

| Parameter | Description |
| --- | --- |
| mountPoint | oss://<oss_bucket> specifies the UFS path to mount. <oss_bucket> is the name of the OSS bucket. Example: oss://examplebucket. |
| fs.oss.endpoint | The endpoint of the OSS bucket. You can specify a public or internal endpoint. Example: oss-cn-beijing-internal.aliyuncs.com. For more information, see OSS regions and endpoints. |
| replicas | The number of workers in the JindoFS cluster. |
| mediumtype | The cache medium. When you create a JindoRuntime template, JindoFS supports only one of the following types: HDD, SSD, and MEM. |
| path | The cache storage path. You can specify only one path. If you set mediumtype to MEM, you must specify a local storage path to store data, such as log data. |
| quota | The maximum size of the cached data. Unit: GB. |
| high | The upper limit of the storage capacity. |
| low | The lower limit of the storage capacity. |
Run the following command to create a dataset and a JindoRuntime:
```shell
kubectl create -f resource.yaml
```

View the deployment of the JindoRuntime and dataset.
View the deployment of the dataset:

```shell
kubectl get dataset hadoop
```

Expected results:

```
NAME     UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   209.74MiB        0.00B    4.00GiB          0.0%                Bound   56s
```

View the deployment of the JindoRuntime:

```shell
kubectl get jindoruntime hadoop
```

Expected results:

```
NAME     MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
hadoop   Ready          Ready          Ready        2m11s
```

The preceding outputs indicate that the dataset and JindoRuntime are created.
Run the following command to check whether the persistent volume (PV) and persistent volume claim (PVC) are created. The PV and PVC use the name of the dataset.

```shell
kubectl get pv,pvc
```

Expected results:

```
NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
persistentvolume/default-hadoop   100Pi      ROX            Retain           Bound    default/hadoop   fluid          <unset>                          2m5s

NAME                           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/hadoop   Bound    default-hadoop   100Pi      ROX            fluid          <unset>                 2m5s
```
Step 3: Create a DataLoad resource
To accelerate data loading and ensure the validity of the data processing logic, you need to preload the dataset once.
If the model data stored in the OSS bucket is static, create a file named dataload.yaml and add the following content to the file to preload the data:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: hadoop
spec:
  dataset:
    name: hadoop
    namespace: default
  loadMetadata: true
```

If the model data stored in the OSS bucket is dynamic, you must periodically preload the data. For more information, see Scenario 2: Data in the backend storage is read-only but periodically changes.
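For the dynamic case, newer Fluid releases also support cron-triggered DataLoads. The following is a hedged sketch; the policy and schedule fields are assumptions to verify against your Fluid version:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: hadoop-cron
spec:
  dataset:
    name: hadoop
    namespace: default
  loadMetadata: true
  policy: Cron          # Assumption: cron-triggered DataLoad support in your Fluid version.
  schedule: "0 * * * *" # Preload every hour, at minute 0.
```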
Create the DataLoad resource to preload the model data once:

```shell
kubectl create -f dataload.yaml
```

View the status of the preloading:

```shell
kubectl get dataload
```

Expected results:

```
NAME     DATASET   PHASE      AGE   DURATION
hadoop   hadoop    Complete   92m   51s
```
Step 4: Create pods to verify data acceleration
You can create pods or submit a machine learning job to verify the JindoFS data acceleration service. In this example, an application is deployed in a container to test access to the same data. The test is run multiple times to compare the time consumption.
Create a file named app.yaml by using the following YAML template:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  labels:
    # For ACS pods, the Fluid webhook injects Fluid-related components as
    # sidecar containers. You must configure the following label:
    alibabacloud.com/fluid-sidecar-target: acs
spec:
  containers:
    - name: demo
      image: mirrors-ssl.aliyuncs.com/nginx:latest
      volumeMounts:
        - mountPath: /data
          name: hadoop
      resources:
        requests:
          cpu: 14
          memory: 56Gi
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        # The name of the Fluid dataset.
        claimName: hadoop
  nodeSelector:
    type: virtual-kubelet
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Equal
      value: alibabacloud
      effect: NoSchedule
```

Run the following command to create the application pod:
```shell
kubectl create -f app.yaml
```

Test the file copy speed without using JindoFS caches.

View the size of the test file:

```shell
kubectl exec -it demo-app -c demo -- du -sh /data/spark-3.0.1-bin-hadoop2.7.tgz
```

Expected results:

```
210M	/data/spark-3.0.1-bin-hadoop2.7.tgz
```

View the amount of time required to copy the file:

```shell
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
```

Expected results:

```
real	0m1.883s
user	0m0.001s
sys	0m0.041s
```

The preceding output indicates that it takes 1.883 seconds to copy the file.
View the dataset cache status:

```shell
kubectl get dataset hadoop
```

Expected results:

```
NAME     UFS TOTAL SIZE   CACHED      CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hadoop   209.74MiB        209.74MiB   4.00GiB          100.0%              Bound   64m
```

The preceding output indicates that 100.0% of the data is cached by JindoFS.

Delete the sample pod and then view the file copy time again.
Note: You must delete the sample pod to eliminate the impact of other factors, such as page caches. If the pod retains a local page cache, the system preferentially copies the file from that cache.
Run the following commands to log on to the pod and query the time required to copy the file:

```shell
kubectl exec -it demo-app -c demo -- bash
time cp /data/spark-3.0.1-bin-hadoop2.7.tgz /dev/null
```

Expected results:

```
real	0m0.203s
user	0m0.000s
sys	0m0.047s
```

The preceding output indicates that it takes 0.203 seconds to copy the file, about nine times faster than the first attempt. This is because the file is already cached by JindoFS, and reads from the cache are much faster.
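As a quick sanity check on the speedup, divide the uncached wall-clock time by the cached one, using the values from the two outputs:

```shell
# Speedup = uncached copy time / cached copy time (values from the outputs above).
awk 'BEGIN{printf "%.1f\n", 1.883/0.203}'
# prints 9.3
```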
Important: The copy time provided in this topic is for reference only.
Use ACS compute power in ACK Pro clusters
This topic describes how to use JindoFS to accelerate file copy operations based on ACS clusters. You can also use ACS compute power in ACK managed clusters to complete the operation. For more information, see Use the computing power of ACS in ACK Pro clusters.
To verify data acceleration in an ACK managed cluster, make the following adjustments:
Install the ack-fluid component in the ACK managed cluster. For more information, see Use Helm to simplify application deployment.
Create a dataset and a JindoRuntime based on the following content:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  mounts:
    # To specify subdirectories, configure oss://<oss_bucket>/{oss_path}.
    - mountPoint: oss://<oss_bucket>      # Replace <oss_bucket> with the actual value.
      options:
        fs.oss.endpoint: <oss_endpoint>   # Replace <oss_endpoint> with the actual value.
      name: hadoop
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  # Modify on demand.
  replicas: 4
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir
        quota: 48Gi
        high: "0.99"
        low: "0.95"
```

Differences between ACK managed clusters and ACS clusters:
- The nodes of ACS clusters are virtual nodes and cannot be scaled in the same way as ACK cluster nodes. Therefore, you must configure `.spec.placement: Shared` and the network mode.
- Fluid workers require high bandwidth. Make sure that your ACS cluster has sufficient bandwidth. To do this, configure `compute-class: performance` and `resources` to ensure that the ACS pods have sufficient bandwidth.
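The two ACS-specific settings above can be sketched as follows. This is a hedged illustration only: the networkmode value, the podMetadata field, and the alibabacloud.com/compute-class label key are assumptions that depend on your Fluid and ACS versions, so verify them before use.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  placement: "Shared"              # Required because ACS nodes are virtual nodes.
  mounts:
    - mountPoint: oss://<oss_bucket>  # Replace with the actual value.
      name: hadoop
      path: "/"
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  networkmode: ContainerNetwork    # Assumption: ACS pods cannot use the host network.
  podMetadata:
    labels:
      # Assumption: request the performance compute class so workers get sufficient bandwidth.
      alibabacloud.com/compute-class: performance
  replicas: 4
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir
        quota: 48Gi
        high: "0.99"
        low: "0.95"
```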