Fluid combined with JindoFS

1. What is Fluid?

Fluid is an open source, Kubernetes-native engine for orchestrating and accelerating distributed datasets. It mainly serves data-intensive applications in cloud native scenarios, such as big data and AI applications. Through a data-layer abstraction built on Kubernetes, data can be moved, replicated, evicted, transformed, and managed flexibly and efficiently, like a fluid, between storage sources such as HDFS, OSS, and Ceph and the cloud native applications running on top of Kubernetes. The specific data operations are transparent to users. Users no longer need to worry about the efficiency of accessing remote data, the convenience of managing data sources, or how to help Kubernetes make operations and scheduling decisions. They simply access the abstracted data through native Kubernetes data volumes, and the remaining tasks and underlying details are handed over to Fluid.

The Fluid project currently focuses on two scenarios: dataset orchestration and application orchestration. Dataset orchestration caches the data of a specified dataset on Kubernetes nodes with specified characteristics, while application orchestration schedules a specified application to nodes that can store, or already store, the specified dataset. The two can also be combined into collaborative orchestration, in which node resources are scheduled by considering both dataset and application requirements.

Next, consider the concept of a Dataset in Fluid. A dataset is a logically related collection of data that is used by computing engines, such as Spark for big data and TensorFlow for AI. Using and scheduling datasets intelligently creates core value for the industry. Dataset management also has several dimensions, such as security, version management, and data acceleration.

We hope to support dataset management, starting from data acceleration. On top of a dataset, we define an execution engine called a Runtime, which provides dataset security, version management, and data acceleration. A Runtime defines a series of lifecycle interfaces that can be implemented to support the management and acceleration of datasets. Fluid currently supports two Runtime types: AlluxioRuntime and JindoRuntime. Fluid's goal is to provide an efficient and convenient data abstraction layer for cloud native AI and big data applications, abstracting data away from storage to achieve the following:

1. Data affinity scheduling and distributed cache acceleration bring data and computing together, which speeds up data access for computing.

2. Data is managed independently of storage, and resources are isolated through Kubernetes namespaces to achieve secure data isolation.

3. Data from different storage systems can be combined for computation, which offers a chance to break down the data silos caused by differences between storage systems.

2. What is JindoRuntime?

To understand Fluid's JindoRuntime, we first need to introduce JindoFS, which is the engine layer of JindoRuntime.

JindoFS is a big data storage optimization engine developed by Alibaba Cloud for OSS. It is fully compatible with the Hadoop file system interface and gives customers more flexible and efficient compute and storage options. It has been verified to support all computing services and engines in Alibaba Cloud EMR, including Spark, Flink, Hive, MapReduce, Presto, and Impala. JindoFS has two usage modes: block mode and cache mode.

Block mode stores file contents on OSS as data blocks and can optionally keep local replicas of the data for cache acceleration. A local namespace service manages the metadata, and files are reconstructed from the local metadata and the block data.

Cache mode stores files on OSS as ordinary objects, so it is compatible with the existing OSS file system, and users can still access the original directory structure and files through OSS. This mode also provides data and metadata caching to accelerate reads and writes. Users of this mode do not need to migrate any data and can work seamlessly with data that already exists on OSS. For metadata synchronization, users can choose among different strategies according to their needs.

In Fluid, JindoRuntime uses the cache mode of JindoFS to access and cache remote files. If you need to use JindoFS on its own in other environments to access OSS, you can also download the JindoFS SDK and deploy it according to its documentation. JindoRuntime comes from the JindoFS distributed system developed by the Alibaba Cloud EMR team. It is an implementation of the execution engine that supports dataset management and caching. Fluid manages and schedules JindoRuntime to provide dataset visibility, elastic scaling, data migration, computing acceleration, and so on. Using and deploying JindoRuntime on Fluid is simple: it is compatible with native Kubernetes environments and works out of the box. It is deeply integrated with object storage features, uses a native framework to optimize performance, and supports cloud data security functions such as password-free access and checksum verification.

3. Advantages of JindoRuntime

JindoRuntime provides access to and caching for the Alibaba Cloud OSS object storage service. It exposes a POSIX file system interface through FUSE, so massive numbers of files on OSS can be used as if they were on a local disk. It has the following features:

1. Excellent performance

• Outstanding OSS read and write performance: it is deeply integrated with OSS to improve read and write efficiency and stability, optimizes the OSS access interface through a native layer, and improves cold data access performance, especially for reading and writing small files.

• Rich distributed cache strategies: it supports caching single files at terabyte scale, metadata caching, and more. In tests of large-scale AI training and data lake scenarios, it shows outstanding performance advantages.

2. Safe and reliable

• Authentication security: supports password-free access through STS on Alibaba Cloud and encryption of keys with native Kubernetes secrets.

• Data security: checksum verification, client-side data encryption, and other security policies protect data and user information on the cloud.

3. Easy to use

• Supports native Kubernetes environments, uses custom resource definitions (CRDs), and integrates with the data volume concept. The deployment process is simple and works out of the box.

4. Lightweight

• The underlying layer is written in C++, the overall structure is lightweight, and the extra overhead added to the various OSS access interfaces is small.

4. How does JindoRuntime perform?

We trained the ResNet-50 model on the ImageNet dataset in a Kubernetes cluster, using Arena to run the training. With local caching enabled, JindoRuntime (based on JindoFS) performed significantly better than the open source OSSFS, and training time was reduced by 76%. This test scenario will be described in detail in a follow-up article.

5. How to use JindoRuntime quickly

Using JindoRuntime is simple. With a basic Kubernetes and OSS environment in place, the required JindoRuntime environment can be deployed in about 10 minutes by following the steps below.

1. Create the namespace

kubectl create ns fluid-system

2. Download fluid-0.5.0.tgz

3. Install Fluid with Helm

helm install --set runtime.jindo.enabled=true fluid fluid-0.5.0.tgz

4. View the running status of Fluid

The number of csi-nodeplugin-fluid-xx pods should be the same as the number of nodes in the Kubernetes cluster.

5. Create the dataset and JindoRuntime
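One way to make this check is to compare the two counts directly. The sketch below is an illustration rather than part of the official installation steps: the helper name check_csi_plugins is hypothetical, and it assumes the plugin pods run in the fluid-system namespace created above and that their names start with csi-nodeplugin-fluid-.

```shell
# Hypothetical helper: print the node count and the csi-nodeplugin-fluid pod count.
check_csi_plugins() {
  # Count the nodes in the cluster.
  nodes=$(kubectl get nodes --no-headers | wc -l)
  # Count the Fluid CSI node-plugin pods in the fluid-system namespace.
  plugins=$(kubectl get pods -n fluid-system --no-headers | grep -c '^csi-nodeplugin-fluid-')
  # Arithmetic expansion strips any padding that wc may add.
  echo "nodes=$((nodes)) csi-nodeplugin-fluid=$((plugins))"
}
```

Calling check_csi_plugins after the Helm install should print two equal numbers once a plugin pod has been created on every node.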

Before creating the dataset, we can create a secret that holds the OSS fs.oss.accessKeyId and fs.oss.accessKeySecret values, so that they are not exposed in clear text. Kubernetes stores the values of a created secret in encoded form. Fill the key and secret information into the mySecret.yaml file.
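A minimal sketch of what mySecret.yaml could look like is shown below. The secret name mysecret and the placeholder values are assumptions made for illustration; only the two key names come from the text above.

```shell
# Write a Secret manifest; replace the placeholder values with real OSS credentials.
cat > mySecret.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: mysecret            # hypothetical name, chosen for this example
stringData:
  fs.oss.accessKeyId: <your-access-key-id>
  fs.oss.accessKeySecret: <your-access-key-secret>
EOF
```

The secret can then be created with kubectl create -f mySecret.yaml.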

Create a resource.yaml file that contains two parts:

1. The first part contains the dataset and UFS information. It creates a Dataset CRD object, which describes the source of the dataset.

2. The second part creates a JindoRuntime, which is equivalent to starting a JindoFS cluster to provide caching services.

The main fields in resource.yaml are:

1. mountPoint: the UFS path to mount, in the form oss://<oss_bucket>/<bucket_dir>. The path does not need to contain endpoint information.

2. fs.oss.endpoint: the endpoint of the OSS bucket. Either a public or an intranet address can be used.

3. replicas: the number of workers to create in the JindoFS cluster.

4. mediumtype: the cache medium. JindoFS currently supports only one of HDD, SSD, or MEM.

5. path: the storage path. Currently only one disk is supported. When MEM is selected as the cache medium, a disk is still needed to store logs and other files.

6. quota: the maximum cache capacity, in Gi.

7. high: the upper watermark of the cache; low: the lower watermark.
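Putting the two parts and the fields above together, a resource.yaml might look like the following sketch. It assumes the data.fluid.io/v1alpha1 API version, uses the name hadoop to match the cleanup commands later in this article, and references a secret named mysecret (a hypothetical name) that holds the two OSS keys. The bucket, directory, endpoint, cache path, and numeric values are placeholders to adapt.

```shell
# Write the Dataset and JindoRuntime manifests into one file.
cat > resource.yaml <<'EOF'
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hadoop
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/<bucket_dir>   # UFS path, no endpoint
      name: hadoop
      options:
        fs.oss.endpoint: <oss_endpoint>
      encryptOptions:                               # read the keys from the secret
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: hadoop
spec:
  replicas: 2                # number of JindoFS workers
  tieredstore:
    levels:
      - mediumtype: HDD      # one of HDD, SSD, MEM
        path: /mnt/disk1     # cache storage path (a single disk)
        quota: 100Gi         # maximum cache capacity
        high: "0.99"         # upper watermark
        low: "0.8"           # lower watermark
EOF
```

The Dataset and the JindoRuntime share the same name here, which is how this sketch assumes Fluid pairs a dataset with its runtime.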

kubectl create -f resource.yaml

View the dataset

6. Create an application container to experience the acceleration effect
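The dataset can be listed with kubectl get dataset hadoop. For use in scripts, the sketch below wraps a narrower query in a hypothetical helper named dataset_phase. It assumes that the Dataset object exposes a status.phase field, which is expected to read Bound once the runtime is ready.

```shell
# Hypothetical helper: print the phase of a dataset (for example, Bound).
dataset_phase() {
  # $1 is the dataset name, such as hadoop.
  kubectl get dataset "$1" -o jsonpath='{.status.phase}'
}
```

For example, dataset_phase hadoop prints the current phase of the dataset created above.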

You can try the JindoFS acceleration service by creating an application container, or by submitting machine learning jobs.

Next, we create an application container, defined in app.yaml, that uses this dataset. We access the same data several times and compare the access times to show the acceleration effect of JindoRuntime.
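As an illustration, app.yaml could define a single pod that mounts the dataset as a volume. In the sketch below, the pod name demo-app, the nginx image, and the /data mount path are assumptions; it also assumes that Fluid exposes the dataset through a persistent volume claim with the same name as the dataset (hadoop).

```shell
# Write a Pod manifest that mounts the dataset's volume claim.
cat > app.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: demo-app             # hypothetical pod name
spec:
  containers:
    - name: demo
      image: nginx           # any image with a shell works for this test
      volumeMounts:
        - mountPath: /data   # where the dataset appears inside the container
          name: hadoop
  volumes:
    - name: hadoop
      persistentVolumeClaim:
        claimName: hadoop    # assumes the claim is named after the dataset
EOF
```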

Use kubectl to create it:

kubectl create -f app.yaml
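To compare access times, the same read can be timed before and after the data is cached. The helper below is a sketch, not the exact command used for the measurements in this article: the name time_read is hypothetical, the pod name and file path are placeholders passed as arguments, and it assumes the container image provides bash.

```shell
# Hypothetical helper: time a full read of one file from inside a pod.
time_read() {
  # $1 = pod name, $2 = path of a file under the dataset mount (placeholders).
  # The file path is handed to bash as a positional argument to avoid quoting issues.
  kubectl exec "$1" -- bash -c 'time cp "$1" /dev/null' _ "$2"
}
```

Running time_read with the same pod name and file path for the first access and again for a later access gives two timings that can be compared.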

Check the dataset's cache status at this point: 210MB of data has been cached locally.

To keep other factors (such as the page cache) from affecting the results, we delete the previous container, create the same application again, and access the same file. Because JindoFS has already cached the file, the second access takes much less time than the first.

kubectl delete -f app.yaml && kubectl create -f app.yaml

The file copy is now observed to take 48 ms, which is about 300 times shorter than the first copy.

7. Clean up the environment

1. Delete the application and its container

kubectl delete -f app.yaml

2. Delete JindoRuntime

kubectl delete jindoruntime hadoop

3. Delete dataset

kubectl delete dataset hadoop

This completes a first hands-on introduction to JindoFS on Fluid through a simple example, ending with cleaning up the environment. More functions of Fluid JindoRuntime will be introduced in detail in follow-up articles.
