Share Datasets Across Namespaces via Fluid

When multiple teams run AI or ML workloads in separate Kubernetes namespaces, each team creating its own cache wastes storage and slows down data access. Fluid lets you cache a dataset once in a source namespace and share that cache with any number of reference namespaces — no duplicate caches, no extra runtime overhead.

The setup uses two namespace roles:

- Source namespace (share): holds the Dataset and a cache runtime (JindoRuntime or JuiceFSRuntime). This is where the actual data cache lives.
- Reference namespace (ref): holds a reference Dataset that points to the source Dataset using a dataset:// mount point. Pods in this namespace read from the shared cache without running their own cache runtime.

How it works

Fluid uses ThinRuntime to link a Dataset in one namespace to a Dataset in another. When a pod in the reference namespace reads data, requests are routed to the cache runtime in the source namespace. No additional cache runtime is created in the reference namespace.
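
As a rough sketch of how one cached Dataset can serve several namespaces, the helper below prints a reference Dataset manifest for a given namespace. The function name `render_ref_dataset` and the namespaces `team-a` and `team-b` are hypothetical. The manifest shape mirrors the reference Dataset created in Step 3.

```shell
#!/bin/sh
# Print a reference Dataset manifest for one namespace.
# Arguments: REF_NAMESPACE SOURCE_NAMESPACE SOURCE_DATASET
render_ref_dataset() {
  cat <<EOF
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: ref-dataset
  namespace: $1
spec:
  mounts:
  - mountPoint: dataset://$2/$3
EOF
}

# Usage against a live cluster (hypothetical namespaces that must already exist):
# for ns in team-a team-b; do
#   render_ref_dataset "$ns" share shared-dataset | kubectl apply -f -
# done
```

Each reference namespace gets its own small Dataset object, while the cache itself stays in the source namespace.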

Prerequisites

Before you begin, ensure that you have:

- An ACK Pro cluster running Kubernetes 1.18 or later, with a non-ContainerOS node pool (the ack-fluid component does not support ContainerOS). For more information, see Create an ACK Pro cluster.
- The cloud-native AI suite installed with the ack-fluid component deployed:
  - If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install it. For more information, see Deploy the cloud-native AI suite.
  - If the cloud-native AI suite is already installed, go to the Cloud-native AI Suite page in the ACK console and deploy the ack-fluid component.
  - If open-source Fluid is already installed, uninstall it before deploying the ack-fluid component.
- A kubectl client connected to your ACK Pro cluster. For more information, see Connect to a cluster by using kubectl.

Step 1: Upload the test dataset to OSS

Download the test dataset (approximately 2 GB).
Upload the dataset to your Object Storage Service (OSS) bucket using ossutil. For more information, see Install ossutil.
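
The upload can be sketched as a small wrapper around `ossutil cp`. This assumes ossutil is already installed and configured with credentials for the bucket. The function name `upload_dataset` is hypothetical, and the bucket and directory are placeholders you must fill in.

```shell
#!/bin/sh
# Upload a local file to oss://BUCKET/DIR/ with ossutil.
# Arguments: LOCAL_FILE BUCKET DIR
upload_dataset() {
  ossutil cp "$1" "oss://$2/$3/"
}

# Usage (bucket and directory are placeholders; the file name is the test
# file that Step 4 reads back):
# upload_dataset wwm_uncased_L-24_H-1024_A-16.zip <oss_bucket> <bucket_dir>
```

The bucket and directory passed here should match the mountPoint value used for the Dataset in Step 2.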

Step 2: Create a shared dataset and runtime

Create a namespace named share to hold the shared Dataset and runtime. Choose the runtime type that matches your storage setup.

JindoRuntime

Create the share namespace:
```
kubectl create ns share
```

Create a Secret to store the AccessKey pair for your OSS bucket:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
stringData:
  fs.oss.accessKeyId: <YourAccessKey ID>
  fs.oss.accessKeySecret: <YourAccessKey Secret>
EOF
```

Replace <YourAccessKey ID> and <YourAccessKey Secret> with your AccessKey ID and AccessKey secret. For more information, see Obtain an AccessKey pair.
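
If you prefer not to paste credentials into the manifest by hand, one option is to render the same Secret from environment variables. This is a minimal sketch: the variable names `OSS_ACCESS_KEY_ID` and `OSS_ACCESS_KEY_SECRET` and the function name are assumptions, and the values are assumed to contain no double quotes or backslashes.

```shell
#!/bin/sh
# Print the dataset-secret manifest using credentials from the environment.
# OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are hypothetical variable names.
render_dataset_secret() {
  if [ -z "${OSS_ACCESS_KEY_ID:-}" ] || [ -z "${OSS_ACCESS_KEY_SECRET:-}" ]; then
    echo "OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET must be set" >&2
    return 1
  fi
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
stringData:
  fs.oss.accessKeyId: "${OSS_ACCESS_KEY_ID}"
  fs.oss.accessKeySecret: "${OSS_ACCESS_KEY_SECRET}"
EOF
}

# Usage against a live cluster:
# render_dataset_secret | kubectl apply -f -
```

The function refuses to print anything when either variable is missing, so an incomplete Secret is never applied.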

Create a file named shared-dataset.yaml with the following content:

```
# Dataset: describes the data stored in OSS (the underlying file system, UFS).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/<bucket_dir> # Path to the data in your OSS bucket.
    options:
      fs.oss.endpoint: <oss_endpoint> # Endpoint of your OSS bucket.
    name: hadoop
    path: "/"
    encryptOptions:
      - name: fs.oss.accessKeyId
        valueFrom:
          secretKeyRef:
            name: dataset-secret
            key: fs.oss.accessKeyId
      - name: fs.oss.accessKeySecret
        valueFrom:
          secretKeyRef:
            name: dataset-secret
            key: fs.oss.accessKeySecret
---
# JindoRuntime: enables JindoFS-based data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 4Gi
        high: "0.95"
        low: "0.7"
```

For more information about configuring a Dataset and JindoRuntime, see Use JindoFS to accelerate access to OSS.

Apply the configuration:

```
kubectl apply -f shared-dataset.yaml
```

Expected output:

```
dataset.data.fluid.io/shared-dataset created
jindoruntime.data.fluid.io/shared-dataset created
```

Wait a few minutes, then verify that the Dataset is bound and the JindoRuntime is ready:

```
kubectl get dataset,jindoruntime -n share
```

Expected output:

```
NAME                                   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset   1.16GiB          0.00B    4.00GiB          0.0%                Bound   4m1s

NAME                                        MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
jindoruntime.data.fluid.io/shared-dataset   Ready          Ready          Ready        15m
```

The Dataset is bound when its PHASE column shows Bound. The JindoRuntime is ready when its MASTER PHASE, WORKER PHASE, and FUSE PHASE columns all show Ready.
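
Rather than re-running the command by hand, the wait can be scripted with a small polling helper. This is a sketch under assumptions: the helper name `wait_for_value` is hypothetical, and the usage comment assumes that the PHASE column of a Dataset is exposed as `.status.phase`.

```shell
#!/bin/sh
# wait_for_value EXPECTED ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Re-run CMD until its output equals EXPECTED or ATTEMPTS is used up.
wait_for_value() {
  wfv_expected="$1"; wfv_attempts="$2"; wfv_delay="$3"
  shift 3
  wfv_i=1
  wfv_actual=""
  while [ "$wfv_i" -le "$wfv_attempts" ]; do
    # A failing command counts as "no value yet" rather than aborting.
    wfv_actual="$("$@" 2>/dev/null)" || wfv_actual=""
    if [ "$wfv_actual" = "$wfv_expected" ]; then
      echo "matched after $wfv_i attempt(s)"
      return 0
    fi
    sleep "$wfv_delay"
    wfv_i=$((wfv_i + 1))
  done
  echo "gave up; last value: '$wfv_actual'" >&2
  return 1
}

# Usage against a live cluster (assumes the phase is stored in .status.phase):
# wait_for_value Bound 30 10 \
#   kubectl get dataset shared-dataset -n share -o jsonpath='{.status.phase}'
```

The same helper works for the JuiceFSRuntime variant, because it only compares the output of whatever command it is given.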

JuiceFSRuntime

Create the share namespace:
```
kubectl create ns share
```

Create a Secret to store the credentials for your OSS bucket and JuiceFS volume:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: dataset-secret
  namespace: share
type: Opaque
stringData:
  token: <JUICEFS_VOLUME_TOKEN>
  access-key: <OSS_ACCESS_KEY>
  secret-key: <OSS_SECRET_KEY>
EOF
```

Replace <JUICEFS_VOLUME_TOKEN> with the token of your JuiceFS volume, and replace <OSS_ACCESS_KEY> and <OSS_SECRET_KEY> with your AccessKey ID and AccessKey secret. For more information, see Obtain an AccessKey pair.

Create a file named shared-dataset.yaml with the following content:

```
# Dataset: describes the data stored in OSS (the underlying file system, UFS).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: shared-dataset
  namespace: share
spec:
  accessModes: ["ReadOnlyMany"]
  sharedEncryptOptions:
  - name: access-key
    valueFrom:
      secretKeyRef:
        name: dataset-secret
        key: access-key
  - name: secret-key
    valueFrom:
      secretKeyRef:
        name: dataset-secret
        key: secret-key
  - name: token
    valueFrom:
      secretKeyRef:
        name: dataset-secret
        key: token
  mounts:
  - name: <JUICEFS_VOLUME_NAME>
    mountPoint: juicefs:/// # Mount point of the JuiceFS file system.
    options:
      bucket: https://<OSS_BUCKET_NAME>.oss-<REGION_ID>.aliyuncs.com # Example: https://mybucket.oss-cn-beijing-internal.aliyuncs.com
---
# JuiceFSRuntime: enables JuiceFS-based data caching in the cluster.
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: shared-dataset
  namespace: share
spec:
  replicas: 1
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      quota: 1Gi
      high: "0.95"
      low: "0.7"
```

Apply the configuration:

```
kubectl apply -f shared-dataset.yaml
```

Expected output:

```
dataset.data.fluid.io/shared-dataset created
juicefsruntime.data.fluid.io/shared-dataset created
```

Wait a few minutes, then verify that the Dataset is bound:

```
kubectl get dataset,juicefsruntime -n share
```

Expected output:

```
NAME                                   UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
dataset.data.fluid.io/shared-dataset   2.32GiB          0.00B    4.00GiB          0.0%                Bound   3d16h

NAME                                          WORKER PHASE   FUSE PHASE   AGE
juicefsruntime.data.fluid.io/shared-dataset                               3m50s
```

Step 3: Create a reference dataset and a pod

Create the ref namespace:
```
kubectl create ns ref
```
Create a file named ref-dataset.yaml with the following content:
```
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: ref-dataset
  namespace: ref
spec:
  mounts:
  - mountPoint: dataset://share/shared-dataset
```
The mountPoint value follows the format dataset://<namespace>/<dataset-name>:
- dataset:// is the protocol prefix, which indicates that this Dataset references another Dataset.
- share is the namespace where the source Dataset lives.
- shared-dataset is the name of the source Dataset.
Important
The mountPoint value must use the dataset:// protocol prefix. Any other format causes dataset creation to fail, and fields in the spec section have no effect.
Apply the reference Dataset:
```
kubectl apply -f ref-dataset.yaml
```
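
Because a malformed mountPoint makes dataset creation fail, it can help to check the value before applying the manifest. The sketch below splits a dataset:// reference into its namespace and Dataset name and rejects anything that does not match the two-part form described in this step. The function name `parse_dataset_ref` is hypothetical, and this is a local shape check only, not Fluid's own validation.

```shell
#!/bin/sh
# Split dataset://<namespace>/<dataset-name> into "<namespace> <dataset-name>".
# Returns 1 if the value does not match that shape.
parse_dataset_ref() {
  case "$1" in
    dataset://*/*) ;;
    *) echo "not a dataset:// reference: $1" >&2; return 1 ;;
  esac
  pdr_rest="${1#dataset://}"    # e.g. share/shared-dataset
  pdr_ns="${pdr_rest%%/*}"      # e.g. share
  pdr_name="${pdr_rest#*/}"     # e.g. shared-dataset
  case "$pdr_name" in
    ""|*/*) echo "expected <namespace>/<dataset-name>: $1" >&2; return 1 ;;
  esac
  [ -n "$pdr_ns" ] || { echo "empty namespace: $1" >&2; return 1; }
  echo "$pdr_ns $pdr_name"
}

parse_dataset_ref dataset://share/shared-dataset   # prints: share shared-dataset
```

The two values it prints are the namespace and the name of the source Dataset created in Step 2.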

Create a file named app.yaml with the following content. This creates a pod in the ref namespace that mounts the reference Dataset at /data.

```
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: ref
spec:
  containers:
  - name: nginx
    image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
    command:
    - "bash"
    - "-c"
    - "sleep inf"
    volumeMounts:
    - mountPath: /data
      name: ref-data
  volumes:
  - name: ref-data
    persistentVolumeClaim:
      claimName: ref-dataset
```

Deploy the pod:
```
kubectl apply -f app.yaml
```
Verify that the pod is running:
```
kubectl get pods -n ref -o wide
```
The pod is ready when its status shows Running.
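
Instead of re-running kubectl get pods until the status changes, you can block until the pod reports the Ready condition. This is a minimal sketch built on `kubectl wait`. The wrapper name `wait_for_pod_ready` and the 120s default timeout are assumptions, not values from this topic.

```shell
#!/bin/sh
# Block until the pod reports the Ready condition, or fail after TIMEOUT.
# Arguments: POD NAMESPACE [TIMEOUT]; TIMEOUT defaults to 120s (an assumption).
wait_for_pod_ready() {
  kubectl wait --for=condition=Ready "pod/$1" -n "$2" --timeout="${3:-120s}"
}

# Usage against a live cluster:
# wait_for_pod_ready nginx ref
```

A non-zero exit status means the pod did not become Ready within the timeout.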

Step 4: Test data sharing and caching

Check the pods in both namespaces:

```
kubectl get pods -n share
kubectl get pods -n ref
```

Expected output:

```
# Pods in the share namespace
NAME                                READY   STATUS    RESTARTS   AGE
shared-dataset-jindofs-fuse-ftkb5   1/1     Running   0          44s
shared-dataset-jindofs-master-0     1/1     Running   0          9m13s
shared-dataset-jindofs-worker-0     1/1     Running   0          9m13s

# Pods in the ref namespace
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          118s
```

Three cache-related pods run in the share namespace. The ref namespace has only the nginx pod — no cache runtime pods are created there.
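
The same check can be scripted by counting cache pods in each namespace. The sketch below reads `kubectl get pods` table output on standard input and counts pods whose names contain -master-, -worker-, or -fuse-. That pattern is inferred from the sample output in this step and is an assumption about pod naming. The function name `count_cache_pods` is hypothetical.

```shell
#!/bin/sh
# Read `kubectl get pods` table output on stdin and count pods whose names
# contain -master-, -worker-, or -fuse- (naming inferred from the sample output).
count_cache_pods() {
  awk 'NR > 1 && $1 ~ /-(master|worker|fuse)-/ { n++ } END { print n + 0 }'
}

# Usage against a live cluster:
# kubectl get pods -n share | count_cache_pods
# kubectl get pods -n ref | count_cache_pods
```

With the sample output shown in this step, the count is 3 for the share namespace and 0 for the ref namespace.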

Log in to the nginx pod:
```
kubectl exec nginx -n ref -it -- sh
```
Test data sharing by checking the size of the test file in the /data directory:
```
du -sh /data/wwm_uncased_L-24_H-1024_A-16.zip
```
Expected output:
```
1.3G    /data/wwm_uncased_L-24_H-1024_A-16.zip
```
The nginx pod in the ref namespace can access the file stored in the share namespace.

Test data caching by reading the file twice:

The following latency values are for reference only. Actual results vary based on your environment.

```
# First read: data is fetched from OSS and written to the cache
time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null
real    0m1.166s
user    0m0.007s
sys     0m1.154s

# Second read: data is served from the cache
time cat /data/wwm_uncased_L-24_H-1024_A-16.zip > /dev/null
real    0m0.289s
user    0m0.011s
sys     0m0.274s
```

The second read completes in 0.289 seconds, compared with 1.166 seconds for the first read (about four times faster), which indicates that the file is served from the cache after the first access.
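
To compare the two reads in your own environment, the ratio of the two `real` values can be computed with a short script. This is a sketch: the function names `to_seconds` and `speedup` are hypothetical, and the input is assumed to use the `<minutes>m<seconds>s` format shown above.

```shell
#!/bin/sh
# Convert a `time` value such as 0m1.166s to seconds.
to_seconds() {
  echo "$1" | awk '{ split($0, p, "m"); sub(/s$/, "", p[2]); printf "%.3f\n", p[1] * 60 + p[2] }'
}

# Print how many times faster the second read was than the first.
# Arguments: FIRST_READ_TIME SECOND_READ_TIME
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { if (b <= 0) exit 1; printf "%.1fx\n", a / b }'
}

speedup 0m1.166s 0m0.289s   # prints 4.0x
```

Pass the `real` values from your own two runs to see the speedup on your cluster.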