The Fluid dashboards expose observability metrics for the Fluid data acceleration framework running in your ACK cluster. Two dashboards are available:
- Fluid control plane dashboard: monitors the health and performance of Fluid's control plane components (dataset controller, runtime controller, webhook, and CSI plug-in).
- Fluid JindoRuntime cache dashboard: monitors the cache efficiency and resource usage of a specific JindoRuntime cache system.
Use these dashboards to detect component failures, diagnose cache performance issues, and identify optimization opportunities before they affect workloads.
Prerequisites
Before you begin, ensure that you have:
- Managed Service for Prometheus enabled for the Fluid component. For more information, see Step 2: View the Fluid dashboard.
Fluid control plane dashboard
Dashboard variables
Variables control the scope and granularity of data displayed across all panels. Changing a variable updates all related panels simultaneously.
| Variable | Valid values | Description |
|---|---|---|
| interval | 1m, 5m, 10m, 30m, 1h, 6h | The monitoring cycle duration. Shorter intervals show finer-grained trends; longer intervals smooth out spikes. |
| quantile | 0.5, 0.75, 0.90, 0.95, 0.99 | The percentile used by latency and processing-time panels. For example, 0.90 = P90. |
| runtime | JindoRuntime, AlluxioRuntime, JuiceFSRuntime | The runtime type to monitor. Changing this variable filters all runtime-related panels to the selected runtime. |
Runtime types:
- JindoRuntime: the execution engine of JindoFS, developed by the Alibaba Cloud Elastic MapReduce (EMR) team. Built in C++, JindoRuntime provides dataset management, caching, and Object Storage Service (OSS) support.
- AlluxioRuntime: the execution engine of open source Alluxio. Supports dataset management, caching, and accelerated access to persistent volume claims (PVCs), Ceph, and Cloud Parallel File System (CPFS). Suited for hybrid cloud scenarios.
- JuiceFSRuntime: a distributed cache acceleration engine based on JuiceFS. Supports scenario-specific data caching and acceleration. For more information, see Introduction to JuiceFS.
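To make the quantile variable concrete, the following minimal sketch computes P50, P90, and P99 from a handful of raw samples using the nearest-rank method. The function name and the sample values are hypothetical, and the dashboard itself may derive percentiles differently (for example, from histogram buckets), so treat this only as an illustration of what a percentile value means.

```python
import math


def nearest_rank_percentile(samples, quantile):
    """Return the nearest-rank percentile of samples for a quantile in (0, 1]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < quantile <= 1:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value such that at least quantile * N
    # samples are less than or equal to it (1-based rank).
    rank = math.ceil(quantile * len(ordered))
    return ordered[rank - 1]


# Hypothetical processing times in milliseconds, not real Fluid data.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 17, 19]

p50 = nearest_rank_percentile(latencies_ms, 0.5)
p90 = nearest_rank_percentile(latencies_ms, 0.90)
p99 = nearest_rank_percentile(latencies_ms, 0.99)
print(p50, p90, p99)
```

In this sample a single slow request dominates P99 while leaving P50 and P90 almost unchanged, which is why the higher quantile settings are more useful for spotting tail latency.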
Panels
The control plane dashboard is organized into four panel groups. Start with Component running status for a quick health check. If something looks abnormal, drill into Fluid Controller Detailed Indicator or Fluid webhook detailed indicators to pinpoint the cause. Resource usage provides supporting CPU, memory, and network data for all controller pods.
Component running status
This group shows whether each Fluid component is running and how often pods are restarting. Frequent restarts are the first sign of instability.
| Panel | Description |
|---|---|
| Dataset Controller Ready Replicas | Number of dataset controller pods in the Running state. If this drops below the expected replica count, dataset operations may stall. |
| History of Dataset controller restarts | Restart count of dataset controller pods. |
| Runtime Number of ready copies of controller | Number of runtime controller pods in the Running state. |
| History Runtime Controller Restart Times | Restart count of runtime controller pods. |
| Fluid Webhook ready copies | Number of Fluid webhook pods in the Running state. |
| Number of historical fluid Webhook restarts | Restart count of Fluid webhook pods. |
| Fluid CSI Plug-in Ready Copies | Number of Fluid CSI plug-in pods in the Running state. |
| Historical Fluid CSI plug-in restarts | Restart count of Fluid CSI plug-in pods. |
| Fluid Component Restart | The top five Fluid components with the most restarts within the last 2-minute monitoring cycle. Use this panel to quickly identify which component needs attention. |
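The ranking shown by the Fluid Component Restart panel can be pictured as follows. This is a minimal sketch, assuming restart counters are sampled at the start and end of a window; the function name, component names, and counts are hypothetical, and the dashboard's actual query is not shown in this document.

```python
def top_restarts(start_counts, end_counts, limit=5):
    """Rank components by how much their restart counter grew in a window."""
    increases = {}
    for component, end in end_counts.items():
        start = start_counts.get(component, 0)
        # If the counter went backwards (for example, the pod was recreated),
        # assume it restarted from zero and count the end value as the increase.
        increases[component] = end - start if end >= start else end
    # Sort by increase (descending), then by name for a stable order.
    ranked = sorted(increases.items(), key=lambda item: (-item[1], item[0]))
    # Components without restarts are not interesting for this view.
    return [(name, count) for name, count in ranked if count > 0][:limit]


# Hypothetical counter samples at the start and end of a window.
window_start = {"dataset-controller": 3, "runtime-controller": 0, "webhook": 1, "csi-plugin": 2}
window_end = {"dataset-controller": 5, "runtime-controller": 0, "webhook": 4, "csi-plugin": 2}

print(top_restarts(window_start, window_end))
```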
Fluid Controller Detailed Indicator
This group exposes internal performance metrics of the runtime and DataLoad controllers. Use it when you observe slow dataset reconciliation or increasing API server load.
| Panel | Description |
|---|---|
| Runtime Controller processing time | Time the runtime controller spends handling runtime resources within a monitoring cycle, displayed as percentile values. Sustained high values may indicate controller overload. |
| Number of Runtime controller processing failures | Types and counts of failures during runtime resource handling: runtime deployment failures and runtime health check failures. Non-zero values require investigation. |
| Runtime Number of controller threads | Current active threads and maximum supported threads of the runtime controller. If active threads approach the maximum, the controller may become a bottleneck. |
| DataLoad Controller Threads | Current active threads and maximum supported threads of the DataLoad controller. |
| Controller Queue Length | The workqueue length of each Fluid controller. A growing queue indicates the controller is not keeping up with reconciliation demand. |
| Total number of Kubernetes API requests | Total requests sent by all Fluid controller pods to the Kubernetes API server within a monitoring cycle. Sudden spikes may cause API server throttling. |
| Runtime Controller Kubernetes API requests | Requests from the runtime controller to the Kubernetes API server, broken down by HTTP status code. A high proportion of 4xx or 5xx responses points to misconfiguration or permission issues. |
| Total time consumed by unfinished processing of controller | Cumulative time each Fluid controller has spent on in-progress tasks. Persistently high values suggest tasks are getting stuck. |
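When reading the Runtime Controller Kubernetes API requests panel, the useful signal is the share of 4xx and 5xx responses rather than their raw counts. The sketch below shows one way to compute those shares from per-status-code request counts; the function name and the input numbers are hypothetical.

```python
from collections import Counter


def status_class_shares(requests_by_code):
    """Return the share of requests per status class, such as '2xx' or '4xx'."""
    total = sum(requests_by_code.values())
    if total == 0:
        return {}
    per_class = Counter()
    for code, count in requests_by_code.items():
        # Group codes by their leading digit: 404 -> "4xx", 500 -> "5xx".
        per_class[f"{int(code) // 100}xx"] += count
    return {status_class: count / total for status_class, count in per_class.items()}


# Hypothetical request counts within one monitoring cycle.
shares = status_class_shares({"200": 90, "403": 2, "404": 6, "500": 2})
print(shares)
```

A 4xx share that grows over time is more consistent with the misconfiguration or permission issues mentioned above, whereas a growing 5xx share points at the API server side.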
Fluid webhook detailed indicators
This group monitors the Fluid webhook, which intercepts pod creation requests to inject FUSE sidecars. Webhook latency directly affects how long it takes for new pods to start.
| Panel | Description |
|---|---|
| Fluid Webhook Pod CPU Usage | CPU utilization of each Fluid webhook pod within a monitoring cycle. |
| Fluid Webhook Pod Memory Usage | Memory usage of each Fluid webhook pod within a monitoring cycle. |
| Total number of requests processed in Fluid Webhook | Total requests handled by the Fluid webhook within a monitoring cycle. |
| The number of requests processed in each Fluid Webhook Pod | Requests handled by each individual Fluid webhook pod within a monitoring cycle. Use this to spot load imbalance across replicas. |
| Fluid Webhook Request Processing Delay | Overall request processing latency of the Fluid webhook, as a percentile value. High P99 latency slows down pod startup cluster-wide. |
| Request processing delay of each Fluid Webhook Pod | Per-pod request processing latency, as a percentile value. Useful for identifying a single slow pod causing tail latency. |
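Load imbalance across webhook replicas can be summarized as a single number. One simple option, shown here as a sketch with hypothetical pod names and counts, is the ratio of the busiest pod's request count to the mean across pods: a value near 1 means even load, and a value near the replica count means one pod handles almost everything.

```python
def imbalance_ratio(requests_per_pod):
    """Return max/mean of per-pod request counts, or None if there is no traffic."""
    counts = list(requests_per_pod.values())
    if not counts or sum(counts) == 0:
        return None
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical per-pod request counts within one monitoring cycle.
balanced = imbalance_ratio({"webhook-a": 100, "webhook-b": 100, "webhook-c": 100})
skewed = imbalance_ratio({"webhook-a": 300, "webhook-b": 0, "webhook-c": 0})
print(balanced, skewed)
```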
Resource usage
This group provides CPU, memory, and network metrics for all Fluid controller pods. Use it to detect resource pressure that could cause the issues visible in the other panel groups.
| Panel | Description |
|---|---|
| CPU usage | CPU utilization of each Fluid controller pod within a monitoring cycle. |
| Memory usage | Memory usage of each Fluid controller pod within a monitoring cycle. |
| Network Send Rate per Pod | Network transmit rate of each Fluid controller pod within a monitoring cycle. |
| Network Receive Rate per Pod | Network receive rate of each Fluid controller pod within a monitoring cycle. |
Fluid JindoRuntime cache dashboard
Dashboard variables
Select a dataset by namespace and name to scope all panels to that dataset's cache system.
| Variable | Description |
|---|---|
| namespace | The namespace of the target dataset in the cluster. |
| fluid_dataset | The name of the target Fluid dataset in the cluster. |
Panels
The JindoRuntime cache dashboard is organized into three panel groups. Start with Dataset overview to confirm all cache pods are healthy. Then check Cache system metrics for cache efficiency and bandwidth. If you suspect FUSE-level issues, such as high latency reported by applications, use FUSE metrics to isolate the problem. FUSE metrics are split into two sets of panels: one for FUSE pods injected via CSI and one for FUSE sidecar containers.
Dataset overview
| Panel | Description |
|---|---|
| Ready Pod Num | Number of ready pods in each component of the selected cache system, including master, worker, and FUSE components. |
| Pod Overview | Basic information about pods in each component: restart count in the last hour, CPU resource requests and limits, and memory resource requests and limits. |
Cache system metrics
This group covers the core health indicators of the cache: how full it is, how effectively data is being served from cache, and how much bandwidth it provides to applications.
| Panel | Description |
|---|---|
| Cache Capacity Usage (%) | Proportion of cache capacity currently in use. |
| Cache Capacity Usage | Maximum available cache capacity alongside current usage, in absolute values. |
| Cache Hit Ratio Per Minute | Per-minute cache hit rate of the selected cache system. |
| Read Bytes Per Minute | Per-minute data reads, split into cache hits (Cache Hit) and cache misses served from the backend storage (From Backend). A high From Backend share means most reads bypass cache. |
| Cache System Aggregated Bandwidth | Sum of outbound traffic across all worker pod network interfaces, representing the total bandwidth the cache system delivers to applications. Note: If worker pods run on the host network, this value may be inflated. For accurate readings, run worker pods on the container network. |
| Cache Worker Pod Network I/O | Per-worker-pod network I/O. Note: If worker pods run on the host network, this value may be inflated. For accurate readings, run worker pods on the container network. |
| Cache System Pod Memory Usage | Memory usage of master and worker pods. If worker pods use process memory as the cache medium, the cache capacity they consume is included in this figure. |
| Cache System Pod CPU Usage by Cores | CPU usage of master and worker pods. |
| Aggregated File Operation Requests | Request frequency of file metadata operations aggregated across the cache system. Only GetAttr and ReadDir operations are counted. |
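The capacity and hit-rate panels above reduce to two simple ratios. The sketch below illustrates them under the assumption that the hit rate is measured in bytes read (cache hits versus reads served from the backend); the function names and input values are hypothetical, and the dashboard may define the hit rate differently.

```python
def cache_usage_percent(used_bytes, capacity_bytes):
    """Return cache usage as a percentage of capacity, or None if capacity is unknown."""
    if capacity_bytes <= 0:
        return None
    return 100.0 * used_bytes / capacity_bytes


def cache_hit_ratio(hit_bytes, backend_bytes):
    """Return the fraction of bytes served from cache, or None if nothing was read."""
    total = hit_bytes + backend_bytes
    if total == 0:
        return None
    return hit_bytes / total


# Hypothetical one-minute sample: 750 MiB from cache, 250 MiB from backend storage.
MIB = 1024 * 1024
usage = cache_usage_percent(50 * MIB, 200 * MIB)
hit_ratio = cache_hit_ratio(750 * MIB, 250 * MIB)
print(usage, hit_ratio)
```

A low hit ratio combined with low capacity usage usually means the data has not been cached yet, while a low hit ratio at high capacity usage suggests the working set does not fit in the cache.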
FUSE metrics (via CSI)
These panels monitor FUSE pods injected via the CSI Driver. Use them when applications report high file-access latency or slow metadata operations.
| Panel | Description |
|---|---|
| FUSE Network I/O | Per-FUSE-pod network I/O. Note: If a FUSE pod runs on the host network, this value may be inflated. For accurate readings, run FUSE pods on the container network. |
| FUSE Memory Usage/Limit (%) | Percentage of current memory usage relative to the memory limit for each FUSE pod. Empty if no memory limit is set. |
| FUSE CPU Throttled Percent | Percentage of CPU throttling in each FUSE pod. Empty if no CPU limit is set. |
| Meta Ops Per Second | Per-second frequency of file metadata operations (GetAttr, ReadDir, Open) on each FUSE pod. |
| Meta Ops P99 Latency | P99 latency of metadata operations (GetAttr, ReadDir, Open) on each FUSE pod. |
| Read/Write Ops Per Second | Per-second frequency of file read and write operations on each FUSE pod. |
| Read/Write Ops P99 Latency | P99 latency of file read and write operations on each FUSE pod. |
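The two percentage panels in this group are empty when no limit is set because both are ratios against a limit. The following sketch shows that behavior, assuming that CPU throttling is expressed as throttled scheduling periods over total periods; that definition, the function names, and the sample numbers are assumptions for illustration, not the dashboard's documented queries.

```python
def memory_usage_percent(usage_bytes, limit_bytes):
    """Return memory usage relative to the limit, or None when no limit is set."""
    if not limit_bytes:
        # No memory limit: the ratio is undefined, so the panel has no value.
        return None
    return 100.0 * usage_bytes / limit_bytes


def cpu_throttled_percent(throttled_periods, total_periods):
    """Return the share of CPU scheduling periods that were throttled."""
    if not total_periods:
        # Without a CPU limit no enforcement periods are counted.
        return None
    return 100.0 * throttled_periods / total_periods


# Hypothetical readings for one FUSE pod.
GIB = 1024 ** 3
mem_pct = memory_usage_percent(3 * GIB, 4 * GIB)
cpu_pct = cpu_throttled_percent(30, 120)
print(mem_pct, cpu_pct)
```

The same reasoning applies to the sidecar variants of these panels, scoped to the FUSE sidecar container instead of a FUSE pod.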
FUSE metrics (via sidecar)
These panels monitor FUSE sidecar containers injected directly into application pods. The metrics are equivalent to the CSI variants, but scoped to sidecar containers.
| Panel | Description |
|---|---|
| FUSE Memory Usage/Limit (%) | Percentage of current memory usage relative to the memory limit for each FUSE sidecar container. Empty if no memory limit is set. |
| FUSE CPU Throttled Percent | Percentage of CPU throttling in each FUSE sidecar container. Empty if no CPU limit is set. |
| Meta Ops Per Second | Per-second frequency of metadata operations (GetAttr, ReadDir, Open) per FUSE sidecar container. |
| Meta Ops P99 Latency | P99 latency of metadata operations (GetAttr, ReadDir, Open) per FUSE sidecar container. |
| Read/Write Ops Per Second | Per-second frequency of file read and write operations per FUSE sidecar container. |
| Read/Write Ops P99 Latency | P99 latency of file read and write operations per FUSE sidecar container. |