As businesses such as AI, data warehousing, and big data analytics grow, an increasing number of workloads that run on Object Storage Service (OSS) require low data access latency and high throughput. OSS provides the OSS accelerator feature, which allows you to create an accelerator that caches hot objects on high-performance NVMe SSDs to deliver millisecond-level access latency and high throughput.
Benefits
Low latency
The NVMe SSD media of an accelerator provide millisecond-level download latency for your business. The accelerator is well suited to hot data queries in data warehouses and inference model downloads.
High throughput
The accelerator can provide high throughput even for a small amount of hot data and meet burst read requirements for that data.
Increased throughput
The bandwidth of an accelerator increases linearly with its cache capacity and can provide burst throughput of up to hundreds of Gbit/s.
Automatic scaling
In most cases, computing tasks are periodic and their resource requirements vary over time. You can scale the cache capacity of the accelerator up or down based on your requirements without interrupting your business, which reduces resource waste and costs. The cache capacity of an accelerator ranges from 50 GB to 100 TB. The accelerator inherits the massive storage scale of OSS and can directly cache multiple tables or partitions of a data warehouse.
Decoupled storage and computing
Unlike cache that resides on the computing server, the accelerator runs independently of the computing server. You can change its cache capacity and performance online without interrupting your business.
Data consistency
Compared with conventional cache solutions, the OSS accelerator feature ensures data consistency. When you update objects in OSS buckets, the accelerator automatically identifies and caches the latest versions of the objects so that computing engines always read the latest versions.
Multiple warmup policies
The accelerator automatically identifies objects that are updated in OSS to ensure that engines read the latest data. The accelerator provides the following warmup policies:
Warmup during read: If the data that you request does not hit the cache, the accelerator automatically retrieves the data from the bucket in which it is stored and caches it on the accelerator.
Synchronous warmup: When data is written to OSS, the data is synchronously cached on the accelerator.
Asynchronous warmup: You can configure parameters to cache data in OSS to the accelerator in batches.
Note: By default, warmup during read is enabled and cannot be disabled.
If you want to use synchronous warmup or asynchronous warmup, you must manually enable it. Synchronous warmup and asynchronous warmup can be enabled at the same time.
How it works
After an accelerator is created, it has an internal accelerated endpoint that is dedicated to the region. You can access the resources cached in the accelerator only over an internal network. For example, the accelerated endpoint for the China (Beijing) region is http://cn-beijing-internal.oss-data-acc.aliyuncs.com. If your client is located in the same virtual private cloud (VPC) as the accelerator, you can use the accelerated endpoint to access the resources that are cached on the accelerator. The following figure shows how the accelerated endpoint is used to access cached resources.
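For reference, the following is a minimal sketch of pointing an OSS client at the accelerated endpoint by using the Python SDK (oss2). The credentials, bucket name, and object key are placeholders; the endpoint shown is the China (Beijing) accelerated endpoint from the preceding example.

```python
import oss2

# Placeholders: replace with your own credentials and bucket name.
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')

# Use the region's internal accelerated endpoint instead of the default
# internal OSS endpoint. Only the endpoint changes; the API stays the same.
endpoint = 'http://cn-beijing-internal.oss-data-acc.aliyuncs.com'
bucket = oss2.Bucket(auth, endpoint, 'examplebucket')

# Reads and writes sent through this client go through the accelerator.
data = bucket.get_object('exampleobject.txt').read()
print(len(data))
```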
Write requests
Warmup during read: Write requests that are sent from a client to the accelerated endpoint of the accelerator are forwarded to OSS buckets, in the same way as requests that are sent to the default domain names of OSS buckets.
Synchronous warmup: Write requests that are sent from a client to the accelerated endpoint are forwarded to OSS buckets and the OSS accelerator.
Asynchronous warmup: Data that needs to be warmed up is written to the accelerator in advance, before read requests arrive.
Synchronous warmup + asynchronous warmup: Write requests are forwarded to both OSS buckets and the accelerator, and hot data is written to the accelerator in advance, before read requests arrive.
Read requests
Note: Read requests are handled in the same manner regardless of the warmup policy.
Read requests that are sent from a client to the accelerated endpoint are forwarded to the accelerator.
When the accelerator receives the read requests, the accelerator searches for the requested objects in the cache.
If the requested objects are cached on the accelerator, the objects are returned to the client.
If the requested objects are not cached on the accelerator, the accelerator requests the objects from the OSS buckets that are mapped to the accelerator. After OSS receives the requests, OSS caches the requested objects in the accelerator. Then, the accelerator returns the objects to the client.
If the cache capacity of the accelerator is exhausted, the accelerator preferentially retains the cached objects that are accessed most frequently and evicts the others. The sketch after this list shows the resulting cache-miss and cache-hit behavior.
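The warmup-during-read behavior can be observed from the client side. The following sketch, which assumes the Python SDK (oss2) and placeholder credentials, bucket, and object key, reads the same object twice through the accelerated endpoint: the first read typically misses the cache and triggers warmup, and the second read is served from the NVMe SSD cache.

```python
import time
import oss2

# Placeholders: replace with your own credentials, bucket, and object key.
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'http://cn-beijing-internal.oss-data-acc.aliyuncs.com',
                     'examplebucket')

for attempt in (1, 2):
    start = time.perf_counter()
    data = bucket.get_object('warehouse/hot-partition.parquet').read()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Read 1 typically misses the cache: the accelerator fetches the object
    # from the bucket and caches it. Read 2 should hit the NVMe SSD cache.
    print(f'read {attempt}: {len(data)} bytes in {elapsed_ms:.1f} ms')
```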
Scenarios
The OSS accelerator feature is suitable for scenarios in which high bandwidth is required and data is repeatedly read. Examples:
Low-latency data sharing
Background information
When a customer purchases goods from a vending machine, the customer uses a mobile app to scan the goods in the container, take a picture, and upload the picture. After the application backend receives the picture, the picture is stored in OSS and cached on the accelerator. The backend subsystem then performs content moderation and barcode recognition on the picture, and the recognition results are returned to the application backend for fee deduction and other operations. The picture must be downloaded within milliseconds.
Solution
Use the synchronous warmup policy of the accelerator. The accelerator effectively reduces the latency of loading pictures into the analysis system and shortens the transaction link. The OSS accelerator feature suits business that is latency-sensitive and repeatedly reads data.
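As an illustration, the following sketch follows the read-after-write pattern of this scenario with the Python SDK (oss2). It assumes that synchronous warmup is enabled on the accelerator; the credentials, bucket, and object names are placeholders.

```python
import oss2

# Placeholders: replace with your own credentials and bucket name.
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'http://cn-beijing-internal.oss-data-acc.aliyuncs.com',
                     'examplebucket')

# The application backend uploads the picture through the accelerated
# endpoint. With synchronous warmup enabled, the write is stored in the
# bucket and cached on the accelerator at the same time.
with open('snapshot.jpg', 'rb') as f:
    bucket.put_object('vending/snapshot.jpg', f)

# The analysis subsystem can immediately read the picture back from the
# NVMe SSD cache for content moderation and barcode recognition.
picture = bucket.get_object('vending/snapshot.jpg').read()
```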
Model inference
Background information
The inference server needs to pull and load model objects for AI-generated content (AIGC) model inference. During inference and debugging, the inference server also needs to frequently switch between model objects. As the size of the model objects increases, the inference server requires a longer period of time to pull and load them.
Solution
Use the asynchronous warmup policy or the warmup during read policy of the OSS accelerator feature. Asynchronous warmup is suitable when you can determine the list of hot model objects: configure an accelerator of an appropriate cache capacity and use the accelerator SDK to store the objects in the accelerator in advance. Warmup during read is suitable when you cannot determine the list: configure an accelerator whose cache capacity is based on previous experience, and the accelerator automatically caches model objects on its high-performance media the first time they are read, which speeds up subsequent reads. You can scale the cache capacity at any time based on your acceleration requirements. If your inference server needs to access OSS as a local directory, you must deploy ossfs.
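Asynchronous warmup itself is configured through the feature's own parameters. As a portable alternative when you know the hot object list, the following sketch relies on warmup during read: it reads each model object once through the accelerated endpoint so that the object is cached before the inference servers need it. The SDK is the Python SDK (oss2); the credentials, bucket, and model keys are hypothetical.

```python
import oss2

# Placeholders: replace with your own credentials and bucket name.
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'http://cn-beijing-internal.oss-data-acc.aliyuncs.com',
                     'examplebucket')

# Hypothetical list of hot model objects that inference servers will pull.
hot_models = ['models/llm-7b.safetensors', 'models/vae.pt']

for key in hot_models:
    # A read through the accelerated endpoint triggers warmup during read:
    # a cache miss pulls the object from the bucket into the accelerator.
    stream = bucket.get_object(key)
    while stream.read(8 * 1024 * 1024):  # read in 8 MB chunks and discard
        pass
```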
Big data analysis
Background information
A company's business data is partitioned by day and stored in OSS for long periods of time. Analysts use computing engines, such as Hive or Spark, to analyze the data, but the query range is uncertain in advance. The analysts need to reduce the amount of time required for query and analysis.
Solution
Use the warmup during read policy of the OSS accelerator feature. This policy is suitable for offline query scenarios in which a large amount of data is stored, the query range is uncertain, and the data cannot be accurately warmed up in advance. For example, after the data queried by Analyst A is cached in the acceleration cluster, a query by Analyst B that overlaps with the data queried by Analyst A is accelerated.
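For example, you can point a Spark job at the accelerated endpoint so that reads through the hadoop-aliyun OSS connector go through the accelerator. The following PySpark sketch assumes that connector is installed; the property names, credentials, bucket, and table path are assumptions to adapt to your environment.

```python
from pyspark.sql import SparkSession

# Assumed hadoop-aliyun connector properties; adjust to your connector
# (for example, JindoSDK uses different property names).
spark = (
    SparkSession.builder
    .appName('oss-accelerator-query')
    .config('spark.hadoop.fs.oss.endpoint',
            'cn-beijing-internal.oss-data-acc.aliyuncs.com')
    .config('spark.hadoop.fs.oss.accessKeyId', '<access_key_id>')
    .config('spark.hadoop.fs.oss.accessKeySecret', '<access_key_secret>')
    .getOrCreate()
)

# Hypothetical day-partitioned table stored in OSS. The first query warms
# the cache; overlapping queries by other analysts then hit the cache.
df = spark.read.parquet('oss://examplebucket/warehouse/sales/dt=2024-06-01/')
df.groupBy('region').count().show()
```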
Multi-level acceleration
Background information
Client-side caching and server-side acceleration do not conflict with each other. You can combine them to achieve multi-level acceleration based on your business requirements.
Solution
Use the OSS accelerator together with a client-side cache. We recommend that you deploy Alluxio together with your computing clusters. If the data that you want to read misses the Alluxio cache, the data is read from the backend storage. The OSS accelerator uses the warmup during read policy and warms up data the first time it is read. Because the cache capacity of the client host is limited, a time to live (TTL) is configured for each object and directory in Alluxio; when the TTL expires, the cache entry is deleted to save storage space. Data in the OSS accelerator is not immediately deleted, and its cache capacity can hold hundreds of terabytes. When data that misses the Alluxio cache is read again, the data can be loaded directly from the OSS accelerator, which implements two-level acceleration.
Metric description
| Metric | Description |
| --- | --- |
| Cache capacity | 50 GB to 100 TB. If your business requires a greater cache capacity, submit a ticket. |
| Accelerator throughput | The accelerator provides throughput for data cached on the accelerator based on the configured cache capacity: up to 2.4 Gbit/s per TB of cache capacity. This throughput is not limited by the standard throughput provided by OSS. For more information about the standard bandwidth limits of OSS, see Limits and performance metrics. For example, OSS provides a standard bandwidth of 100 Gbit/s in the China (Shenzhen) region. If you enable the OSS accelerator feature and create an accelerator that has a cache capacity of 10 TB, you obtain an additional 24 Gbit/s of low-latency throughput when you use the accelerated endpoint to access data cached on the accelerator. For batch offline computing, use an OSS internal endpoint to take advantage of the 100 Gbit/s of standard throughput for large-scale concurrent reads. For hot data query services, use the accelerated endpoint to obtain the additional 24 Gbit/s of low-latency throughput from the NVMe SSD media. |
| Peak read bandwidth | Formula: MAX[600, 300 × Cache capacity (TB)] MB/s. For example, if an accelerator provides a cache capacity of 2,048 GB (2 TB), the peak read bandwidth is 600 MB/s. |
| Maximum read bandwidth | 40 GB/s. If your business requires a greater read bandwidth, submit a ticket. |
| Minimum latency for reading 128 KB of data in a single request | < 10 ms |
| Scale-up or scale-down interval | Once per hour |
| Scale-up or scale-down method | Manual scale-up or scale-down in the OSS console |
| Cache deletion policy | The cache is deleted based on the least recently used (LRU) algorithm: frequently accessed data is retained, and data that is not accessed for a long period of time is preferentially deleted, which uses the cache capacity efficiently. |
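The peak read bandwidth formula from the preceding table can be checked with a few lines of Python (a sketch; the function name is illustrative):

```python
def peak_read_bandwidth_mbps(cache_capacity_tb: float) -> float:
    """Peak read bandwidth in MB/s: MAX[600, 300 x cache capacity (TB)]."""
    return max(600.0, 300.0 * cache_capacity_tb)

print(peak_read_bandwidth_mbps(2))   # 600.0  -> the 600 MB/s floor applies
print(peak_read_bandwidth_mbps(10))  # 3000.0 -> 3,000 MB/s, about 24 Gbit/s
```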
Billing rules
The OSS accelerator feature is in public preview. During the public preview, up to 100 GB of cache capacity is provided free of charge. After the public preview ends, you are charged for the actual cache capacity of the accelerator based on the pay-as-you-go billing method.
When you use the accelerated endpoint of the accelerator to read data from and write data to OSS, you are charged OSS API operation calling fees even if origin fetch requests are not sent.
For more information about how to query OSS billing data generated on an hourly basis, see Query OSS billing data generated on an hourly basis and Query bills.
What to do next
For more information about how to create an accelerator and modify the cache capacity of an accelerator, see Create an accelerator.
For more information about how to configure and use the OSS accelerator feature together with OSS tools and OSS SDKs, see Use OSS accelerator.
For more information about the differences in performance when you access resources by using an OSS internal endpoint and an accelerator in specific business scenarios, see Performance metrics.