How Alibaba Cloud's new Table Bucket completes the multimodal storage trifecta for the AI Agent era
The AI landscape has fundamentally shifted. We are no longer building models that consume neat, tabular datasets. The AI Agent era demands a data infrastructure that can simultaneously handle high-resolution images, streaming video, unstructured documents, dense vector embeddings, and the structured metadata that ties it all together. For years, enterprises have been forced to stitch together disparate storage systems — object stores for blobs, vector databases for embeddings, and data warehouses for structured analytics — creating silos, escalating costs, and operational nightmares.
What if a single storage platform could natively unify all three?
Alibaba Cloud Object Storage Service (OSS) has officially launched Table Bucket, completing a product family that now includes Object Bucket, Vector Bucket, and Table Bucket. This is not merely a new feature; it is the architectural completion of an AI-native multimodal data foundation. OSS Tables brings native Apache Iceberg table semantics directly into object storage, delivering serverless, zero-ops structured data management at a scale and cost that traditional data lakes cannot match.
This deep dive explores what OSS Tables is, why it matters, how it works under the hood, and how it integrates with the broader Alibaba Cloud ecosystem to solve real-world AI data challenges.
OSS Tables is a structured data lake storage service built on top of Alibaba Cloud OSS. It provides out-of-the-box, low-cost, high-performance structured data storage specifically designed for two dominant workloads:
At its core, OSS Tables natively integrates the Apache Iceberg open table format. This means you get full ACID transactions, schema evolution, time travel, and partition evolution — all managed by the storage layer itself, not by a separate metastore or compute cluster.
| Capability | Self-Managed Iceberg on OSS | OSS Tables (Table Bucket) |
|---|---|---|
| Metadata Management | Manual Hive/Glue/REST Catalog setup | Native, fully managed REST Catalog |
| Small File Compaction | User-managed Spark/Flink jobs | Automatic, storage-layer compaction |
| Orphan File Cleanup | Custom scripts, risk of data loss | Built-in, safe unreferenced file removal |
| Snapshot Expiration | Manual maintenance | Automatic lifecycle policies |
| Concurrent Write TPS | Baseline | 10x+ improvement (500-table benchmark) |
| Compute Engine Compatibility | Spark, Flink, Trino, etc. | Same — via standard Iceberg REST Catalog API |
| Migration Effort | N/A | Zero code changes for existing Iceberg users |
| Billing Model | Compute + Storage + Ops overhead | Pure storage-tier pricing, serverless |
The value proposition is clear: all the power of Apache Iceberg with none of the operational burden.
OSS Tables introduces a clean, hierarchical data organization model that mirrors familiar database concepts while leveraging cloud-native resource isolation:
Table Bucket (Resource Container)
├── Namespace "ml_training" (Schema)
│ ├── Table "image_labels_v2"
│ ├── Table "experiment_runs"
│ └── Table "annotation_queue"
├── Namespace "analytics"
│ ├── Table "user_events"
│ └── Table "revenue_daily"
└── Namespace "iot_telemetry"
└── Table "sensor_readings"
The top-level resource container. Think of it as the equivalent of an S3 bucket, but purpose-built for structured tables. Each Table Bucket:
acs:osstables:{region}:{uid}:bucket/{bucket-name}
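As a minimal sketch, the resource name format above can be built and parsed with a few lines of Python. The class and function names here are hypothetical helpers for illustration, not part of any Alibaba Cloud SDK, and the regular expression assumes that none of the components contain a colon.

```python
import re
from typing import NamedTuple

# Pattern derived from the format shown above:
#   acs:osstables:{region}:{uid}:bucket/{bucket-name}
# Assumption: region, uid, and bucket name never contain ':' or '/'.
_ARN_RE = re.compile(
    r"^acs:osstables:(?P<region>[^:/]+):(?P<uid>[^:/]+):bucket/(?P<bucket>[^:/]+)$"
)


class TableBucketArn(NamedTuple):
    """Hypothetical value type for a Table Bucket resource name."""
    region: str
    uid: str
    bucket: str

    def __str__(self) -> str:
        return f"acs:osstables:{self.region}:{self.uid}:bucket/{self.bucket}"


def parse_table_bucket_arn(arn: str) -> TableBucketArn:
    """Split a Table Bucket resource name into its three components."""
    match = _ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"not a Table Bucket resource name: {arn!r}")
    return TableBucketArn(match["region"], match["uid"], match["bucket"])


arn = parse_table_bucket_arn(
    "acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket"
)
print(arn.region, arn.uid, arn.bucket)
```

Parsing the name once and passing the structured value around avoids string slicing at every call site that needs the region or bucket name.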
A logical isolation layer analogous to a database schema. Each Table Bucket supports up to 10,000 namespaces, enabling fine-grained organizational boundaries for multi-team, multi-project environments. Namespaces enforce permission isolation and simplify access control.
The actual data table, stored in Apache Iceberg format. Each table:
long, string, timestamptz, boolean, int, float, double, decimal(P,S), date, time, timestamp, uuid, binary, fixed[L]
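A schema can be checked against this type list before a table is created. The sketch below is an illustrative client-side check, not an OSS Tables API. The function name is hypothetical, and the precision bound of 38 for decimal comes from the Apache Iceberg specification rather than from the list above.

```python
import re

# Primitive type names copied from the list above.
PRIMITIVE_TYPES = {
    "long", "string", "timestamptz", "boolean", "int", "float",
    "double", "date", "time", "timestamp", "uuid", "binary",
}

# Parameterized types: decimal(P,S) and fixed[L].
_DECIMAL_RE = re.compile(r"^decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)$")
_FIXED_RE = re.compile(r"^fixed\[\s*(\d+)\s*\]$")


def is_supported_type(type_name: str) -> bool:
    """Return True if type_name matches one of the listed column types."""
    name = type_name.strip().lower()
    if name in PRIMITIVE_TYPES:
        return True
    decimal = _DECIMAL_RE.match(name)
    if decimal:
        precision = int(decimal[1])
        # The Iceberg spec limits decimal precision to 38 or less.
        return 1 <= precision <= 38
    fixed = _FIXED_RE.match(name)
    if fixed:
        # A fixed-length binary column needs a positive length.
        return int(fixed[1]) > 0
    return False


for candidate in ["timestamptz", "decimal(10,2)", "fixed[16]", "varchar"]:
    print(candidate, is_supported_type(candidate))
```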
The single biggest pain point in self-managed data lakes is maintenance. Small files accumulate from streaming writes, snapshots pile up from frequent commits, and orphaned files leak storage costs. Teams routinely dedicate 20-30% of their data engineering capacity to writing and scheduling maintenance jobs.
OSS Tables eliminates this entirely through three built-in, storage-layer mechanisms:
Streaming and micro-batch writes inevitably produce small Parquet files that degrade read performance. OSS Tables continuously monitors file sizes and automatically merges them into optimally-sized files — without consuming your compute resources or requiring Spark compaction jobs.
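To make the idea concrete, here is a toy sketch of how small files might be grouped into merge batches. It illustrates the general compaction technique only; the source does not describe the algorithm OSS Tables uses, and the 128 MB target and 32 MB "small file" threshold are assumed values chosen for the example.

```python
from typing import List


def plan_compaction(
    file_sizes_mb: List[int],
    target_mb: int = 128,          # assumed target file size
    small_threshold_mb: int = 32,  # assumed cutoff for a "small" file
) -> List[List[int]]:
    """Group small files into merge batches of at most target_mb each.

    Uses first-fit-decreasing bin packing: largest small files are placed
    first, each into the first batch that still has room.
    """
    small = sorted(
        (size for size in file_sizes_mb if size < small_threshold_mb),
        reverse=True,
    )
    batches: List[List[int]] = []
    for size in small:
        for batch in batches:
            if sum(batch) + size <= target_mb:
                batch.append(size)
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([size])
    # Rewriting a single file on its own gains nothing, so drop such batches.
    return [batch for batch in batches if len(batch) > 1]


print(plan_compaction([4, 8, 200, 16, 120, 2]))
```

Files already at or above the threshold are left alone, which keeps the rewrite cost proportional to the amount of fragmented data rather than to the table size.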
Every Iceberg commit creates a snapshot. Over time, accumulated snapshots consume metadata storage and slow down catalog operations. OSS Tables applies configurable retention policies (default: nonCurrentDays=10) to automatically expire old snapshots.
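The retention rule can be sketched as follows. This is one plausible reading of nonCurrentDays, assumed for illustration: a snapshot is expired when it is not the current snapshot and was committed more than the configured number of days ago. The exact semantics in OSS Tables may differ, and the Snapshot type is a hypothetical stand-in for Iceberg snapshot metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for an Iceberg snapshot entry."""
    snapshot_id: int
    committed_at: datetime


def snapshots_to_expire(
    snapshots: List[Snapshot],
    now: datetime,
    non_current_days: int = 10,  # default quoted in the text above
) -> List[Snapshot]:
    """Return non-current snapshots older than the retention window."""
    if not snapshots:
        return []
    # The most recent commit is the current snapshot and is never expired.
    current = max(snapshots, key=lambda s: s.committed_at)
    cutoff = now - timedelta(days=non_current_days)
    return [
        s for s in snapshots
        if s.snapshot_id != current.snapshot_id and s.committed_at < cutoff
    ]


now = datetime(2025, 1, 31)
history = [
    Snapshot(1, datetime(2025, 1, 1)),
    Snapshot(2, datetime(2025, 1, 25)),
    Snapshot(3, datetime(2025, 1, 30)),
]
print([s.snapshot_id for s in snapshots_to_expire(history, now)])
```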
When snapshots expire, the data files they referenced may still exist in storage. OSS Tables safely identifies and removes files no longer referenced by any active snapshot (default: unreferencedDays=3). This is the safety net that prevents storage cost creep.
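A simplified sketch of unreferenced-file detection follows. It assumes that unreferencedDays acts as a grace period measured from a file's last-modified time, which would protect files written by commits still in flight. That interpretation, the function name, and the data shapes are assumptions for illustration and not the documented behavior of OSS Tables.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set


def find_unreferenced_files(
    stored_files: Dict[str, datetime],   # path -> last-modified time
    live_snapshots: Iterable[Set[str]],  # files referenced by each active snapshot
    now: datetime,
    unreferenced_days: int = 3,          # default quoted in the text above
) -> List[str]:
    """Return stored files that no active snapshot references and that are
    older than the grace period."""
    referenced: Set[str] = set().union(*live_snapshots)
    cutoff = now - timedelta(days=unreferenced_days)
    return sorted(
        path
        for path, modified_at in stored_files.items()
        if path not in referenced and modified_at < cutoff
    )


now = datetime(2025, 1, 31)
stored = {
    "data/a.parquet": datetime(2025, 1, 1),   # still referenced
    "data/b.parquet": datetime(2025, 1, 1),   # old and unreferenced
    "data/c.parquet": datetime(2025, 1, 30),  # unreferenced but too recent
}
print(find_unreferenced_files(stored, [{"data/a.parquet"}], now))
```

The recent file is kept even though nothing references it yet, because a writer may still be about to commit it.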
These mechanisms run transparently at the storage layer. Your compute clusters focus exclusively on queries and transformations, not janitorial work.
Vendor lock-in is the elephant in the room for any cloud-native data platform. OSS Tables addresses this head-on through uncompromising adherence to open standards:
OSS Tables exposes a fully compliant Iceberg REST Catalog endpoint. Any tool that speaks Iceberg REST — which now includes virtually every major engine — connects without modification:
spark.sql.catalog.oss_tables.type=rest
Your existing Iceberg jobs migrate by changing a single catalog URL. No code refactoring. No proprietary SDK dependencies.
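The migration step can be sketched as a small change to a Spark configuration dictionary. The helper name and the "old" catalog values below are hypothetical; the OSS Tables endpoint and warehouse values are the ones used in the Spark example in this article. Authentication settings are not covered by the source and are left out.

```python
from typing import Dict


def point_catalog_at(
    conf: Dict[str, str], catalog: str, uri: str, warehouse: str
) -> Dict[str, str]:
    """Return a copy of conf with the named Iceberg REST catalog repointed.

    Only the catalog type, URI, and warehouse change; every other setting
    is carried over untouched.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    migrated = dict(conf)
    migrated[f"{prefix}.type"] = "rest"
    migrated[f"{prefix}.uri"] = uri
    migrated[f"{prefix}.warehouse"] = warehouse
    return migrated


# Hypothetical existing configuration for a self-managed REST catalog.
existing = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.type": "rest",
    "spark.sql.catalog.lake.uri": "https://old-catalog.example.com",
    "spark.sql.catalog.lake.warehouse": "s3://old-warehouse/",
    "spark.sql.shuffle.partitions": "200",
}

migrated = point_catalog_at(
    existing,
    catalog="lake",
    uri="https://oss-tables-cn-hangzhou.aliyuncs.com",
    warehouse="acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket",
)
for key in sorted(migrated):
    print(key, "=", migrated[key])
```

Because the catalog name stays the same, SQL that refers to tables as lake.namespace.table would not need to change under this setup.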
Data is stored as standard Parquet files within the Iceberg metadata structure. If you ever need to move, you can export the raw files and metadata to any S3-compatible storage. The format is yours, not Alibaba Cloud's.
Alibaba Cloud has announced plans to introduce Lance, a next-generation columnar format optimized for multimodal AI workloads (random access, vector search, zero-copy reads). This will complement Iceberg for scenarios where analytical SQL meets embedding retrieval, further solidifying OSS as a truly open, multi-format data platform.
Table Bucket does not exist in isolation. Its true power emerges when combined with OSS's existing bucket types:
| Bucket Type | Data Modality | Primary Use Case |
|---|---|---|
| Object Bucket | Unstructured (images, video, documents) | Raw asset storage, training data corpus |
| Vector Bucket | Embeddings & similarity indices | RAG retrieval, semantic search, recommendation |
| Table Bucket | Structured (metadata, labels, events) | Annotations, experiment tracking, OLAP analytics |
Consider an autonomous driving team training a perception model:
Run SELECT * FROM annotations WHERE weather='rain' AND scene='highway' to curate training subsets.
All three buckets share:
This is the multimodal data foundation that AI Agents actually need — not three separate products bolted together, but one coherent platform.
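The curation query from the autonomous driving example can be tried locally. The sketch below uses an in-memory SQLite database purely as a stand-in for a Table Bucket table. The object_key column and the sample rows are invented for illustration; only the weather and scene columns and the query itself come from the text above.

```python
import sqlite3

# In-memory SQLite stands in for the annotations table in a Table Bucket.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations (object_key TEXT, weather TEXT, scene TEXT)"
)
# Hypothetical rows: each one labels a frame stored in an Object Bucket.
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [
        ("frames/0001.jpg", "rain", "highway"),
        ("frames/0002.jpg", "clear", "highway"),
        ("frames/0003.jpg", "rain", "urban"),
        ("frames/0004.jpg", "rain", "highway"),
    ],
)

rows = conn.execute(
    "SELECT * FROM annotations WHERE weather='rain' AND scene='highway'"
).fetchall()

# The selected keys identify which raw frames to load for training.
training_keys = sorted(row[0] for row in rows)
print(training_keys)
conn.close()
```

The structured table acts as the index over the unstructured data: the query returns object keys, and the training job then reads only those objects.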
Batch-only data lakes are increasingly insufficient. Modern AI pipelines require real-time streaming ingestion with strong consistency guarantees.
Alibaba Cloud Message Queue for Kafka has been deeply integrated with OSS Table Bucket:
This reduces real-time data lake architecture from a five-component pipeline (Kafka → Flink → HDFS/S3 → Compaction → Catalog) to a two-component flow (Kafka → OSS Tables), slashing both complexity and operational cost.
Early benchmarks from the preview release demonstrate significant advantages over self-managed alternatives:
Concurrent Write Throughput: In a 500-table concurrent write scenario, OSS Tables achieves 10x+ higher TPS compared to a typical self-managed Iceberg deployment on comparable infrastructure. This is attributed to storage-layer optimizations in metadata management and write path parallelization.
Query Performance: Reads benefit from automatic compaction (no small-file penalty) and storage-tier caching. Performance is competitive with dedicated warehouse solutions for scan-heavy OLAP workloads.
Maintenance Overhead: Effectively zero. Teams report eliminating 4-8 hours/week of maintenance job monitoring and troubleshooting.
Note: Production performance may vary based on workload characteristics, region, and configuration.
# Create a namespace
ossutil tables-api create-namespace \
  --table-bucket-arn acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket \
  --namespace ml_training

# Create a table
ossutil tables-api create-table \
  --table-bucket-arn acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket \
  --namespace ml_training \
  --table image_labels \
  --schema '{"fields": [...]}'
from pyspark.sql import SparkSession

# Register OSS Tables as an Iceberg REST catalog named "oss_tables"
spark = SparkSession.builder \
    .config("spark.sql.catalog.oss_tables", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.oss_tables.type", "rest") \
    .config("spark.sql.catalog.oss_tables.uri", "https://oss-tables-cn-hangzhou.aliyuncs.com") \
    .config("spark.sql.catalog.oss_tables.warehouse", "acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket") \
    .getOrCreate()

# Query directly, addressing the table as catalog.namespace.table
df = spark.sql("SELECT * FROM oss_tables.ml_training.image_labels WHERE label = 'cat' LIMIT 100")
df.show()
| Role | Why OSS Tables Matters |
|---|---|
| Data Engineers | Eliminate maintenance jobs. Focus on pipelines, not plumbing. |
| ML Engineers | Versioned, queryable training metadata. Reproducible experiments. |
| Solution Architects | Unified multimodal storage. Simplified security and governance. |
| CTOs / VP Engineering | 10x write throughput. Zero vendor lock-in. Predictable serverless billing. |
| Platform Teams | Multi-tenant namespace isolation. Centralized audit and compliance. |
OSS Tables represents a strategic bet: that the future of data infrastructure is multimodal, open, and serverless. The roadmap includes:
For enterprises building AI-native applications today, OSS Tables offers a compelling proposition: stop managing data lake infrastructure and start extracting value from your data. The storage layer should be invisible, intelligent, and infinitely scalable. With Table Bucket, Alibaba Cloud OSS takes a decisive step toward making that vision a reality.
Disclaimer: Pricing and feature availability are subject to change upon general availability. Always consult official Alibaba Cloud documentation for the latest information.
More Posts by Justin See