OSS Tables: The Missing Piece in the AI Data Lakehouse Infrastructure

OSS Tables brings native Apache Iceberg to object storage, unifying structured, vector, and unstructured data in one serverless, zero-ops AI data platform.

How Alibaba Cloud's new Table Bucket completes the multimodal storage trifecta for the AI Agent era


Introduction: The Multimodal Data Challenge

The AI landscape has fundamentally shifted. We are no longer building models that consume neat, tabular datasets. The AI Agent era demands a data infrastructure that can simultaneously handle high-resolution images, streaming video, unstructured documents, dense vector embeddings, and the structured metadata that ties it all together. For years, enterprises have been forced to stitch together disparate storage systems (object stores for blobs, vector databases for embeddings, and data warehouses for structured analytics), creating silos, escalating costs, and operational nightmares.

What if a single storage platform could natively unify all three?

Alibaba Cloud Object Storage Service (OSS) has officially launched Table Bucket, completing a product family that now includes Object Bucket, Vector Bucket, and Table Bucket. This is not merely a new feature; it is the architectural completion of an AI-native multimodal data foundation. OSS Tables brings native Apache Iceberg table semantics directly into object storage, delivering serverless, zero-ops structured data management at a scale and cost that traditional data lakes cannot match.

This deep dive explores what OSS Tables is, why it matters, how it works under the hood, and how it integrates with the broader Alibaba Cloud ecosystem to solve real-world AI data challenges.


What Is OSS Tables?

OSS Tables is a structured data lake storage service built on top of Alibaba Cloud OSS. It provides out-of-the-box, low-cost, high-performance structured data storage specifically designed for two dominant workloads:

  1. AI Training Data Management: Versioned, queryable metadata for massive training datasets (labels, annotations, experiment tracking).
  2. OLAP Analytics: High-throughput analytical queries over petabyte-scale structured data without provisioning dedicated warehouse infrastructure.

At its core, OSS Tables natively integrates the Apache Iceberg open table format. This means you get full ACID transactions, schema evolution, time travel, and partition evolution — all managed by the storage layer itself, not by a separate metastore or compute cluster.
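
To make time travel concrete, here is a minimal, engine-independent sketch of how an "as of" query can be resolved against a snapshot log. The `Snapshot` class and `snapshot_as_of` helper are illustrative names made up for this example; they are not part of Iceberg or OSS Tables, and real engines resolve this from the table's metadata files.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int  # commit time in epoch milliseconds


def snapshot_as_of(log, ts_ms):
    """Return the snapshot that was current at ts_ms, or None if none existed yet.

    `log` must be sorted by committed_at_ms, oldest first.
    """
    commit_times = [s.committed_at_ms for s in log]
    idx = bisect_right(commit_times, ts_ms)
    return log[idx - 1] if idx > 0 else None


# Hypothetical snapshot log with three commits.
log = [Snapshot(1, 1_000), Snapshot(2, 2_000), Snapshot(3, 3_000)]
as_of = snapshot_as_of(log, 2_500)
```

A query pinned to timestamp 2,500 reads snapshot 2, even though snapshot 3 was committed later; readers never see a half-written state because each commit swaps in a complete snapshot.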

Key Differentiators at a Glance

| Capability | Self-Managed Iceberg on OSS | OSS Tables (Table Bucket) |
|---|---|---|
| Metadata Management | Manual Hive/Glue/REST Catalog setup | Native, fully managed REST Catalog |
| Small File Compaction | User-managed Spark/Flink jobs | Automatic, storage-layer compaction |
| Orphan File Cleanup | Custom scripts, risk of data loss | Built-in, safe unreferenced file removal |
| Snapshot Expiration | Manual maintenance | Automatic lifecycle policies |
| Concurrent Write TPS | Baseline | 10x+ improvement (500-table benchmark) |
| Compute Engine Compatibility | Spark, Flink, Trino, etc. | Same, via standard Iceberg REST Catalog API |
| Migration Effort | N/A | Zero code changes for existing Iceberg users |
| Billing Model | Compute + Storage + Ops overhead | Pure storage-tier pricing, serverless |

The value proposition is clear: all the power of Apache Iceberg with none of the operational burden.


Architecture: The Three-Layer Hierarchy

OSS Tables introduces a clean, hierarchical data organization model that mirrors familiar database concepts while leveraging cloud-native resource isolation:

Table Bucket (Resource Container)
├── Namespace "ml_training" (Schema)
│   ├── Table "image_labels_v2"
│   ├── Table "experiment_runs"
│   └── Table "annotation_queue"
├── Namespace "analytics"
│   ├── Table "user_events"
│   └── Table "revenue_daily"
└── Namespace "iot_telemetry"
    └── Table "sensor_readings"

Table Bucket

The top-level resource container. Think of it as the equivalent of an S3 bucket, but purpose-built for structured tables. Each Table Bucket:

  • Has a unique ARN: acs:osstables:{region}:{uid}:bucket/{bucket-name}
  • Counts against a quota of 10 Table Buckets per account per region
  • Defaults to local redundancy with optional AES256 server-side encryption
  • Automatically enables data maintenance policies upon creation
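
The ARN template above is easy to build and validate in code. The following is a small sketch that follows the `acs:osstables:{region}:{uid}:bucket/{bucket-name}` pattern from this post; the helper names are hypothetical, and the regular expression only checks the overall shape, since the exact character rules for bucket names are not stated here.

```python
import re

# Shape check only: the allowed characters per segment are an assumption.
ARN_PATTERN = re.compile(
    r"^acs:osstables:(?P<region>[^:]+):(?P<uid>[^:]+):bucket/(?P<bucket>[^/]+)$"
)


def build_table_bucket_arn(region, uid, bucket_name):
    """Format a Table Bucket ARN from its three components."""
    return f"acs:osstables:{region}:{uid}:bucket/{bucket_name}"


def parse_table_bucket_arn(arn):
    """Split a Table Bucket ARN into region, uid, and bucket name."""
    match = ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"not a Table Bucket ARN: {arn!r}")
    return match.groupdict()


arn = build_table_bucket_arn("cn-hangzhou", "1234567890", "my-table-bucket")
parts = parse_table_bucket_arn(arn)
```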

Namespace

A logical isolation layer analogous to a database schema. Each Table Bucket supports up to 10,000 namespaces, enabling fine-grained organizational boundaries for multi-team, multi-project environments. Namespaces enforce permission isolation and simplify access control.

Table

The actual data table, stored in Apache Iceberg format. Each table:

  • Has its own ARN for granular IAM policy attachment
  • Requires a defined schema at creation time
  • Supports rich data types: long, string, timestamptz, boolean, int, float, double, decimal(P,S), date, time, timestamp, uuid, binary, fixed[L]
  • Participates in automatic maintenance (compaction, snapshot cleanup, orphan removal)
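
Because a schema is required at creation time, it can help to check field types before calling the service. The sketch below validates type names against the list above; `is_supported_type` and `unsupported_fields` are hypothetical helpers, the field names are invented, and the precision bound of 38 for decimals comes from the Iceberg specification rather than from this post.

```python
import re

# Primitive type names as listed in this post.
PRIMITIVES = {
    "long", "string", "timestamptz", "boolean", "int", "float",
    "double", "date", "time", "timestamp", "uuid", "binary",
}
DECIMAL = re.compile(r"^decimal\((\d+),\s*(\d+)\)$")
FIXED = re.compile(r"^fixed\[(\d+)\]$")


def is_supported_type(type_name):
    """Return True if type_name matches one of the listed types."""
    if type_name in PRIMITIVES:
        return True
    decimal = DECIMAL.match(type_name)
    if decimal:
        precision = int(decimal.group(1))
        return 0 < precision <= 38  # Iceberg caps decimal precision at 38
    return FIXED.match(type_name) is not None


def unsupported_fields(schema):
    """Return the names of fields whose type is not in the supported list."""
    return [name for name, type_name in schema if not is_supported_type(type_name)]


# Hypothetical schema for an image-label table.
schema = [
    ("image_id", "uuid"),
    ("label", "string"),
    ("score", "decimal(10,4)"),
    ("checksum", "fixed[16]"),
    ("comment", "varchar(20)"),  # not an Iceberg type
]
bad = unsupported_fields(schema)
```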

Zero-Ops Data Maintenance: The Serverless Promise

The single biggest pain point in self-managed data lakes is maintenance. Small files accumulate from streaming writes, snapshots pile up from frequent commits, and orphaned files leak storage costs. Teams routinely dedicate 20-30% of their data engineering capacity to writing and scheduling maintenance jobs.

OSS Tables eliminates this entirely through three built-in, storage-layer mechanisms:

1. Automatic Small File Compaction

Streaming and micro-batch writes inevitably produce small Parquet files that degrade read performance. OSS Tables continuously monitors file sizes and automatically merges them into optimally-sized files — without consuming your compute resources or requiring Spark compaction jobs.
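
To show what a compaction planner has to decide, here is a simplified bin-packing sketch. It is not the algorithm OSS Tables uses, which this post does not describe; the function name, the 128 MB target, and the 32 MB "small file" threshold are all assumptions chosen for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_threshold_mb=32):
    """Group small files into merge batches of at most target_mb each.

    Files at or above small_threshold_mb are left untouched.
    """
    small = sorted(size for size in file_sizes_mb if size < small_threshold_mb)
    batches, current, current_size = [], [], 0
    for size in small:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    # Rewriting a single file on its own gains nothing, so drop such batches.
    return [batch for batch in batches if len(batch) > 1]


# Ten data files: two are already large, eight are small streaming outputs.
plan = plan_compaction([5, 10, 200, 8, 120, 30, 30, 30, 30, 30])
```

In this example eight small files collapse into two rewritten files, while the 120 MB and 200 MB files are skipped.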

2. Snapshot Lifecycle Management

Every Iceberg commit creates a snapshot. Over time, accumulated snapshots consume metadata storage and slow down catalog operations. OSS Tables applies configurable retention policies (default: nonCurrentDays=10) to automatically expire old snapshots.

3. Unreferenced File Removal

When snapshots expire, the data files they referenced may still sit in storage. OSS Tables safely identifies and removes files no longer referenced by any active snapshot (default: unreferencedDays=3). This is the safety net that prevents storage cost creep.
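
The two retention defaults quoted above (nonCurrentDays=10 and unreferencedDays=3) interact, so a small simulation may help. This is a sketch under stated assumptions: the class and function names are hypothetical, the file names are invented, and the exact boundary semantics (whether day 10 or day 3 itself counts) are a guess, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SnapshotInfo:
    snapshot_id: int
    non_current_days: Optional[int]  # None means this is the current snapshot
    files: frozenset


def expire_snapshots(snapshots, non_current_days=10):
    """Keep the current snapshot and those that became non-current recently."""
    return [
        s for s in snapshots
        if s.non_current_days is None or s.non_current_days < non_current_days
    ]


def removable_files(stored_files, live_snapshots, unreferenced_days=3):
    """stored_files maps file name -> days since it was last referenced."""
    referenced = set().union(*(s.files for s in live_snapshots))
    return sorted(
        name for name, days in stored_files.items()
        if name not in referenced and days >= unreferenced_days
    )


snapshots = [
    SnapshotInfo(1, 15, frozenset({"a.parquet"})),
    SnapshotInfo(2, 4, frozenset({"b.parquet"})),
    SnapshotInfo(3, None, frozenset({"b.parquet", "c.parquet"})),
]
stored = {"a.parquet": 5, "b.parquet": 0, "c.parquet": 0, "tmp.parquet": 1}

live = expire_snapshots(snapshots)
to_delete = removable_files(stored, live)
```

Snapshot 1 is expired, which leaves `a.parquet` unreferenced long enough to delete. `tmp.parquet` is also unreferenced but survives for now; the grace period is what protects files from in-flight writes that have not committed yet.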

These mechanisms run transparently at the storage layer. Your compute clusters focus exclusively on queries and transformations, not janitorial work.


Open Standards, Zero Lock-In

Vendor lock-in is the elephant in the room for any cloud-native data platform. OSS Tables addresses this head-on through uncompromising adherence to open standards:

Apache Iceberg REST Catalog API Compatibility

OSS Tables exposes a fully compliant Iceberg REST Catalog endpoint. Any tool that speaks Iceberg REST — which now includes virtually every major engine — connects without modification:

  • Apache Spark: Read/write via spark.sql.catalog.oss_tables.type=rest
  • Apache Flink: Streaming sink/source with exactly-once semantics
  • Trino / Presto: Interactive ad-hoc queries
  • dbt: Transformation pipelines
  • Python (PyIceberg): Programmatic table management

Your existing Iceberg jobs migrate by changing a single catalog URL. No code refactoring. No proprietary SDK dependencies.
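
One way to see how small the migration surface is: generate the Spark catalog properties for both setups and diff them. The property keys are the standard Iceberg Spark catalog settings; `rest_catalog_conf` is a hypothetical helper, the self-managed endpoint and warehouse are placeholders, and the OSS Tables values are copied from the Spark example in this post. Authentication settings are left out because this post does not specify them.

```python
def rest_catalog_conf(catalog_name, uri, warehouse):
    """Build Spark properties for an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


# Placeholder values standing in for an existing self-managed REST catalog.
self_managed = rest_catalog_conf(
    "lake", "https://iceberg-catalog.example.internal", "s3://example-warehouse/"
)
# Values taken from the Spark connection example in this post.
managed = rest_catalog_conf(
    "lake",
    "https://oss-tables-cn-hangzhou.aliyuncs.com",
    "acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket",
)

changed = sorted(key for key in managed if managed[key] != self_managed[key])
```

Going by that example, the warehouse identifier changes along with the URI, so in practice expect to touch two properties rather than one. Job code that refers to tables as `lake.namespace.table` stays the same.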

Open File Format

Data is stored as standard Parquet files within the Iceberg metadata structure. If you ever need to move, you can export the raw files and metadata to any S3-compatible storage. The format is yours, not Alibaba Cloud's.

Future-Proof: Lance Format on the Roadmap

Alibaba Cloud has announced plans to introduce Lance, a next-generation columnar format optimized for multimodal AI workloads (random access, vector search, zero-copy reads). This will complement Iceberg for scenarios where analytical SQL meets embedding retrieval, further solidifying OSS as a truly open, multi-format data platform.


The Three-Bucket Synergy: Object + Vector + Table

Table Bucket does not exist in isolation. Its true power emerges when combined with OSS's existing bucket types:

| Bucket Type | Data Modality | Primary Use Case |
|---|---|---|
| Object Bucket | Unstructured (images, video, documents) | Raw asset storage, training data corpus |
| Vector Bucket | Embeddings & similarity indices | RAG retrieval, semantic search, recommendation |
| Table Bucket | Structured (metadata, labels, events) | Annotations, experiment tracking, OLAP analytics |

Real-World Example: AI Training Pipeline

Consider an autonomous driving team training a perception model:

  1. Object Bucket: Stores 50PB of raw camera footage and LiDAR point clouds.
  2. Table Bucket: Manages frame-level annotations (bounding boxes, lane markings), dataset versioning, and experiment metadata. Engineers query SELECT * FROM annotations WHERE weather='rain' AND scene='highway' to curate training subsets.
  3. Vector Bucket: Holds CLIP embeddings of each frame for semantic deduplication and hard-negative mining during training.

All three buckets share:

  • Unified IAM: One set of RAM policies across all modalities
  • Consolidated Billing: Single invoice, cross-bucket cost visibility
  • Integrated Audit: Unified access logging and compliance reporting
  • Zero Data Movement: Metadata references objects by URI; no copying required
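
The "metadata references objects by URI" idea can be sketched with nothing but the standard library. Here `sqlite3` stands in for an Iceberg table queried through Spark or Trino; the column names and `oss://` paths are invented for illustration, and only the WHERE clause comes from the example above.

```python
import sqlite3

# In-memory stand-in for the annotations table in a Table Bucket.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations (frame_uri TEXT, weather TEXT, scene TEXT, label TEXT)"
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?, ?)",
    [
        ("oss://example-raw-footage/drive_001/frame_0001.jpg", "rain", "highway", "car"),
        ("oss://example-raw-footage/drive_001/frame_0002.jpg", "clear", "highway", "truck"),
        ("oss://example-raw-footage/drive_002/frame_0001.jpg", "rain", "urban", "pedestrian"),
    ],
)

# Curate a training subset; only URIs come back, the frames themselves stay put.
rows = conn.execute(
    "SELECT frame_uri FROM annotations WHERE weather='rain' AND scene='highway'"
).fetchall()
training_uris = [uri for (uri,) in rows]
conn.close()
```

The query returns pointers into the Object Bucket, so a training job can stream exactly the selected frames without copying the corpus.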

This is the multimodal data foundation that AI Agents actually need — not three separate products bolted together, but one coherent platform.


Real-Time Data Ingestion: Kafka Integration

Batch-only data lakes are increasingly insufficient. Modern AI pipelines require real-time streaming ingestion with strong consistency guarantees.

Alibaba Cloud Message Queue for Kafka has been deeply integrated with OSS Table Bucket:

  • Direct Kafka-to-Table Sink: Data flows from Kafka topics directly into Iceberg tables without intermediate compute layers (no Flink job required for simple ingestion).
  • Exactly-Once Semantics: Guaranteed zero data loss and zero duplication, critical for financial, IoT, and compliance-sensitive workloads.
  • Automatic Partition Management: Kafka partitions map cleanly to Iceberg partitions, optimizing downstream query performance.
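
This post does not explain how the Kafka sink achieves exactly-once delivery. A common general technique is to record consumed offsets in the same atomic commit as the data, so that a replayed batch can be detected and skipped. The sketch below illustrates that idea only; `IdempotentSink` is a hypothetical class and is not the connector's actual implementation.

```python
class IdempotentSink:
    """Toy sink that stores rows together with per-partition offsets."""

    def __init__(self):
        self.rows = []
        self.committed_offsets = {}  # Kafka partition -> next expected offset

    def commit(self, partition, start_offset, records):
        """Append records not yet committed; return how many were written."""
        expected = self.committed_offsets.get(partition, 0)
        end_offset = start_offset + len(records)
        if end_offset <= expected:
            return 0  # full replay of an already committed batch
        if start_offset > expected:
            raise ValueError(f"gap in partition {partition}: expected {expected}")
        new_records = records[expected - start_offset:]
        # In a real table, rows and offsets would land in one atomic commit.
        self.rows.extend(new_records)
        self.committed_offsets[partition] = end_offset
        return len(new_records)


sink = IdempotentSink()
first = sink.commit(0, 0, ["a", "b", "c"])
replay = sink.commit(0, 0, ["a", "b", "c"])  # retry after a timeout
overlap = sink.commit(0, 2, ["c", "d"])      # partially overlapping batch
```

The retried batch writes nothing and the overlapping batch writes only the unseen record, so the table ends up with each record exactly once.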

This reduces real-time data lake architecture from a five-component pipeline (Kafka → Flink → HDFS/S3 → Compaction → Catalog) to a two-component flow (Kafka → OSS Tables), slashing both complexity and operational cost.


Performance Benchmarks

Early benchmarks from the preview release demonstrate significant advantages over self-managed alternatives:

Concurrent Write Throughput: In a 500-table concurrent write scenario, OSS Tables achieves 10x+ higher TPS compared to a typical self-managed Iceberg deployment on comparable infrastructure. This is attributed to storage-layer optimizations in metadata management and write path parallelization.

Query Performance: Reads benefit from automatic compaction (no small-file penalty) and storage-tier caching. Performance is competitive with dedicated warehouse solutions for scan-heavy OLAP workloads.

Maintenance Overhead: Effectively zero. Teams report eliminating 4-8 hours/week of maintenance job monitoring and troubleshooting.

Note: Production performance may vary based on workload characteristics, region, and configuration.


CLI Example (ossutil)

# Create a namespace
ossutil tables-api create-namespace \
  --table-bucket-arn acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket \
  --namespace ml_training

# Create a table
ossutil tables-api create-table \
  --table-bucket-arn acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket \
  --namespace ml_training \
  --table image_labels \
  --schema '{"fields": [...]}'

Spark Connection Example

from pyspark.sql import SparkSession

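# Requires the Iceberg Spark runtime JAR on the classpath (it provides SparkCatalog).
# Authentication settings are omitted here; consult the official documentation.
spark = SparkSession.builder \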
    .config("spark.sql.catalog.oss_tables", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.oss_tables.type", "rest") \
    .config("spark.sql.catalog.oss_tables.uri", "https://oss-tables-cn-hangzhou.aliyuncs.com") \
    .config("spark.sql.catalog.oss_tables.warehouse", "acs:osstables:cn-hangzhou:1234567890:bucket/my-table-bucket") \
    .getOrCreate()

# Query directly
df = spark.sql("SELECT * FROM oss_tables.ml_training.image_labels WHERE label = 'cat' LIMIT 100")
df.show()

Who Should Care?

| Role | Why OSS Tables Matters |
|---|---|
| Data Engineers | Eliminate maintenance jobs. Focus on pipelines, not plumbing. |
| ML Engineers | Versioned, queryable training metadata. Reproducible experiments. |
| Solution Architects | Unified multimodal storage. Simplified security and governance. |
| CTOs / VP Engineering | 10x write throughput. Zero vendor lock-in. Predictable serverless billing. |
| Platform Teams | Multi-tenant namespace isolation. Centralized audit and compliance. |

Looking Ahead

OSS Tables represents a strategic bet: that the future of data infrastructure is multimodal, open, and serverless. The roadmap includes:

  • Lance Format Support: Native multimodal table storage with hybrid vector + columnar retrieval

For enterprises building AI-native applications today, OSS Tables offers a compelling proposition: stop managing data lake infrastructure and start extracting value from your data. The storage layer should be invisible, intelligent, and infinitely scalable. With Table Bucket, Alibaba Cloud OSS takes a decisive step toward making that vision a reality.


Disclaimer: Pricing and feature availability are subject to change upon general availability. Always consult official Alibaba Cloud documentation for the latest information.


Justin See
