Object Storage Service: Overview of data lake

Last Updated: Mar 20, 2026

A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale, in its raw format. Unlike traditional data warehouses, which require data to be structured before ingestion, a data lake lets you store data first and apply a schema on read, so you can use any analytics engine (big data processing frameworks, real-time analytics tools, or machine learning) without transforming data upfront.
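
Schema-on-read means the raw bytes land in the lake unchanged, and structure is applied only when a consumer reads them. The sketch below illustrates the idea in plain Python; the field names and records are made up for the example:

```python
import json

# Raw events land in the lake as-is: newline-delimited JSON with no
# enforced schema (fields may vary from record to record).
raw_records = [
    '{"device_id": "d-001", "temp_c": 21.5, "ts": "2026-03-01T08:00:00Z"}',
    '{"device_id": "d-002", "ts": "2026-03-01T08:00:05Z"}',  # temp_c missing
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only known fields, fill defaults."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field, default in schema.items()}

# Each consumer picks its own schema; the stored data never changes.
schema = {"device_id": None, "temp_c": 0.0}
rows = list(read_with_schema(raw_records, schema))
print(rows[1])  # missing temp_c filled with the reader's default
```

A second consumer could read the same raw records with a different schema (for example, keeping only `ts`), which is exactly what "store first, structure later" buys you.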

Why OSS for data lake storage

The fundamental advantage of building a data lake on object storage is the decoupling of compute from storage. In traditional on-premises Hadoop or data warehouse setups, compute and storage are tightly coupled, so neither can be scaled independently and costs are hard to optimize. With OSS, storage scales to exabytes without capacity provisioning, while compute clusters (MaxCompute, EMR, PAI, and others) scale independently based on workload demand.

OSS provides the following core advantages for data lake workloads:

  • Low-cost storage: Pay-as-you-go pricing with five tiered storage classes (Standard, Infrequent Access, Archive, Cold Archive, and Deep Cold Archive). Lifecycle rules automatically transition data to lower-cost classes as access frequency drops.

  • Elastic scalability: Exabyte-scale storage with no capacity provisioning required.

  • Ecosystem integration: Native integration with Alibaba Cloud compute services (MaxCompute, EMR, PAI) and open source analytics frameworks (Hadoop, Spark, Ray, PyTorch).

  • Security and compliance: Server-side encryption and granular access control to meet enterprise security requirements.

  • High availability: Cross-zone redundant storage and cross-region replication for data durability.
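
As a toy illustration of matching storage class to access frequency, the selector below uses OSS's five storage class names, but the day thresholds are entirely hypothetical; in practice, tiering is driven by lifecycle rules and OSS's minimum storage durations, not application code:

```python
# Illustrative only: the tier names are OSS's five storage classes, but
# these day thresholds are hypothetical, not OSS billing rules.
def pick_storage_class(days_since_last_access: int) -> str:
    if days_since_last_access < 30:
        return "Standard"
    if days_since_last_access < 90:
        return "IA"              # Infrequent Access
    if days_since_last_access < 180:
        return "Archive"
    if days_since_last_access < 365:
        return "ColdArchive"
    return "DeepColdArchive"

print(pick_storage_class(45))  # IA
```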

Architecture overview

Data lake architecture

The architecture covers the end-to-end flow from data collection to application:

  • Data sources: Ingests data in multiple formats (Parquet, CSV, JSON, multimedia files, and database and application data) from public cloud, Apsara Stack, hybrid cloud, and edge devices.

  • Storage: Stores data for big data and AI services in BucketGroups, using Object Storage Service (OSS) as the data lake storage solution. Resource pool Quality of Service (QoS) controls BucketGroup bandwidth to ensure efficient data access.

  • Access interfaces: Exposes data through SDKs, a POSIX file system, and a Hadoop Distributed File System (HDFS) compatible layer to support diverse compute frameworks.

  • Analytics and AI: Supports complex data exploration, machine learning model training, and real-time stream computing. Visualization tools help present insights.

Key considerations when building a data lake

Data collection and import

OSS supports ingesting data from any source, at any scale, in raw format—eliminating the need to define schemas or transformations upfront. Four import methods are available:

  • Internal network: For ECS instances or other Alibaba Cloud services in the same region.

  • Express Connect: For data center to OSS transfers over a private dedicated connection.

  • Data Online Migration / Data Transport: For petabyte-scale migrations from on-premises environments or other clouds.

  • Internet: For direct uploads when no private connectivity exists.
Important

Internet uploads introduce security risks. Before enabling internet access, configure custom domain binding, Block Public Access, and Referer-based access control.
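
Referer-based access control, mentioned above, is configured through a small XML document (the PutBucketReferer API). The sketch below builds that payload with the standard library; the allowed domain is a placeholder:

```python
import xml.etree.ElementTree as ET

# Build a PutBucketReferer payload: reject requests with an empty Referer
# header and allow only requests from one domain (placeholder value).
root = ET.Element("RefererConfiguration")
ET.SubElement(root, "AllowEmptyReferer").text = "false"
referer_list = ET.SubElement(root, "RefererList")
ET.SubElement(referer_list, "Referer").text = "https://www.example.com"

payload = ET.tostring(root, encoding="unicode")
print(payload)
```

The payload is then sent with a PutBucketReferer request (or set through the console or an SDK); building it by hand here is only to make the structure visible.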

Secure and cost-efficient storage

Data lakes accumulate data from mobile applications, IoT devices, social media, and the Internet of vehicles—most of which is accessed infrequently over time. OSS addresses this with three complementary features:

  • Five storage classes (Standard through Deep Cold Archive) to match storage cost with actual access frequency.

  • Lifecycle rules to automatically transition data to lower-cost classes as it ages.

  • Versioning to prevent accidental deletion.
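
Lifecycle rules are expressed as an XML document (the PutBucketLifecycle API). The sketch below assembles a rule, using only the standard library, that would move objects under a hypothetical logs/ prefix to Infrequent Access after 30 days and to Archive after 180 days; the rule ID, prefix, and day counts are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

def add_transition(rule, days, storage_class):
    """Append a <Transition> element: after `days`, move to `storage_class`."""
    t = ET.SubElement(rule, "Transition")
    ET.SubElement(t, "Days").text = str(days)
    ET.SubElement(t, "StorageClass").text = storage_class

root = ET.Element("LifecycleConfiguration")
rule = ET.SubElement(root, "Rule")
ET.SubElement(rule, "ID").text = "tier-down-logs"  # hypothetical rule name
ET.SubElement(rule, "Prefix").text = "logs/"       # hypothetical prefix
ET.SubElement(rule, "Status").text = "Enabled"
add_transition(rule, 30, "IA")        # Standard -> Infrequent Access
add_transition(rule, 180, "Archive")  # Infrequent Access -> Archive

lifecycle_xml = ET.tostring(root, encoding="unicode")
print(lifecycle_xml)
```

In practice you would submit this configuration through the console or an SDK rather than posting raw XML, but the structure is the same either way.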

Manage data across teams

In a data lake, multiple business teams often share a single bucket—storing data under different prefixes—or maintain separate buckets that need to exchange data. OSS provides three features to handle this:

  • Access points: Configure per-team data access permissions within a shared bucket.

  • Bucket inventory: Monitor storage usage by prefix or tag to track each team's footprint.

  • Data replication: Automatically synchronize data between buckets, within a region or across regions.
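
A bucket inventory report is, at its core, a listing of object keys and sizes, so tracking each team's footprint reduces to grouping by prefix. A toy sketch under that assumption (real inventory files carry more columns, such as storage class and last-modified time, and the keys here are invented):

```python
import csv
import io
from collections import defaultdict

# Simplified inventory: object key, size in bytes. The first path segment
# is assumed to be the team prefix (e.g. "team-a/").
inventory_csv = """\
team-a/raw/events-0001.parquet,52428800
team-a/raw/events-0002.parquet,73400320
team-b/models/ckpt-12.bin,1073741824
"""

usage = defaultdict(int)
for key, size in csv.reader(io.StringIO(inventory_csv)):
    team_prefix = key.split("/", 1)[0]  # first path segment = team
    usage[team_prefix] += int(size)

print(dict(usage))  # bytes per team prefix
```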

Performance management for concurrent workloads

A production data lake runs concurrent workloads—data collection, preprocessing, AI training, and debugging—that compete for the same storage bandwidth. OSS provides two features to manage this:

  • Resource pool QoS: Dynamically adjust bandwidth throttling per bucket or requester. Prioritize business-critical or compute-intensive jobs during peak periods.

  • OSS accelerator: Caches hot objects on high-performance NVMe SSDs to reduce read latency and increase queries per second (QPS). Particularly effective for high-QPS data warehouse queries, low-latency online business data, and repeated model pulls in AI inference.
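
The accelerator's benefit comes from the classic caching pattern: repeated reads of hot objects are served from fast local storage instead of going back to the bucket. The sketch below is a conceptual LRU illustration of that pattern, not how the OSS accelerator is actually implemented:

```python
from collections import OrderedDict

class HotObjectCache:
    """Toy LRU cache: serve repeated reads of hot objects locally and fall
    back to the bucket on a miss. Conceptual sketch only."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch          # fallback: read from the bucket
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)    # mark as recently used
            return self.cache[key]
        self.misses += 1
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

# Repeated model pulls: the first read misses, later reads hit the cache.
cache = HotObjectCache(capacity=2, fetch=lambda k: f"bytes-of-{k}")
for _ in range(3):
    cache.get("models/ckpt.bin")
print(cache.hits, cache.misses)  # 2 1
```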

Choose an access interface

Different compute frameworks access data differently. Choose the interface that matches your existing stack:

  • OSS SDK: Custom applications that require high-performance, programmatic access. Supports mainstream languages. See multi-threaded bandwidth optimization for throughput tuning.

  • OSS connector for Hadoop: Hadoop ecosystem workloads (MapReduce, Hive, Spark) already running on cloud object storage. Retains OSS's enterprise data management features.

  • OSS-HDFS service: Migrating on-premises HDFS workloads to the cloud without modifying existing applications. Provides HDFS-compatible interfaces with stronger performance and elastic scalability than traditional HDFS, and integrates with EMR, Hadoop, and Spark. Note that some OSS native features are unavailable through HDFS interfaces; after migration, gradually adapt workloads to the OSS connector for Hadoop to take full advantage of OSS capabilities. For details, see Features of the OSS-HDFS service.

  • ossfs 2.0: Modern applications (AI training, AI inference, autonomous driving simulation) with loose POSIX semantics requirements. Start here, and fall back to ossfs 1.0 if compatibility issues arise.

  • ossfs 1.0: Legacy applications that require POSIX file system access and cannot be modified. Not a replacement for Alibaba Cloud File Storage NAS when high POSIX compatibility is required.

  • OSS Connector for AI/ML: PyTorch dataset workflows. Delivers optimal OSS dataset read performance without requiring OSS SDK knowledge.

  • ossutil 2.0: Administrators and developers who need command-line data management.

  • ossbrowser 2.0: Administrators and developers who prefer a graphical interface for data management.