Object Storage Service: Overview of data lake

Last Updated: Mar 20, 2026

A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale, in its raw format. Unlike traditional data warehouses, which require data to be structured before ingestion, a data lake lets you store data first and apply a schema on read, so you can use any analytics engine (big data processing frameworks, real-time analytics tools, or machine learning) without transforming data upfront.
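
Schema-on-read means the raw bytes land in the lake unchanged, and structure is applied only when a consumer reads them. The sketch below illustrates the idea in plain Python; the field names and records are made up for the example:

```python
import json

# Raw events land in the lake as-is: newline-delimited JSON with no
# enforced schema (fields may vary from record to record).
raw_records = [
    '{"device_id": "d-001", "temp_c": 21.5, "ts": "2026-03-01T08:00:00Z"}',
    '{"device_id": "d-002", "ts": "2026-03-01T08:00:05Z"}',  # temp_c missing
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only known fields, fill defaults."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field, default in schema.items()}

# Each consumer picks its own schema; the stored data never changes.
schema = {"device_id": None, "temp_c": 0.0}
rows = list(read_with_schema(raw_records, schema))
print(rows[1])  # missing temp_c filled with the reader's default
```

A second consumer could read the same raw records with a different schema (for example, keeping only `ts`), which is exactly what "store first, structure later" buys you.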

Why OSS for data lake storage

The fundamental advantage of building a data lake on object storage is the decoupling of compute from storage. In traditional on-premises Hadoop or data warehouse setups, compute and storage are tightly coupled, so neither can be scaled independently and costs are hard to optimize. With OSS, storage scales to exabytes without capacity provisioning, while compute clusters (MaxCompute, EMR, PAI, and others) scale independently based on workload demand.

OSS provides the following core advantages for data lake workloads:

  • Low-cost storage: Pay-as-you-go pricing with five tiered storage classes (Standard, Infrequent Access, Archive, Cold Archive, and Deep Cold Archive). Lifecycle rules automatically transition data to lower-cost classes as access frequency drops.

  • Elastic scalability: Exabyte-scale storage with no capacity provisioning required.

  • Ecosystem integration: Native integration with Alibaba Cloud compute services (MaxCompute, EMR, PAI) and open source analytics frameworks (Hadoop, Spark, Ray, PyTorch).

  • Security and compliance: Server-side encryption and granular access control to meet enterprise security requirements.

  • High availability: Cross-zone redundant storage and cross-region replication for data durability.
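
As a toy illustration of matching storage class to access frequency, the selector below uses OSS's five storage class names, but the day thresholds are entirely hypothetical; in practice, tiering is driven by lifecycle rules and OSS's minimum storage durations, not application code:

```python
# Illustrative only: the tier names are OSS's five storage classes, but
# these day thresholds are hypothetical, not OSS billing rules.
def pick_storage_class(days_since_last_access: int) -> str:
    if days_since_last_access < 30:
        return "Standard"
    if days_since_last_access < 90:
        return "IA"              # Infrequent Access
    if days_since_last_access < 180:
        return "Archive"
    if days_since_last_access < 365:
        return "ColdArchive"
    return "DeepColdArchive"

print(pick_storage_class(45))  # IA
```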

Architecture overview

Data lake architecture

The architecture covers the end-to-end flow from data collection to application:

  • Data sources: Ingests data in multiple formats (Parquet, CSV, JSON, multimedia files, and database and application data) from public cloud, Apsara Stack, hybrid cloud, and edge devices.

  • Storage: Stores data for big data and AI services in BucketGroups, using Object Storage Service (OSS) as the data lake storage solution. Resource pool Quality of Service (QoS) controls BucketGroup bandwidth to ensure efficient data access.

  • Access interfaces: Exposes data through SDKs, a POSIX file system, and a Hadoop Distributed File System (HDFS) compatible layer to support diverse compute frameworks.

  • Analytics and AI: Supports complex data exploration, machine learning model training, and real-time stream computing. Visualization tools help present insights.

Key considerations when building a data lake

Data collection and import

OSS supports ingesting data from any source, at any scale, in raw format—eliminating the need to define schemas or transformations upfront. Four import methods are available:

  • Internal network: For ECS instances or other Alibaba Cloud services in the same region.

  • Express Connect: For data center to OSS transfers over a private dedicated connection.

  • Data Online Migration / Data Transport: For petabyte-scale migrations from on-premises environments or other clouds.

  • Internet: For direct uploads when no private connectivity exists.
Important

Internet uploads introduce security risks. Before enabling internet access, configure custom domain binding, Block Public Access, and Referer-based access control.
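
Referer-based access control, mentioned above, is configured through a small XML document (the PutBucketReferer API). The sketch below builds that payload with the standard library; the allowed domain is a placeholder:

```python
import xml.etree.ElementTree as ET

# Build a PutBucketReferer payload: reject requests with an empty Referer
# header and allow only requests from one domain (placeholder value).
root = ET.Element("RefererConfiguration")
ET.SubElement(root, "AllowEmptyReferer").text = "false"
referer_list = ET.SubElement(root, "RefererList")
ET.SubElement(referer_list, "Referer").text = "https://www.example.com"

payload = ET.tostring(root, encoding="unicode")
print(payload)
```

The payload is then sent with a PutBucketReferer request (or set through the console or an SDK); building it by hand here is only to make the structure visible.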

Secure and cost-efficient storage

Data lakes accumulate data from mobile applications, IoT devices, social media, and the Internet of vehicles—most of which is accessed infrequently over time. OSS addresses this with three complementary features:

  • Five storage classes (Standard through Deep Cold Archive) to match storage cost with actual access frequency.

  • Lifecycle rules to automatically transition data to lower-cost classes as it ages.

  • Versioning to prevent accidental deletion.
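
Lifecycle rules are expressed as an XML document (the PutBucketLifecycle API). The sketch below assembles a rule, using only the standard library, that would move objects under a hypothetical logs/ prefix to Infrequent Access after 30 days and to Archive after 180 days; the rule ID, prefix, and day counts are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

def add_transition(rule, days, storage_class):
    """Append a <Transition> element: after `days`, move to `storage_class`."""
    t = ET.SubElement(rule, "Transition")
    ET.SubElement(t, "Days").text = str(days)
    ET.SubElement(t, "StorageClass").text = storage_class

root = ET.Element("LifecycleConfiguration")
rule = ET.SubElement(root, "Rule")
ET.SubElement(rule, "ID").text = "tier-down-logs"  # hypothetical rule name
ET.SubElement(rule, "Prefix").text = "logs/"       # hypothetical prefix
ET.SubElement(rule, "Status").text = "Enabled"
add_transition(rule, 30, "IA")        # Standard -> Infrequent Access
add_transition(rule, 180, "Archive")  # Infrequent Access -> Archive

lifecycle_xml = ET.tostring(root, encoding="unicode")
print(lifecycle_xml)
```

In practice you would submit this configuration through the console or an SDK rather than posting raw XML, but the structure is the same either way.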

Manage data across teams

In a data lake, multiple business teams often share a single bucket—storing data under different prefixes—or maintain separate buckets that need to exchange data. OSS provides three features to handle this:

  • Access points: Configure per-team data access permissions within a shared bucket.

  • Bucket inventory: Monitor storage usage by prefix or tag to track each team's footprint.

  • Data replication: Automatically synchronize data between buckets, within a region or across regions.
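
A bucket inventory report is, at its core, a listing of object keys and sizes, so tracking each team's footprint reduces to grouping by prefix. A toy sketch under that assumption (real inventory files carry more columns, such as storage class and last-modified time, and the keys here are invented):

```python
import csv
import io
from collections import defaultdict

# Simplified inventory: object key, size in bytes. The first path segment
# is assumed to be the team prefix (e.g. "team-a/").
inventory_csv = """\
team-a/raw/events-0001.parquet,52428800
team-a/raw/events-0002.parquet,73400320
team-b/models/ckpt-12.bin,1073741824
"""

usage = defaultdict(int)
for key, size in csv.reader(io.StringIO(inventory_csv)):
    team_prefix = key.split("/", 1)[0]  # first path segment = team
    usage[team_prefix] += int(size)

print(dict(usage))  # bytes per team prefix
```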

Performance management for concurrent workloads

A production data lake runs concurrent workloads—data collection, preprocessing, AI training, and debugging—that compete for the same storage bandwidth. OSS provides two features to manage this:

  • Resource pool QoS: Dynamically adjust bandwidth throttling per bucket or requester. Prioritize business-critical or compute-intensive jobs during peak periods.

  • OSS accelerator: Caches hot objects on high-performance NVMe SSDs to reduce read latency and increase queries per second (QPS). Particularly effective for high-QPS data warehouse queries, low-latency online business data, and repeated model pulls in AI inference.
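
The accelerator's benefit comes from the classic caching pattern: repeated reads of hot objects are served from fast local storage instead of going back to the bucket. The sketch below is a conceptual LRU illustration of that pattern, not how the OSS accelerator is actually implemented:

```python
from collections import OrderedDict

class HotObjectCache:
    """Toy LRU cache: serve repeated reads of hot objects locally and fall
    back to the bucket on a miss. Conceptual sketch only."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch          # fallback: read from the bucket
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)    # mark as recently used
            return self.cache[key]
        self.misses += 1
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

# Repeated model pulls: the first read misses, later reads hit the cache.
cache = HotObjectCache(capacity=2, fetch=lambda k: f"bytes-of-{k}")
for _ in range(3):
    cache.get("models/ckpt.bin")
print(cache.hits, cache.misses)  # 2 1
```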

Choose an access interface

Different compute frameworks access data differently. Choose the interface that matches your existing stack:

  • OSS SDK: Custom applications that require high-performance, programmatic access. Supports mainstream languages. See multi-threaded bandwidth optimization for throughput tuning.

  • OSS connector for Hadoop: Hadoop ecosystem workloads (MapReduce, Hive, Spark) already running on cloud object storage. Retains OSS's enterprise data management features.

  • OSS-HDFS service: Migrating on-premises HDFS workloads to the cloud without modifying existing applications. Provides HDFS-compatible interfaces with stronger performance and elastic scalability than traditional HDFS, and integrates with EMR, Hadoop, and Spark. Note that some OSS native features are unavailable through HDFS interfaces; after migration, gradually adapt workloads to the OSS connector for Hadoop to take full advantage of OSS capabilities. For details, see Features of the OSS-HDFS service.

  • ossfs 2.0: Modern applications (AI training, AI inference, autonomous driving simulation) with loose POSIX semantics requirements. Start here, and fall back to ossfs 1.0 if compatibility issues arise.

  • ossfs 1.0: Legacy applications that require POSIX file system access and cannot be modified. Not a replacement for Alibaba Cloud File Storage NAS when high POSIX compatibility is required.

  • OSS Connector for AI/ML: PyTorch dataset workflows. Delivers optimal OSS dataset read performance without requiring OSS SDK knowledge.

  • ossutil 2.0: Administrators and developers who need command-line data management.

  • ossbrowser 2.0: Administrators and developers who prefer a graphical interface for data management.