What is Data Lake Formation?

Last Updated: Mar 26, 2026

Data Lake Formation (DLF) helps you govern, secure, and optimize your data lakehouse on Alibaba Cloud. DLF is a unified, fully managed platform for metadata management, storage, and access control. It integrates with Alibaba Cloud big data analytics engines and AI products, so you can build cloud-native data lakes and OpenLake solutions without managing separate metadata or permission systems.

What is a data lake?

A data lake is a centralized repository that stores structured and unstructured data at any scale. Unlike a data warehouse, a data lake preserves data in its raw or native format, making it accessible to a wide range of analytics and machine learning workloads.

A data lakehouse extends this model by layering warehouse-style metadata management and fine-grained access control on top of a data lake's flexible storage.

How it works

DLF acts as the control plane for your data lakehouse:

  1. Register your metadata. Connect your storage to DLF and manage metadatabases from the console or API.

  2. Set permissions once. Define access control at the Catalog, database, or table level. All integrated compute engines enforce the same rules automatically.

  3. Keep storage healthy. Schedule optimization tasks (file compaction, expired snapshot cleanup, expired partition cleanup, and orphaned file cleanup) to reduce costs and maintain query performance.
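
As a rough illustration of the file compaction task in step 3, the sketch below groups small files into merge batches. It is not DLF's algorithm or API: `DataFile`, `plan_compaction`, and the size thresholds are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    size_mb: int


def plan_compaction(files, small_file_mb=32, target_mb=128):
    """Group small files into batches of at most target_mb each.

    Files at or above small_file_mb are left untouched. Both thresholds
    are illustrative assumptions, not DLF defaults.
    """
    # Largest small files first, so each batch fills up quickly.
    small = sorted(
        (f for f in files if f.size_mb < small_file_mb),
        key=lambda f: f.size_mb,
        reverse=True,
    )
    batches, current, current_size = [], [], 0
    for f in small:
        # Close the current batch when adding this file would overshoot.
        if current and current_size + f.size_mb > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_mb
    if current:
        batches.append(current)
    # Rewriting a single file on its own gains nothing, so drop such batches.
    return [b for b in batches if len(b) > 1]
```

Each returned batch would be rewritten as one larger file, which is what reduces the number of files a scan has to open.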

Features

Unified metadata and storage

DLF provides a single set of lakehouse metadata and storage shared across all integrated compute engines. Data flows between products without manual synchronization or schema duplication.

Unified permission management

Define access control once at the Catalog, database, or table level. Every integrated service enforces the same permissions, eliminating separate permission configurations per engine.
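
To make the "define once, enforce everywhere" idea concrete, here is a minimal sketch of a shared policy that several engines consult. It is a toy model, not the DLF API: `AccessPolicy`, `engine_query`, the `SELECT` action name, and the engine names are hypothetical. It also assumes that a grant on a Catalog or database covers the tables beneath it, which the text above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # Maps a resource path tuple to {principal: set of actions}.
    # ("sales",) is a catalog, ("sales", "orders") a database,
    # ("sales", "orders", "items") a table.
    grants: dict = field(default_factory=dict)

    def grant(self, principal, action, *path):
        if not 1 <= len(path) <= 3:
            raise ValueError("path must be catalog[, database[, table]]")
        self.grants.setdefault(path, {}).setdefault(principal, set()).add(action)

    def is_allowed(self, principal, action, catalog, database, table):
        full = (catalog, database, table)
        # Assumption: check the table, then its database, then its catalog.
        for depth in (3, 2, 1):
            if action in self.grants.get(full[:depth], {}).get(principal, set()):
                return True
        return False


def engine_query(policy, engine, principal, catalog, database, table):
    """Every engine consults the same policy object; none keeps its own rules."""
    if not policy.is_allowed(principal, "SELECT", catalog, database, table):
        raise PermissionError(
            f"{engine}: {principal} cannot read {catalog}.{database}.{table}"
        )
    return f"{engine} read {catalog}.{database}.{table}"
```

Because both hypothetical engines call `is_allowed` on the same object, changing a grant in one place changes the outcome for every engine, which is the property the feature describes.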

Storage optimization

DLF automates lakehouse table maintenance through configurable strategies:

| Strategy | Effect |
| --- | --- |
| File compaction | Merges small files to improve scan performance |
| Expired snapshot cleanup | Removes outdated snapshots to reclaim storage |
| Expired partition cleanup | Deletes data from expired partitions |
| Orphaned file cleanup | Removes files no longer referenced by any table version |
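
The relationship between snapshot cleanup and orphaned file cleanup can be sketched as follows. This is a simplified model under stated assumptions, not DLF's implementation: `Snapshot`, `expire_snapshots`, `find_orphans`, the retention window, and the `keep_min` safeguard are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int   # seconds since epoch
    files: frozenset    # data files this table version references


def expire_snapshots(snapshots, now, max_age_s, keep_min=1):
    """Return (retained, expired); the newest keep_min snapshots always survive."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at, reverse=True)
    retained, expired = [], []
    for i, snap in enumerate(ordered):
        if i < keep_min or now - snap.committed_at <= max_age_s:
            retained.append(snap)
        else:
            expired.append(snap)
    return retained, expired


def find_orphans(stored_files, retained):
    """Files present in storage that no retained snapshot references."""
    referenced = set()
    for snap in retained:
        referenced |= snap.files
    return set(stored_files) - referenced
```

In this model, a file becomes a cleanup candidate either because every snapshot that referenced it has expired or because no snapshot ever referenced it, for example a leftover from a failed write.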

Comprehensive ecosystem

DLF integrates with Alibaba Cloud stream and batch processing engines and AI products out of the box, which simplifies operations.

Architecture

DLF's architecture has three layers:

| Layer | Capabilities |
| --- | --- |
| Metadata management | View and manage metadatabases from the console, create new metadatabases, and integrate with third-party applications. |
| Permission management | Enforce access control at three levels: Catalog, database, and table. |
| Storage optimization | Run lakehouse table optimization strategies to lower storage costs and improve query efficiency. |

Benefits

| Benefit | Description |
| --- | --- |
| Fully managed | Available out of the box with no infrastructure to provision or maintain. Supports the full data lifecycle with unified Paimon metadata and storage management. |
| Enterprise-level security | Dual control over API and data permissions across multiple abstraction levels keeps data secure and compliant. |
| Flexible optimization strategies | Configurable file compaction and data cleanup strategies improve access performance and lower storage costs. |
| Rich ecosystem | Built on deep Paimon integration, DLF connects your data lakehouse to Alibaba Cloud compute engines and AI products through a single managed service. |

Use cases

Data lakehouse

A data lakehouse handles diverse data types while delivering high-performance analytics. Use DLF to process large volumes of historical and real-time data and share them across teams as a governed resource, with access controls set separately for each team.

Traditional big data

DLF supports common big data workloads: offline big data analysis, real-time analysis, machine learning, and log file analysis. Unified metadata and storage management simplify building and governing your data lake from day one.