MaxCompute provides a data lake analytics solution that lets you create management objects to define the metadata and data access methods for external data sources. An External Project or External Schema can map to an external data source's Catalog, Database, or Schema, allowing you to directly access all tables within them. This solution breaks down the silos between a data lake and a data warehouse. It combines the flexibility and rich multi-engine ecosystem of a data lake with the enterprise-grade capabilities of a data warehouse to build an integrated data management platform. This feature is in public preview.
Data warehouse and data lake
| Category | Capabilities |
| --- | --- |
| Data warehouse | A data warehouse emphasizes the management and constraints on structured and semi-structured data. It relies on strong management to achieve better computing performance and more standardized management capabilities. |
| Data lake | A data lake emphasizes open data storage and common data formats. It supports multiple engines that produce or consume data as needed. To ensure flexibility, it provides only weak management capabilities. It is compatible with unstructured data and supports a schema-on-read approach, offering a more flexible way to manage data. |
MaxCompute data warehouse
MaxCompute is a cloud-native data warehouse based on a serverless architecture. You can perform the following operations:
Model a data warehouse using MaxCompute.
Use extract, transform, and load (ETL) tools to load and store data in modeled tables with defined schemas.
Process large-scale data in the data warehouse using a standard SQL engine, and analyze the data using the MaxQA or Hologres OLAP engine.
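The operations above can be pictured with a minimal MaxCompute SQL sketch; the table and column names below are hypothetical:

```sql
-- Hypothetical modeled table with a defined schema in the warehouse
CREATE TABLE IF NOT EXISTS dwd_orders (
    order_id  BIGINT,
    amount    DECIMAL(18, 2)
)
PARTITIONED BY (ds STRING);

-- Large-scale batch processing with standard SQL
SELECT ds, SUM(amount) AS total_amount
FROM dwd_orders
WHERE ds = '20240101'
GROUP BY ds;
```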
Scenarios for MaxCompute with data lakes and federated queries
In a data lake scenario, data resides in the lake and is produced or consumed by various engines. The MaxCompute computing engine can act as one of these engines to process and use the data. In this case, MaxCompute needs to read data produced by upstream sources in the data lake, be compatible with various mainstream open source data formats, perform calculations within its engine, and produce data for downstream workflows.
As a secure, high-performance, and cost-effective data warehouse that aggregates high-value data, MaxCompute also needs to retrieve metadata and data from the data lake. This allows for in-engine computation on external data and federated queries with internal data to extract value and consolidate it into the data warehouse.
In addition to data lakes, MaxCompute as a data warehouse also needs to retrieve data from various other external data sources, such as Hadoop and Hologres, to perform federated queries with its internal data. In federated query scenarios, MaxCompute must also support reading metadata and data from external systems.
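A federated query of this kind can be sketched as a join between an internal warehouse table and a federated foreign table; all project, schema, and table names below are hypothetical:

```sql
-- dwd_users lives in the MaxCompute warehouse;
-- ext_holo_schema.user_events is mapped from an external Hologres source
SELECT u.user_id, COUNT(*) AS event_cnt
FROM my_project.default.dwd_users u
JOIN my_project.ext_holo_schema.user_events e
  ON u.user_id = e.user_id
GROUP BY u.user_id;
```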
MaxCompute data lake analytics
MaxCompute data lake analytics is built on the MaxCompute computing engine.
It supports access to Alibaba Cloud metadata services or object storage services over the interconnected cloud product network. If a data source is in a VPC, you can access it through a network connection.
It lets you create management objects to define the metadata and data access methods for external data sources. An External Project or External Schema can map to an external data source's Catalog, Database, or Schema, allowing you to directly access all tables within that scope.

Network connectivity
MaxCompute can use a network connection (Networklink) to access data sources in VPCs, such as E-MapReduce (EMR) clusters and ApsaraDB RDS instances (available soon). For more information, see Network Connection Flow. Data Lake Formation (DLF), Object Storage Service (OSS), and Hologres are located in an interconnected network of cloud services, so MaxCompute can directly access data in these services without configuring Networklink.
Foreign Server
A Foreign Server contains metadata and data access information, including its credentials, location, and connection protocol. MaxCompute uses a Foreign Server to connect to and use the metadata and data from a data source. A Foreign Server is a tenant-level management object defined by the tenant administrator.
External Schema
An External Schema is a special type of schema in a MaxCompute data warehouse project. As shown in the preceding figure, it can map to a data source's Database (for DLF_legacy or Hive scenarios) or Schema (for Hologres scenarios). This allows you to directly access tables and data within that scope. Tables that are not created in MaxCompute metadata but are mapped from an external data source through an External Schema are known as federated foreign tables (Mounted Tables).
A federated foreign table does not store metadata within MaxCompute. Instead, MaxCompute retrieves the metadata in real time from the metadata service specified in the Foreign Server object. When you run a query, you do not need to create an external table using a DDL statement. You can directly reference the original table by using the project name and External Schema name as a namespace. If the table structure or data in the source changes, the federated foreign table immediately reflects the latest state. The data source level an External Schema maps to depends on the data source's hierarchy and the access level defined in the Foreign Server.
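For example, the namespace-style reference described above looks like the following; no `CREATE EXTERNAL TABLE` DDL is needed, and the project, schema, and table names are hypothetical:

```sql
-- Reference the mounted table directly through
-- <project>.<external_schema>.<table>
SELECT order_id, amount
FROM my_project.ext_hive_sales.orders
WHERE ds = '20240101'
LIMIT 100;
```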
External Project
In Data Lakehouse Solution 1.0, an External Project used a two-layer model. Similar to an External Schema, it mapped to a data source's Database (for DLF_legacy or Hive scenarios) or Schema (for Hologres scenarios) and required a data warehouse project as the runtime environment to read and compute external data. However, mapping a Database or Schema at the project level resulted in too many External Projects. Additionally, MaxCompute now recommends a three-layer project model to better align with the three-layer Catalog hierarchy of external data sources. The two-layer External Projects from Data Lakehouse Solution 1.0 are not compatible with the new three-layer data warehouse projects. Therefore, MaxCompute is phasing out External Projects from Data Lakehouse Solution 1.0. You can migrate existing External Projects to External Schemas. For migration details, see Migrate External Projects in Data Lakehouse Solution 1.0 to External Schemas in Data Lakehouse 2.0.
In data lake analytics, the new External Project directly maps to a three-layer data source's Catalog (for DLF scenarios) or Database (for Hologres scenarios). This provides direct visibility into the Databases under a DLF Catalog or the Schemas under a Hologres Database. This intermediate level, which is mapped directly without being created in MaxCompute, is called a Mounted Schema. You can then access the source tables as federated foreign tables.
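Under the three-layer model, a federated foreign table is addressed through the External Project and a Mounted Schema; the names below are hypothetical:

```sql
-- ext_dlf_project maps to a DLF Catalog; sales_db is a Mounted Schema
-- (a Database under that Catalog), visible without any DDL in MaxCompute
SELECT COUNT(*)
FROM ext_dlf_project.sales_db.orders;
```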
| Data source type | Foreign server hierarchy | External schema mapping | External project mapping | Legacy external project mapping | Authentication method |
| --- | --- | --- | --- | --- | --- |
| DLF_legacy+OSS | Region-level DLF and OSS services | DLF Catalog.Database | Not supported | DLF Catalog.Database | RAMRole |
| Hive+HDFS | E-MapReduce instance | Hive Database | Not supported | Hive Database | No authentication |
| Hologres | Database of a Hologres instance | Schema | - | Not supported | RAMRole |
| Hologres | Database of a Hologres instance | - | Database | Not supported | SLR and current user identity authentication |
| DLF | Region-level DLF service | Not supported | DLF Catalog | Not supported | SLR and current user identity authentication |
| Filesystem Catalog | Paimon Catalog-level directory on OSS | Not supported | Catalog parsed from a Paimon Catalog-level directory | Not supported | RAMRole |
Different data sources support various types of authentication. MaxCompute will support more authentication methods in future releases, such as using the current user's identity to access Hologres or using Kerberos authentication to access Hive.