
Alibaba Cloud AnalyticDB: The AI-Powered Data Lakehouse for a Unified AI+BI Experience

This article introduces Alibaba Cloud AnalyticDB as an AI-powered, cloud-native data lakehouse that unifies AI and BI workflows.

Introduction

Alibaba Cloud AnalyticDB for MySQL is a cloud-native data warehouse. Built for high performance in the era of real-time data warehousing, it can efficiently process and analyze petabytes of structured data. In recent years, to embrace the big data wave, AnalyticDB expanded from a traditional data warehouse into a data lakehouse. It added support for lake formats such as Paimon, Iceberg, Delta Lake, and Hudi, providing database-level performance, reliability, and manageability for open data lakes. This solidified its foundation as a unified lakehouse, empowering large-scale, SQL-centric data processing and BI analytics.

However, the recent explosion of AI applications has created more diverse demands for data storage, management, and development, such as native storage and processing of unstructured data and the construction of workflows for ML, model fine-tuning, and inference. Traditional data lakehouses can no longer meet these needs and face several key pain points:

Difficulty Processing Unstructured and Vector Data: Unstructured data is often stored separately from its metadata, leading to data consistency issues. Accessing this data requires reading the path first, then the original file, resulting in secondary I/O overhead. For vectors, there is no direct and efficient way to perform approximate nearest neighbor search, forcing reliance on external vector databases.
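To make the vector-search gap concrete, here is a minimal sketch of the approximate nearest neighbor (ANN) lookup that an external vector database, or a lakehouse-native vector index, would provide. Brute-force cosine similarity in NumPy stands in for a real ANN index such as IVF or HNSW; all sizes and names are illustrative.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k vectors most similar to `query` by cosine similarity."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                      # cosine similarity per row
    return np.argsort(-scores)[:k]     # highest similarity first

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))          # e.g. image/text embeddings
query = embeddings[42] + 0.01 * rng.normal(size=128)  # slightly perturbed copy
print(top_k_cosine(query, embeddings, k=3))        # index 42 should rank first
```

A real index avoids the full scan above by pruning the search space, which is exactly what a lakehouse without native vector support cannot do.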

I/O Bottlenecks in Data Access: Machine learning scenarios require fast, small-batch random access. Traditional data lakehouses are optimized for full scans and large-batch sequential reads, which are suitable for aggregation operations like JOIN, but cause severe read I/O amplification and high latency in ML contexts.
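The amplification effect is easy to quantify with a back-of-the-envelope calculation. The sizes below are hypothetical but typical: a scan-optimized format must load an entire row group to serve one record, while a random-access-optimized format reads roughly just the record itself.

```python
# Hypothetical sizes illustrating read amplification for a single-record fetch.
ROW_GROUP_BYTES = 8 * 1024 * 1024   # assumed Parquet row-group size
RECORD_BYTES = 2 * 1024             # assumed size of one training sample

# A point lookup in a scan-optimized format pays for the whole row group.
amplification = ROW_GROUP_BYTES / RECORD_BYTES
print(f"Bytes read per useful byte: {amplification:.0f}x")  # 4096x
```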

Inefficient Heterogeneous Resource Scheduling: Tasks in AI development workflows often use both CPU and GPU resources simultaneously. Traditional data lakehouses cannot coordinate the scheduling of these heterogeneous resources across the entire pipeline, leading to low resource utilization.

Fragmented Python Ecosystem: Traditional data lakehouses are primarily centered around SQL and DataFrame APIs, lacking native Python support. For instance, when passing a DataFrame to a PyTorch DataLoader, the data must cross two distinct technology ecosystems (JVM and Python), incurring extra serialization and deserialization overhead.

Complex Multimodal Data Management: Storing structured data, unstructured data like images, and vector data separately makes management complex and associative access difficult.

Driven by these AI trends and pain points, AnalyticDB gradually evolved from a data lake warehouse into an AI-powered data lake warehouse. It has built an AI pipeline with AnalyticDB Ray/Spark+Lance as the core. By adopting the AI-native Lance storage format, which is based on Apache Arrow, data can be read and written directly by PyTorch and TensorFlow without data copying. Furthermore, by deeply integrating the two major open-source computing engines, Ray and Spark, it provides a comprehensive product solution that covers everything from data processing to distributed ML, model fine-tuning, and inference. This breaks down the barriers between AI and BI, creating the shortest path from data to intelligence.

| | Traditional Data Lake Warehouse | AI Data Lake Warehouse |
|---|---|---|
| Scenarios | Data analysis, such as ETL, ad-hoc queries, and BI reporting | In addition to BI scenarios, supports AI development, such as RAG, feature engineering, reinforcement learning, model fine-tuning, and model inference |
| Target Users | Data analysts and data engineers | Data professionals with AI requirements |
| Data Types and Storage | Structured and semi-structured data (tables, logs, JSON), primarily in columnar formats like Parquet; for unstructured data, only metadata is stored and raw data relies on external storage | Multimodal data; in addition to the above, native support for unstructured data (text, images, audio) and vectors |
| Data Access Patterns | Optimized for full scans and large-batch sequential reads, suitable for aggregation operations like GROUP BY and JOIN | Optimized for high-speed, small-batch random reads, suitable for batch reading and shuffling in ML |
| Typical Workloads | Primarily SQL and DataFrame-based, with a focus on CPU optimization | Primarily native Python development, with a focus on GPU optimization |
| AI/ML Support | Requires external tools or platforms and often involves moving data out of the lakehouse | Built-in AI development capabilities with a unified toolchain and environment, enabling BI analytics, ML, model deployment, and inference on a single copy of data |

AnalyticDB AI Data Lake Warehouse Solution

AnalyticDB Ray + Lance: Building an AI Pipeline from Data to Models

The AnalyticDB storage layer unifies multimodal data ETL and ML by introducing the AI-native Lance data format, with the AnalyticDB Ray and AnalyticDB Spark computing engines hosted in the layer above. This allows you to build and optimize AI pipelines more efficiently.

[Figure 1]

Introducing AnalyticDB Ray

Ray is an open-source distributed computing framework designed for AI and high-performance computing. It abstracts distributed scheduling with a simple API, allowing single-machine tasks to be scaled to thousand-node clusters with just a few lines of code. It includes built-in modules like Ray Tune, Ray Train, and Ray Serve, and is seamlessly compatible with the TensorFlow and PyTorch ecosystems. With features covering the entire AI development lifecycle and flexible scheduling of heterogeneous resources, it is ideal for business scenarios like multimodal processing, search and recommendation, financial risk control, and graph computing.

AnalyticDB Ray is a fully managed Ray service offered by AnalyticDB for MySQL. Building on the rich open-source Ray ecosystem, it provides full-stack enhancements to the Ray kernel and service capabilities. Developers' applications can benefit from the cost-performance optimizations of AnalyticDB Ray without worrying about cluster operations and maintenance, while seamlessly integrating with the AnalyticDB lakehouse to build a unified Data + AI architecture.

Introducing Lance on AnalyticDB Lake Storage

Lance is an open-source storage format specifically optimized for multimodal data. It is designed to store large-scale unstructured data (like images, videos, and embedding vectors) alongside structured data (like metadata and labels). It aims to solve the performance bottlenecks encountered by traditional formats like Parquet when handling multimodal data. Lance's core objective is to provide efficient data storage and processing capabilities for AI applications, making it especially suitable for scenarios that require processing large amounts of images, text, and embedding vectors. Its key features include high-performance random point lookups, vector search, zero-copy schema evolution, and rich ecosystem integration.

Lance on AnalyticDB Lake Storage is the integration of Lance into AnalyticDB Lake Storage. Beyond managed storage, it is also designed to provide continuously optimized, AI-oriented data processing and application solutions.

Data+AI Zero ETL: Through APS (AnalyticDB Pipeline Service), an internal service, AnalyticDB Lake Storage automatically detects incremental data files. For example, it can automatically perform frame extraction on newly added video files and call Alibaba Cloud's Bailian service or external services to generate image descriptions and image-text vector embeddings. Users can also develop custom operators. Additionally, a built-in data format conversion service allows one-click conversion of existing multimodal data (images, text, audio, video) to the Lance format, eliminating the need to manually write complex ETL logic and enabling rapid construction of AI data processing pipelines.

Deep Integration of AnalyticDB Ray/Spark + Lance: The Lance data format breaks down the barriers between Python-centric AI development (Ray) and SQL-centric data engineering (Spark). Data engineers can use Spark for large-scale ETL and feature engineering, writing AI-ready data into Lance. AI scientists can then use Ray to directly perform subsequent ML and vector retrieval tasks on the same data with zero copying.

The Future is Bright: Moving Toward a Zero-Ops Multimodal Data Lake: Continuing the evolution of its data lake foundation, AnalyticDB is also building services for Lance-based file ingestion, metadata management, and automatic compaction, making multimodal data easy to discover, manage, and use.

Practical Application Scenarios

Finally, let's introduce some use cases and solutions for AnalyticDB Ray + Lance.

Mixed Graphic and Text Storage

[Figure 2]

Solution: In multimodal scenarios, storing images together with related text descriptions, labels, and IDs is common. AnalyticDB Spark is used to process and merge the image and text data. The resulting DataFrame, containing all information (image binaries, text, IDs, etc.), is then written directly to the lake storage in Lance format.

Value:

• It solves the data integrity and consistency problems of traditional separate storage solutions by keeping images and their metadata in the same file for easy management.

• Efficient Reading. The traditional approach stores only the image URL in a table, which requires an additional I/O to fetch the image data on every access. Lance's multimodal storage reads metadata and images together in a single pass, reducing I/O operations and path-lookup time. This is especially suitable for scenarios that process large numbers of images in one batch (such as Machine Learning Platform for AI datasets). In customer scenario testing, processing performance improved 2-4x compared to Parquet.

Image Tagging and Fine-Tuning

[Figure 3]

Solution: Image tagging—adding one or more descriptive labels to raw images—is a common operation in AI. AnalyticDB Ray Data is used to load the source data and convert it to Lance format. A tagging inference service is deployed using Ray Serve, and the new tags are added as a new column in the Lance dataset. Furthermore, a fine-tuning framework like LLaMA Factory can be deployed for subsequent development.

Value:

• Leveraging Lance's zero-copy schema evolution feature, columns can be added efficiently without rewriting the entire data file—only a small new file is needed. In customer tests, this showed a 3x performance improvement compared to Parquet.

• The AnalyticDB Ray toolchain provides a one-stop solution for data processing, tagging, and fine-tuning.

AnalyticDB Ray Pipeline for Streaming, High-Concurrency Scheduling

[Figure 4]

Solution: In autonomous vehicle data scenarios, petabytes of multimodal clip data (video, point clouds, radar, GPS, vehicle control signals) must be processed. The AnalyticDB Ray Pipeline is used for streaming video file segmentation and tagging, with the processed data stored in Lance format in the lake storage. Based on the tagged data, business logic can trigger the Ray Pipeline via Airflow to perform secondary processing on the Lance data as needed.

Value:

• Compared to the inefficient heterogeneous resource usage of traditional schedulers, AnalyticDB Ray allows CPU and GPU tasks at different stages to run in parallel, eliminating wait times, reducing resource idleness, and maximizing processing throughput. Combined with profile-based, fine-grained scheduling of heterogeneous CPU and GPU resources, GPU utilization can be increased to over 90%.

• Task scheduling throughput can reach 400+ tasks per second, and it scales linearly as more resources are added.

ApsaraDB

562 posts | 178 followers
