Unlocking In-Warehouse AI Pipelines: How AnalyticDB Ray Streamlines Development and O&M with Integrated Multimodal ETL+ML

This article introduces AnalyticDB Ray, a managed Ray service that integrates multimodal ETL and ML to streamline in-warehouse AI pipeline development and O&M.

Introduction

In today's data-driven era, processing and analyzing multimodal data, including text, images, audio, and video, is becoming increasingly important. Integrating multimodal data ETL with machine learning (ML) lets teams build and optimize AI pipelines more efficiently, enabling a seamless transition from data to intelligent decisions. This article describes the fully managed Ray service provided by Alibaba Cloud AnalyticDB for MySQL, a cloud-native data warehouse. The service unlocks the potential of AI pipelines inside the data warehouse by integrating multimodal data ETL with ML.

1. Open-Source Ray: The Cornerstone of Distributed Computing in the AI Era

Open-source Ray is a distributed computing framework designed specifically for AI and high-performance computing. It originated at UC Berkeley's RISELab, the successor to the AMPLab that produced Apache Spark. Ray abstracts distributed scheduling behind a concise API: with only a few lines of code, you can scale a single-machine task to a thousand-node cluster and schedule remote resources as if calling a local function. Built-in libraries such as Ray Tune, Ray Train, and Ray Serve work with the TensorFlow and PyTorch ecosystems and support scenarios such as reinforcement learning and big data processing. An active open-source community and backing from companies such as Anyscale make Ray a practical tool for building AI applications quickly.
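As a minimal sketch of the "call it like a local function" model described above, the snippet below turns an ordinary Python function into a Ray task. The helper names square and square_all_with_ray are illustrative, and Ray is imported lazily so that the plain function stays usable where Ray is not installed.

```python
def square(x):
    """Ordinary Python function; nothing Ray-specific here."""
    return x * x


def square_all_with_ray(values):
    """Run square() as parallel Ray tasks and return the results in input order."""
    import ray  # imported lazily; requires `pip install ray`

    # Starts a local Ray runtime, or reuses one that is already initialized.
    ray.init(ignore_reinit_error=True)

    # Wrapping the function is equivalent to decorating it with @ray.remote.
    remote_square = ray.remote(square)

    # .remote() returns object references immediately; the tasks run in parallel.
    futures = [remote_square.remote(v) for v in values]

    # ray.get() blocks until every task has finished.
    return ray.get(futures)
```

With Ray installed, square_all_with_ray(range(4)) runs the four calls as distributed tasks. The same code runs unchanged on a laptop or on a multi-node cluster, which is the property the paragraph above highlights.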

Core Highlights of Ray:

● A Unified Framework for All Distributed Computing Scenarios

Heterogeneous Scheduling: Supports hybrid, elastic scheduling of CPU, GPU, and FPGA resources.
Full-Stack Workloads: Handles the entire data and AI pipeline, including data preprocessing, model inference, and fine-tuning, as well as distributed execution of Python tasks.
Framework Compatibility: Integrates with mainstream ecosystems like Spark, TensorFlow/PyTorch, and Hugging Face.
Broad Scenario Coverage: Powers core business scenarios such as multimodal processing, search and recommendation, financial risk control, and graph computing.

● Dynamic Resource Scheduling & Efficient Execution: Enables fine-grained, elastic resource scheduling, allocating CPU, GPU, memory, and custom resources on demand. It supports efficient data exchange through formats like Apache Arrow and TensorFlow Datasets to accelerate data processing.

● Multi-Cloud and Large-Scale Scalability: Supports containerized deployment via Kubernetes, Docker Swarm, and others, allowing seamless use of multi-cloud resources. It is ideal for EB-scale data processing and handling models with hundreds of billions of parameters.
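The on-demand resource allocation mentioned in the highlights above corresponds to per-task resource requests in open-source Ray. The sketch below assumes a hypothetical two-stage pipeline. The stage names and amounts in STAGE_RESOURCES are illustrative and are not values taken from AnalyticDB Ray.

```python
# Hypothetical per-stage resource requests (illustrative values only).
STAGE_RESOURCES = {
    "preprocess": {"num_cpus": 2},                # CPU-only stage
    "inference": {"num_cpus": 1, "num_gpus": 1},  # one GPU per task
}


def submit(stage, fn, *args):
    """Submit fn as a Ray task with the resource request of the given stage."""
    import ray  # imported lazily; requires `pip install ray`

    # .options() overrides the resource request for this particular submission.
    task = ray.remote(fn).options(**STAGE_RESOURCES[stage])

    # Ray places the task only on a node that has the requested resources free.
    return task.remote(*args)
```

A call such as submit("inference", my_model_fn, batch) returns an object reference, where my_model_fn is any user-defined function. If no node in the cluster offers a GPU, the task stays pending until one becomes available.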

2. AnalyticDB Ray: A Lightweight, One-Stop Data+AI Service

Open-source Ray gives developers a highly flexible distributed computing framework. In real production environments, however, enterprises often struggle with distributed job optimization, fine-grained resource scheduling, cluster O&M, stability, and high availability. This is where AnalyticDB Ray comes in.

AnalyticDB Ray is the fully managed Ray service launched by AnalyticDB for MySQL, built on the rich ecosystem of open-source Ray. It has been refined in scenarios such as multimodal processing, embodied AI, search and recommendation, and financial risk control, with full-stack enhancements to the Ray kernel and service capabilities. Developers can focus on their applications without worrying about cluster O&M, while benefiting from the cost-performance optimizations in the AnalyticDB Ray core. The service also integrates seamlessly with the AnalyticDB lakehouse platform to form an integrated Data+AI architecture, accelerating enterprise-scale adoption of AI.

Overview of AnalyticDB Ray's Core Enhancements Compared to Open-Source Ray:

Category	Feature	Description
Ease of use	Automatic RayCluster creation	The console provides one-click, GUI-based deployment¹. Users create a RayCluster by creating an AI resource group and configuring the resource specifications for the Head and Worker nodes.
	Built-in LLM toolchain	Includes built-in tools for one-click distillation, fine-tuning, inference, and evaluation of LLMs for reinforcement learning.
	Built-in embodied AI toolchain	As a resource scheduling foundation for the Python ecosystem, AnalyticDB Ray supports frameworks like Cosmos, NeMo Curator, and GROOT N1 for data simulation, synthesis, and model fine-tuning.
Ecosystem integration	Lance	Integrates with Lance for storing and processing multimodal data.
	LLaMA-Factory	Supports distributed fine-tuning via llama-factory-on-ray.
	Spark	Supports hybrid resource deployment of Spark on Ray via RayDP.
Cost-effectiveness	Multi-tenant/job resource isolation	Resolves resource isolation and sharing issues between tenants and jobs through vClusters and shared resource groups.
	Deep Data+AI integration	AnalyticDB natively supports PB-scale data storage and analysis. Combined with Ray, it connects the entire pipeline from data processing and multi-source feature engineering to model inference. It also improves resource utilization by allowing Ray, AnalyticDB real-time analytics, and Spark workloads to share resources.
	Auto-scaling	Automatically scales GPU/CPU resources up or down based on workload. It also supports low-cost Spot instances.
	Elastic caching	Elastically provisions caching service resources based on the data volume and bandwidth requirements of Ray's read/write operations.
	Fine-grained resource scheduling	Automatically schedules tasks based on node resource utilization, adds isolation mechanisms for GPU oversubscription across tenants, and supports affinity/anti-affinity scheduling policies between tasks.
Stability & HA	Seamless migration and self-healing	Supports seamless rolling upgrades for clusters and automatic recovery from node failures.
	High availability	Supports primary/standby Head nodes for high availability.
Observability	Monitoring dashboard	Provides persistent task dashboards and unified observability management across multiple clusters.

[1] https://www.alibabacloud.com/help/en/analyticdb/analyticdb-for-mysql/user-guide/managed-ray-service

Auto-scaling for Heterogeneous Resources: Maximizing GPU Utilization

● Streaming Computation: Uses a streaming computation model in which intermediate data is kept in the Ray object store, avoiding the intermediate disk writes that batch mode incurs.

● Heterogeneous Auto-scaling: For data processing that requires both CPU and GPU resources, it independently and automatically scales CPU and GPU resources, maximizing the utilization of scarce GPU resources.
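The idea of scaling CPU and GPU worker groups independently can be sketched in a few lines. This is an illustration under stated assumptions, not AnalyticDB Ray's actual scaling algorithm: the group names, limits, and the desired_workers and plan_worker_groups helpers are all hypothetical.

```python
import math


def desired_workers(pending_units, units_per_worker, min_workers, max_workers):
    """Workers needed to serve the pending demand, clamped to the group's limits."""
    needed = math.ceil(pending_units / units_per_worker)
    return max(min_workers, min(max_workers, needed))


def plan_worker_groups(demand, groups):
    """Size each worker group from the demand for its own resource type only."""
    return {
        name: desired_workers(
            demand.get(cfg["resource"], 0),
            cfg["units_per_worker"],
            cfg["min"],
            cfg["max"],
        )
        for name, cfg in groups.items()
    }


# Hypothetical worker groups: 8 CPUs per CPU worker, 1 GPU per GPU worker.
GROUPS = {
    "cpu-workers": {"resource": "CPU", "units_per_worker": 8, "min": 1, "max": 20},
    "gpu-workers": {"resource": "GPU", "units_per_worker": 1, "min": 0, "max": 4},
}
```

Because each group is sized only from the demand for its own resource, a burst of CPU-bound preprocessing adds CPU workers without adding GPU workers, and the GPU group can shrink to zero when no GPU work is queued.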

Enterprise-Grade Stability and HA: Automatic Head Node Failover

● Head Node HA: Fails over to the standby Head node within 5 seconds, keeping inference services, high-quality tasks, and multi-tenant clusters stable.

● Metadata: The metadata store supports hot standby and cross-region disaster recovery.

Deep Observability: Boosting Development Efficiency

● Reinforcement Learning Observability: A visual monitoring dashboard provides real-time tracking of task status. For reinforcement learning scenarios, it supports Actor/Task-level topology analysis, improving problem diagnosis efficiency by 80%.

3. Practical Use Cases

Business Intelligence

Scenario: Predicting Click-Through Rates (CTR) for ad recommendations to identify target audiences. Offline batch inference is run at night, and the prediction results are delivered to the business team's AnalyticDB data warehouse tables.

Solution:

AI pipeline: AnalyticDB Lakehouse → AnalyticDB ETL → AnalyticDB Ray ML (for model training and saving).
Inference: AnalyticDB Lakehouse → AnalyticDB ETL → AnalyticDB Ray (for offline batch inference) → AnalyticDB warehouse tables → Business services.

Benefits:

Heterogeneous Auto-scaling: The offline inference scenario uses heterogeneous worker groups, allowing CPU and GPU resources to scale independently. This increased GPU utilization from less than 5% to 40%.
Object Store Auto-scaling: The object store dynamically scales its memory based on data volume, improving data processing performance by 2 to 3 times.
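The offline batch inference step in this pipeline can be sketched with Ray Data. The example assumes a toy logistic-regression scorer in place of a trained CTR model. The WEIGHTS values, the column names, and the CtrPredictor and run_batch_inference names are hypothetical and only illustrate the pattern.

```python
import math

# Hypothetical logistic-regression weights standing in for a trained CTR model.
WEIGHTS = {"bias": -2.0, "past_ctr": 4.0, "is_mobile": 0.5}


class CtrPredictor:
    """Callable class: Ray Data builds it once per worker, then feeds it batches."""

    def __init__(self, weights=WEIGHTS):
        self.weights = weights  # a real job would load the saved model here

    def __call__(self, batch):
        w = self.weights
        scores = []
        for past_ctr, is_mobile in zip(batch["past_ctr"], batch["is_mobile"]):
            z = w["bias"] + w["past_ctr"] * past_ctr + w["is_mobile"] * is_mobile
            scores.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid -> click probability
        return {"user_id": list(batch["user_id"]), "ctr": scores}


def run_batch_inference(rows, concurrency=2):
    """Score rows in parallel with a fixed pool of CtrPredictor workers."""
    import ray.data  # imported lazily; requires `pip install "ray[data]"`

    ds = ray.data.from_items(rows)
    scored = ds.map_batches(CtrPredictor, concurrency=concurrency, batch_size=1024)
    return scored.take_all()
```

In a GPU-backed job, passing num_gpus=1 to map_batches would reserve one GPU per predictor worker, while data loading and preprocessing stages continue to run on CPU workers.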

LLM Offline Batch Inference for Data Distillation

Scenario: Prepare data for training large language models.

Solution: Use Ray Data with vLLM/SGLang to deploy models like Qwen and DeepSeek for data distillation. The distilled data is then used to train the large models.

Benefits:

Cache acceleration: Data loading throughput increased by 2-3 times.
Scheduling Scale: A single Ray cluster can schedule fine-grained tasks with up to 40,000 actors.
Precision Quantization: The INT8-quantized DeepSeek model delivers a 50% performance improvement over the FP8 version.
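A distillation job of this kind can be sketched by running a vLLM teacher model inside Ray Data workers. The prompt template, the model name Qwen/Qwen2.5-7B-Instruct, the sampling settings, and the build_prompt, TeacherLLM, and distill names are assumptions chosen for illustration. They are not the configuration used in the case above.

```python
def build_prompt(question):
    """Format one raw question as a teacher-model prompt (hypothetical template)."""
    return f"Answer the question step by step.\nQuestion: {question}\nAnswer:"


class TeacherLLM:
    """Callable class: each Ray Data worker loads the model once, then serves batches."""

    def __init__(self, model_name="Qwen/Qwen2.5-7B-Instruct"):  # placeholder model
        from vllm import LLM, SamplingParams  # imported lazily; requires vLLM and a GPU

        self.llm = LLM(model=model_name)
        self.params = SamplingParams(temperature=0.7, max_tokens=512)

    def __call__(self, batch):
        prompts = [build_prompt(q) for q in batch["question"]]
        outputs = self.llm.generate(prompts, self.params)  # one result per prompt, in order
        return {
            "question": list(batch["question"]),
            "prompt": prompts,
            "response": [o.outputs[0].text for o in outputs],
        }


def distill(questions, num_replicas=1):
    """Build a Ray Data pipeline that generates teacher responses for each question."""
    import ray.data  # imported lazily; requires `pip install "ray[data]"`

    ds = ray.data.from_items([{"question": q} for q in questions])
    return ds.map_batches(
        TeacherLLM, concurrency=num_replicas, num_gpus=1, batch_size=32
    )
```

The returned dataset is lazy. Writing it out, for example with write_parquet, triggers execution and produces the question/response pairs that serve as training data for the student model.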

Multimodal Data Processing and Distributed Fine-Tuning

Scenario: Creating personalized, interactive multimodal experiences.

Solution: With AnalyticDB Ray at the core, the pipeline integrates Lance and uses Ray Data to strengthen distributed processing and structuring of image-text data. It also integrates LLaMA-Factory to provide distributed fine-tuning of the Qwen-VL multimodal model on Ray.

Benefits:

One-stop solution: Covers the full workflow from data labeling to model fine-tuning.
Improved fine-tuning efficiency: In distributed mode, LLaMA-Factory on Ray improves fine-tuning efficiency by 3 to 5 times.

Official documentation: https://www.alibabacloud.com/help/en/analyticdb/analyticdb-for-mysql/user-guide/managed-ray-service
