
Unlocking In-Warehouse AI Pipelines: How AnalyticDB Ray Streamlines Development and O&M with Integrated Multimodal ETL+ML

This article introduces AnalyticDB Ray, a managed Ray service that integrates multimodal ETL and ML to streamline in-warehouse AI pipeline development and O&M.

Introduction

In today's data-driven era, processing and analyzing multimodal data (text, images, audio, video, and other data types) is becoming increasingly important. Integrating multimodal data ETL with machine learning (ML) makes it possible to build and optimize AI pipelines more efficiently and to move seamlessly from data to intelligent decisions. This article describes the fully managed Ray service provided by Alibaba Cloud AnalyticDB for MySQL, a cloud-native data warehouse. The service unlocks the potential of AI pipelines inside the data warehouse by integrating multimodal data ETL with ML.

1. Open-Source Ray: The Cornerstone of Distributed Computing in the AI Era

Open-source Ray is a distributed computing framework designed for AI and high-performance computing. It originated at UC Berkeley's RISELab, the successor to the AMPLab that produced Apache Spark. Ray hides distributed scheduling behind a concise API: with only a few lines of code, you can scale a single-machine task out to a thousand-node cluster and invoke remote resources as if you were calling a local function. Built-in libraries such as Ray Tune, Ray Train, and Ray Serve work with the TensorFlow and PyTorch ecosystems and support scenarios such as reinforcement learning and big data processing. An active open-source community and backing from companies such as Anyscale make it a practical tool for building AI applications quickly.
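The "call it like a local function" idea can be sketched in a few lines. This is a minimal illustration, assuming Ray is installed (`pip install ray`); the function names `count_tokens` and `count_tokens_with_ray` are invented for this example and do not come from the article.

```python
def count_tokens(text: str) -> int:
    """Ordinary Python function: count whitespace-separated tokens."""
    return len(text.split())


def count_tokens_with_ray(texts: list[str]) -> list[int]:
    """Run count_tokens as parallel Ray tasks, one task per input string."""
    import ray  # imported lazily so count_tokens stays usable without Ray

    ray.init(ignore_reinit_error=True)       # start or reuse a Ray runtime
    remote_count = ray.remote(count_tokens)  # wrap the plain function as a Ray task
    refs = [remote_count.remote(t) for t in texts]  # returns futures immediately
    return ray.get(refs)                     # block until every result is ready
```

With Ray available, `count_tokens_with_ray(["a b", "c"])` returns the counts in input order. The function body does not change when `ray.init` connects to an existing cluster instead of starting a local runtime, because Ray decides where each task executes.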

Core Highlights of Ray:

A Unified Framework for All Distributed Computing Scenarios

  • Heterogeneous Scheduling: Supports hybrid, elastic scheduling of CPU, GPU, and FPGA resources.
  • Full-Stack Workloads: Handles the entire data and AI pipeline, including data preprocessing, model inference, and fine-tuning, as well as distributed execution of Python tasks.
  • Framework Compatibility: Integrates with mainstream ecosystems like Spark, TensorFlow/PyTorch, and Hugging Face.
  • Broad Scenario Coverage: Powers core business scenarios such as multimodal processing, search and recommendation, financial risk control, and graph computing.


Dynamic Resource Scheduling & Efficient Execution: Enables fine-grained, elastic resource scheduling, allocating CPU, GPU, memory, and custom resources on demand. It supports efficient data exchange through formats like Apache Arrow and TensorFlow Datasets to accelerate data processing.
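To make on-demand allocation of CPU, GPU, memory, and custom resources concrete, here is a toy first-fit placement sketch. It is not Ray's scheduler: the `Node` class, the `place` function, and the resource name `video_decoder` are all hypothetical. In Ray itself, tasks and actors declare such requirements through options like `num_cpus`, `num_gpus`, and `resources`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free: dict  # free capacity per resource, e.g. {"CPU": 8, "GPU": 1}


def place(task_needs: dict, nodes: list) -> Optional[str]:
    """Reserve resources on the first node that satisfies every requirement."""
    for node in nodes:
        if all(node.free.get(res, 0) >= amt for res, amt in task_needs.items()):
            for res, amt in task_needs.items():
                node.free[res] -= amt  # reserve the requested amount
            return node.name
    return None  # nothing fits; an autoscaler could add a node at this point


nodes = [
    Node("cpu-worker", {"CPU": 8, "memory_gb": 32}),
    Node("gpu-worker", {"CPU": 4, "GPU": 1, "memory_gb": 64, "video_decoder": 2}),
]

decode = place({"CPU": 2, "video_decoder": 1}, nodes)  # needs a custom resource
infer = place({"CPU": 1, "GPU": 0.5}, nodes)           # fractional GPU request
second_infer = place({"GPU": 1}, nodes)                # only 0.5 GPU remains
```

Here the first two tasks land on `gpu-worker`, while the third cannot be placed because half of the only GPU is already reserved.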

Multi-Cloud and Large-Scale Scalability: Supports containerized deployment via Kubernetes, Docker Swarm, and others, allowing seamless use of multi-cloud resources. It is ideal for EB-scale data processing and handling models with hundreds of billions of parameters.


2. AnalyticDB Ray: A Lightweight, One-Stop Data+AI Service

Open-source Ray gives developers a highly flexible distributed computing framework. In production, however, enterprises often struggle with distributed job optimization, fine-grained resource scheduling, cluster O&M, stability, and high availability. This is where AnalyticDB Ray comes in.

AnalyticDB Ray is the fully managed Ray service launched by AnalyticDB for MySQL, built on the rich ecosystem of open-source Ray. Drawing on real-world scenarios such as multimodal processing, embodied AI, search and recommendation, and financial risk control, it adds full-stack enhancements to the Ray kernel and service capabilities. Developers can focus on their applications without worrying about cluster O&M, while benefiting from the cost-performance optimizations in the AnalyticDB Ray core. It also integrates seamlessly with the AnalyticDB lakehouse platform to form an integrated Data+AI architecture, accelerating enterprise-scale adoption of AI.


Overview of AnalyticDB Ray's Core Enhancements Compared to Open-Source Ray:

Ease of use

  • Automatically create a RayCluster: The console provides one-click, GUI-based deployment¹. Users can create a RayCluster by simply creating an AI resource group and configuring the resource specifications for the Head and Worker nodes.
  • Built-in LLM toolchain: Includes built-in tools for one-click distillation, fine-tuning, inference, and evaluation of LLMs for reinforcement learning.
  • Built-in embodied AI toolchain: As a resource scheduling foundation for the Python ecosystem, AnalyticDB Ray supports frameworks like Cosmos, NeMo Curator, and GR00T N1 for data simulation, synthesis, and model fine-tuning.

Ecosystem integration

  • Lance: Integrates with Lance for storing and processing multimodal data.
  • LLaMA-Factory: Supports distributed fine-tuning via llama-factory-on-ray.
  • Spark: Supports hybrid resource deployment of Spark on Ray via RayDP.

Cost-effectiveness

  • Multi-tenant/job resource isolation: Resolves resource isolation and sharing issues between tenants and jobs through vClusters and shared resource groups.
  • Deep Data + AI Integration: AnalyticDB natively supports PB-scale data storage and analysis. Combined with Ray, it connects the entire pipeline from data processing and multi-source feature engineering to model inference. It also improves resource utilization by allowing Ray, AnalyticDB real-time analytics, and Spark workloads to share resources.
  • AutoScaling: Automatically scales GPU/CPU resources up or down based on workload. It also supports low-cost Spot instances.
  • Elastic caching: Elastically provisions caching service resources based on the data volume and bandwidth requirements of Ray's read/write operations.
  • Fine-grained resource scheduling: Automatically schedules tasks based on node resource utilization and adds isolation mechanisms for GPU multi-tenant overselling, along with affinity/anti-affinity scheduling policies between tasks.

Stability & HA

  • Seamless migration and self-healing: Supports seamless rolling upgrades for clusters and automatic recovery from node failures.
  • High availability: Supports primary/standby Head nodes for high availability.

Observability

  • Monitoring dashboard: Provides persistent task dashboards and unified observability management across multiple clusters.

[1] https://www.alibabacloud.com/help/en/analyticdb/analyticdb-for-mysql/user-guide/managed-ray-service

Auto-scaling for Heterogeneous Resources: Maximizing GPU Utilization

Streaming Computation: Uses a streaming computation model in which intermediate data stays in the Ray object store, avoiding the intermediate disk writes that batch mode incurs.

Heterogeneous Auto-scaling: For data processing that requires both CPU and GPU resources, it independently and automatically scales CPU and GPU resources, maximizing the utilization of scarce GPU resources.
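The idea of scaling CPU and GPU resources independently can be illustrated with a small sizing function. This is only a sketch under assumed numbers: the per-worker capacities, the minimum and maximum group sizes, and the names `desired_workers` and `scale_plan` are hypothetical and do not describe AnalyticDB Ray's actual autoscaler.

```python
import math


def desired_workers(pending: int, per_worker: int, min_w: int, max_w: int) -> int:
    """Workers needed to drain `pending` tasks, clamped to [min_w, max_w]."""
    needed = math.ceil(pending / per_worker)
    return max(min_w, min(max_w, needed))


def scale_plan(pending_cpu_tasks: int, pending_gpu_tasks: int) -> dict:
    """Size the CPU and GPU worker groups independently of each other."""
    return {
        # CPU group: scale widely so preprocessing keeps the GPUs fed
        "cpu_workers": desired_workers(pending_cpu_tasks, per_worker=16, min_w=1, max_w=100),
        # GPU group: scale to zero when idle and cap at a small maximum
        "gpu_workers": desired_workers(pending_gpu_tasks, per_worker=4, min_w=0, max_w=8),
    }


busy_preprocessing = scale_plan(pending_cpu_tasks=400, pending_gpu_tasks=6)
idle_gpus = scale_plan(pending_cpu_tasks=40, pending_gpu_tasks=0)
```

Because each group is sized from its own backlog, a burst of CPU-bound preprocessing adds CPU workers without holding extra GPUs, and the GPU group can shrink to zero while CPU work continues.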


Enterprise-Grade Stability and HA: Automatic Head Node Failover

Head Node HA: Fails over within 5 seconds to keep inference services, high-priority tasks, and multi-tenant clusters stable.

Metadata: The metadata store supports hot standby and cross-region disaster recovery.
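A primary/standby arrangement of this kind is commonly driven by heartbeats. The sketch below shows that general pattern only. It is not the AnalyticDB Ray implementation, and the `HeadNode` class, the `elect_active` function, and the 5-second timeout (chosen to mirror the failover figure above) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class HeadNode:
    name: str
    last_heartbeat: float  # seconds, taken from a monotonic clock


def elect_active(primary: HeadNode, standby: HeadNode, now: float,
                 timeout_s: float = 5.0) -> HeadNode:
    """Keep the primary unless its heartbeat is older than timeout_s."""
    if now - primary.last_heartbeat <= timeout_s:
        return primary
    if now - standby.last_heartbeat <= timeout_s:
        return standby  # promote the hot standby
    raise RuntimeError("no healthy head node available")


primary = HeadNode("head-0", last_heartbeat=100.0)
standby = HeadNode("head-1", last_heartbeat=102.0)
healthy = elect_active(primary, standby, now=103.0)      # primary seen 3 s ago

standby.last_heartbeat = 106.0  # standby keeps reporting; primary has gone silent
failed_over = elect_active(primary, standby, now=107.0)  # primary silent for 7 s
```

For the promoted standby to take over without losing cluster state, it has to read the same metadata the primary was writing, which is why a hot-standby metadata store matters alongside Head node failover.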


Deep Observability: Boosting Development Efficiency

Reinforcement Learning Observability: A visual monitoring dashboard provides real-time tracking of task status. For reinforcement learning scenarios, it supports Actor/Task-level topology analysis, improving problem diagnosis efficiency by 80%.


3. Practical Use Cases

Business Intelligence

Scenario: Predicting Click-Through Rates (CTR) for ad recommendations to identify target audiences. Offline batch inference is run at night, and the prediction results are delivered to the business team's AnalyticDB data warehouse tables.

Solution:

  • AI pipeline: AnalyticDB Lakehouse → AnalyticDB ETL → AnalyticDB Ray ML (for model training and saving).
  • Inference: AnalyticDB Lakehouse → AnalyticDB ETL → AnalyticDB Ray (for offline batch inference) → AnalyticDB warehouse tables → Business services.
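The inference path above (ETL output, batch scoring, results written to a warehouse table) can be sketched end to end with the standard library. Everything here is a stand-in: the logistic-regression weights are made up, the in-memory SQLite database substitutes for an AnalyticDB table, and the table name `ctr_predictions` is hypothetical. In the real pipeline, each batch would be scored by a Ray task or actor rather than in a local loop.

```python
import math
import sqlite3

# Hypothetical model weights; the training step would normally produce these.
WEIGHTS = {"bias": -2.0, "past_clicks": 0.8, "is_mobile": 0.5}


def predict_ctr(row: dict) -> float:
    """Logistic-regression score in (0, 1) for one feature row."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * row[k] for k in ("past_clicks", "is_mobile"))
    return 1.0 / (1.0 + math.exp(-z))


def batches(rows: list, size: int):
    """Yield fixed-size slices so each one can be scored independently."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


features = [  # stand-in for the ETL output
    {"user_id": 1, "past_clicks": 0, "is_mobile": 0},
    {"user_id": 2, "past_clicks": 3, "is_mobile": 1},
    {"user_id": 3, "past_clicks": 1, "is_mobile": 1},
]

db = sqlite3.connect(":memory:")  # stand-in for a warehouse table
db.execute("CREATE TABLE ctr_predictions (user_id INTEGER PRIMARY KEY, ctr REAL)")
for batch in batches(features, size=2):
    db.executemany(
        "INSERT INTO ctr_predictions VALUES (?, ?)",
        [(r["user_id"], predict_ctr(r)) for r in batch],
    )
db.commit()

# Business services would then query the table, for example for the best target.
top_user = db.execute(
    "SELECT user_id FROM ctr_predictions ORDER BY ctr DESC LIMIT 1"
).fetchone()[0]
```

Splitting the input into independent batches is what lets a nightly job spread across however many workers the cluster currently has.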


Benefits:

  • Heterogeneous Auto-scaling: The offline inference scenario uses heterogeneous worker groups, allowing CPU and GPU resources to scale independently. This increased GPU utilization from less than 5% to 40%.
  • Object Store Auto-scaling: The object store dynamically scales its memory based on data volume, improving data processing performance by 2 to 3 times.

LLM Offline Batch Inference for Data Distillation

Scenario: Prepare data for training large language models.

Solution: Use Ray Data with vLLM/SGLang to deploy models like Qwen and DeepSeek for data distillation. The distilled data is then used to train the large models.
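The shape of a distillation job is simple: send prompts to a teacher model in batches and store each prompt with its response as a training record. The sketch below shows only that data flow. `fake_teacher` is a placeholder for a batched call to a served model, and none of the names correspond to Ray Data, vLLM, or SGLang APIs.

```python
import json
from typing import Callable, Iterable, Iterator


def distill(prompts: Iterable[str], teacher: Callable[[list], list],
            batch_size: int = 2) -> Iterator[str]:
    """Yield one JSON line per (prompt, teacher response) pair."""
    batch = []
    for prompt in prompts:
        batch.append(prompt)
        if len(batch) == batch_size:
            yield from _label(batch, teacher)
            batch = []
    if batch:  # flush the final partial batch
        yield from _label(batch, teacher)


def _label(batch: list, teacher: Callable[[list], list]) -> Iterator[str]:
    """Call the teacher once per batch and pair responses with prompts."""
    for prompt, response in zip(batch, teacher(batch)):
        yield json.dumps({"prompt": prompt, "response": response})


def fake_teacher(batch: list) -> list:
    """Placeholder for a batched call to a served teacher model."""
    return [f"answer to: {p}" for p in batch]


lines = list(distill(["q1", "q2", "q3"], fake_teacher))
```

Batching matters because inference engines reach much higher throughput when they process many prompts per call, and independent batches are what a distributed data framework can spread across GPU workers.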


Benefits:

  • Cache acceleration: Data loading throughput increased by 2-3 times.
  • Scheduling Scale: A single Ray cluster can schedule fine-grained tasks with up to 40,000 actors.
  • Precision Quantization: The INT8-quantized DeepSeek model delivers a 50% performance improvement over the FP8 version.

Multimodal Data Processing and Distributed Fine-Tuning

Scenario: Creating personalized, interactive multimodal experiences.

Solution: With AnalyticDB Ray at the core, integrate Lance to strengthen distributed processing and structuring of image-text data with Ray Data. In addition, integrate LLaMA-Factory to provide distributed fine-tuning of the Qwen-VL multimodal model on Ray.


Benefits:

  • One-stop solution: Covers the full path from data labeling to model fine-tuning.
  • Improved fine-tuning efficiency: Distributed fine-tuning with llama-factory on Ray improves efficiency by 3 to 5 times.

Official documentation: https://www.alibabacloud.com/help/en/analyticdb/analyticdb-for-mysql/user-guide/managed-ray-service
