Introduction to the Ray distributed framework - AnalyticDB

The demand for processing multimodal data, such as text, images, audio, and video, is growing. AnalyticDB for MySQL introduces AnalyticDB Ray, which integrates extract, transform, and load (ETL) and machine learning (ML) for multimodal data. This integration improves the efficiency of AI pipelines and enables a seamless transition from data to intelligent decisions.

What is AnalyticDB Ray?

Open-source Ray is a distributed computing framework designed for AI and high-performance computing. Its simple API abstracts distributed scheduling: you invoke remote resources the same way you call local functions, so a single-node task can scale to a cluster of thousands of nodes with a few lines of code. Ray's built-in libraries, such as Ray Tune, Ray Train, and Ray Serve, integrate with TensorFlow and PyTorch. With active support from the open-source community and from companies such as Anyscale, Ray has become an important tool for building AI applications.
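The following minimal sketch illustrates the "call remote resources like local functions" idea with the open-source Ray Python API. The function names `square` and `run_on_ray` are hypothetical and chosen only for illustration. The sketch assumes Ray is installed and, with no cluster address configured, starts a local Ray instance. It does not show anything specific to AnalyticDB Ray.

```python
from typing import Iterable, List


def square(x: int) -> int:
    """An ordinary local function with nothing Ray-specific in it."""
    return x * x


def run_on_ray(values: Iterable[int]) -> List[int]:
    """Run square() as Ray tasks and collect the results in input order."""
    # Imported here so that square() stays usable where Ray is not installed.
    import ray

    # Starts a local Ray instance when no cluster address is configured.
    ray.init(ignore_reinit_error=True)

    # Wrapping the function is equivalent to decorating it with @ray.remote.
    remote_square = ray.remote(square)

    # .remote() returns immediately with an object reference (a future).
    futures = [remote_square.remote(v) for v in values]

    # ray.get() blocks until all tasks finish and returns their results.
    return ray.get(futures)
```

Calling `run_on_ray(range(4))` schedules four tasks and returns their results as a list. The only differences from a local call are the `ray.remote` wrapper and the `.remote()` invocation, which is what lets the same code move from one machine to a cluster.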

While open-source Ray provides highly flexible distributed computing capabilities, enterprises still face challenges in production environments, such as distributed job optimization, fine-grained resource scheduling, complex cluster operations, system stability, and high availability.

To address these challenges, AnalyticDB for MySQL introduces AnalyticDB Ray, a fully managed Ray service built on the rich ecosystem of open-source Ray. It has been validated in typical scenarios such as multimodal processing, embodied AI, search and recommendation, and financial risk control. AnalyticDB Ray enhances the Ray kernel and service capabilities, optimizes kernel performance, simplifies cluster O&M, and integrates with the AnalyticDB for MySQL lakehouse platform. This helps enterprises build an integrated Data+AI architecture and accelerate the large-scale adoption of AI.

Advantages of AnalyticDB Ray

Ease of use
- Automatic RayCluster creation: The console provides a one-click deployment feature. You can create a RayCluster by creating an AI-type resource group and configuring the resource specifications for the head and worker nodes.
- Built-in large model fine-tuning and inference toolchain: Includes tools for one-click reinforcement learning, distillation, fine-tuning, inference, and evaluation of large language models (LLMs).
- Built-in embodied AI toolchain: AnalyticDB Ray serves as a foundational resource scheduler for the Python ecosystem and supports tools such as Cosmos, NeMo Curator, and GR00T N1 for data simulation, synthesis, and model fine-tuning.
Ecosystem integration
- Lance: Supports storing multimodal data.
- LLaMA-Factory: Supports distributed fine-tuning on the Ray platform.
- Spark: Supports running Spark on Ray through RayDP to enable hybrid resource deployment.
Cost-effectiveness
- Multi-tenancy and multi-job resource isolation: Provides resource isolation and sharing between tenants and jobs using vClusters and shared resource groups.
- Deep Data+AI integration: AnalyticDB natively supports petabyte-scale data storage and analysis. When combined with Ray, it provides an end-to-end pipeline for data processing, multimodal feature engineering, and model inference. The ability to share resources among Ray, AnalyticDB real-time analysis workloads, and Spark significantly improves resource utilization.
- Auto-scaling: Automatically scales GPU and CPU resources up or down based on the workload. It also supports low-cost Spot resources.
- Elastic cache: Allows you to flexibly build cache service resources based on Ray's data read and write volumes and bandwidth requirements.
- Fine-grained resource scheduling: Automatically schedules resources based on node utilization. It also provides isolation mechanisms for GPU oversubscription in multi-tenant environments and supports affinity and anti-affinity scheduling policies between tasks.
Stability and high availability
- Non-disruptive migration and self-healing: Supports non-disruptive rolling upgrades for clusters and provides automatic recovery for abnormal nodes.
- High availability: Supports an active-standby head node configuration.
Observability
- Monitoring dashboards: Provides persistent task dashboards and unified observability management across multiple clusters.