Moonshot AI is an innovative enterprise specializing in the research, development, and application of artificial intelligence technologies. Its core product, Kimi, relies on the company's self-developed large language model to provide users with efficient and accurate AI assistant services. With capabilities such as multi-turn conversation, long-text comprehension, and cross-domain reasoning, Kimi is widely used in scenarios like AI search, data analysis, and content generation, and is popular among professional users and enterprises.
As a startup focused on foundation models, Moonshot AI needs cost-effective, elastic CPU and GPU computing power for model data preprocessing so that it can train and iterate on large models faster.
Data is a key input to large model training, and its quality directly affects model performance. High-quality data is therefore crucial for large model companies like Moonshot AI. Model data preprocessing cleans large volumes of text and multimodal data, including image, audio, and video formats. In Moonshot AI's original architecture, data preprocessing tasks ran on self-managed Ray and Spark frameworks. After a period of verification and operation, the team encountered the following pain points and challenges:
1. Stability challenges of large-scale clusters: Large-scale self-managed Spark and Ray clusters were unstable, making it difficult to ensure that TB-level data preprocessing tasks ran reliably.
2. Inefficient resource elasticity: When unexpected tasks arrived or data volume fluctuated, computing resources could not be scaled quickly on demand, resulting in either wasted resources or delayed tasks. For example, cluster resources could be over-allocated when processing small-scale data, yet could not be scaled up in time when a large-scale task arrived, which seriously affected task timeliness. The Moonshot AI team therefore wanted on-demand, extreme elasticity for short-term tasks to better adapt to dynamically changing workloads.
3. Lack of observability and flexible scheduling: The data is used by teams working in different directions, so resources and task priorities must be allocated across teams to keep high-priority tasks running stably. However, without task-level monitoring, it was difficult to observe task progress and resource usage in real time. As a result, scheduling policies could not be adjusted in time, which further reduced the overall efficiency of the cluster.
To address the pain points of Moonshot AI in terms of large-scale cluster stability, resource elasticity efficiency, observability, and flexible scheduling, Alibaba Cloud proposes a solution centered around Alibaba Cloud Container Service for Kubernetes (ACK) to deeply optimize Ray and Spark tasks.
ACK is fully managed and O&M-free, which resolves the stability issues of large-scale clusters. Its extreme elasticity and diverse computing power enable on-demand auto scaling of tasks. In addition, with the observability tools and flexible scheduling mechanism integrated into ACK, Moonshot AI can monitor tasks and dynamically allocate resources, meeting the scheduling and task priority requirements of different business teams. This solution has brought significant efficiency improvements to Moonshot AI and greatly reduced costs.
ACK provides managed KubeRay components that are optimized for stability based on open-source community versions. You do not need to install and maintain open-source KubeRay on your own. All you need to do is focus on how to combine CRDs such as RayCluster and RayJob with your own business or workflow engines. This greatly simplifies your maintenance efforts. KubeRay can be quickly deployed in ACK clusters. It is also widely used by data scientists, ML engineers, platform engineers, and developers.
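As a rough illustration of what combining the RayJob CRD with a workflow engine can look like, the following Python sketch builds a RayJob manifest as a plain dictionary that could then be serialized and submitted. The helper name `build_rayjob`, the job name, the image tag, the entrypoint script, and the worker group name are all placeholders chosen for this example, not values from Moonshot AI's setup, and the exact fields accepted depend on the KubeRay version in use.

```python
import json


def build_rayjob(name, image, entrypoint, workers):
    """Return a RayJob manifest as a dict (hypothetical helper for illustration)."""
    container = {"name": "ray", "image": image}
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            # Command executed on the cluster once it is ready.
            "entrypoint": entrypoint,
            # Tear the RayCluster down when the job ends.
            "shutdownAfterJobFinishes": True,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {
                        "spec": {"containers": [dict(container, name="ray-head")]}
                    },
                },
                "workerGroupSpecs": [
                    {
                        "groupName": "preprocess-workers",  # placeholder name
                        "replicas": workers,
                        "minReplicas": 0,
                        "maxReplicas": workers,
                        "rayStartParams": {},
                        "template": {
                            "spec": {
                                "containers": [dict(container, name="ray-worker")]
                            }
                        },
                    }
                ],
            },
        },
    }


if __name__ == "__main__":
    # All argument values below are assumptions for the example.
    manifest = build_rayjob(
        "clean-text", "rayproject/ray:2.9.0", "python clean.py", workers=4
    )
    print(json.dumps(manifest, indent=2))
```

A workflow engine could generate one such manifest per dataset shard and apply it with its usual Kubernetes client.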
Based on the common operator pattern in Kubernetes, Spark Operator defines two CRDs: SparkApplication and ScheduledSparkApplication. It allows users to submit Spark jobs by writing YAML manifests instead of manually constructing lengthy spark-submit commands. As a result, Spark Operator is widely used in data preprocessing tasks.
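To make the contrast with spark-submit concrete, here is a minimal Python sketch that assembles a SparkApplication manifest as a dictionary. The helper name `build_spark_application`, the image, the file path, the Spark version, the service account name, and the resource sizes are illustrative assumptions only; a real job would use values that match the cluster.

```python
import json


def build_spark_application(name, image, main_file, executors):
    """Return a SparkApplication manifest as a dict (hypothetical helper)."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",  # assumed version for the example
            "restartPolicy": {"type": "Never"},
            "driver": {
                "cores": 1,
                "memory": "2g",
                "serviceAccount": "spark",  # assumed service account name
            },
            "executor": {"instances": executors, "cores": 2, "memory": "4g"},
        },
    }


if __name__ == "__main__":
    app = build_spark_application(
        "dedup-text",                      # placeholder job name
        "example.com/spark:3.5.0",         # placeholder image
        "local:///opt/jobs/dedup.py",      # placeholder path inside the image
        executors=10,
    )
    print(json.dumps(app, indent=2))
```

The resulting document carries the same information as a spark-submit command line, but as a declarative resource that the operator reconciles.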
Alibaba Cloud Container Service for Kubernetes (ACK) offers the ack-spark-operator component to meet your needs for Spark on container clusters. You can quickly build a Spark computing cluster without the need to maintain Spark Operator. In addition, the ack-spark-operator component has the following advantages over the open-source spark-operator:
● It integrates with other components of Alibaba Cloud, such as the ack-kube-queue job queuing feature.
● It uses webhooks to supplement Kubernetes features that native Spark does not support, including tolerations, nodeSelector, and volume mounting for driver and executor pods.
● It integrates with workflow orchestration frameworks such as Airflow and Argo.
● It lets you manage versions of Spark job YAML files with a version control system such as Git.
As data development continues, both the scope of data processing and the number of team members grow rapidly. Problems such as inefficient data exploration and a lack of interactivity in development have gradually emerged. To address these issues, Alibaba Cloud supports deploying Zeppelin in ACK for continuous and efficient data processing at Moonshot AI.
Apache Zeppelin is a notebook tool for interactive big data analytics and visualization. It can be used to access, discover, transform, analyze, and visualize data. Its front end provides rich visualization libraries, and its back end integrates common interpreters, such as Spark, Flink, JDBC, Markdown, and Shell, as plug-ins. This allows data analysts to work with data in Zeppelin notebooks using SQL statements.
When dealing with large-scale data, it is crucial to allocate computing resources reasonably. In the daily Spark jobs of the Moonshot AI team, computing resources cover ECS fixed instances, ACS elastic resources, and ACS BestEffort computing power, which can be flexibly selected according to requirements.
Alibaba Cloud Container Compute Service (ACS) provides a basic pod runtime environment for Kubernetes. You can dynamically schedule the driver and executor pods in Spark jobs to ACS to implement serverless Spark job execution. Meanwhile, each ACS container instance is completely isolated based on the lightweight virtualization security sandbox technology to ensure that the container instances do not interfere with each other.
In some task scenarios, Moonshot AI widely adopts ACS BestEffort computing power, which not only ensures the efficient execution of tasks but also effectively saves computing costs. BestEffort is a flexible resource scheduling policy that is suitable for short-running jobs and stateless applications with high scalability and fault tolerance.
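The choice between the three kinds of computing power can be pictured as a simple policy function. The sketch below is purely illustrative: the `Task` fields, the tier names, and the 60-minute cutoff are assumptions made for this example and do not describe Moonshot AI's actual scheduling rules or any ACS API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    expected_minutes: int
    fault_tolerant: bool  # can be retried safely if a pod is reclaimed
    steady: bool          # runs continuously rather than in bursts


def choose_compute(task, short_job_minutes=60):
    """Pick a compute tier for a task (hypothetical policy for illustration)."""
    if task.steady:
        # Long-lived, predictable load fits fixed ECS capacity.
        return "ecs-fixed"
    if task.fault_tolerant and task.expected_minutes <= short_job_minutes:
        # Short, retryable work is the case the text describes for BestEffort.
        return "acs-best-effort"
    # Bursty work that cannot tolerate interruption uses regular elastic pods.
    return "acs-elastic"


if __name__ == "__main__":
    tasks = [
        Task("nightly-ingest", 600, fault_tolerant=False, steady=True),
        Task("dedup-shard", 20, fault_tolerant=True, steady=False),
        Task("adhoc-merge", 180, fault_tolerant=False, steady=False),
    ]
    for t in tasks:
        print(t.name, "->", choose_compute(t))
```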
Deploying Spark jobs on ACS also has the following significant benefits:
● Ultra-large capacity: You can create more than 50,000 pods in an ACK cluster without the need to add additional configurations or design the size of the cluster.
● Scaling within seconds: You can create thousands of pods in a very short period to deliver a large amount of computing power without worrying about the latency of pod creation during peak hours.
● Cost saving: Pods are created on demand and billed on a pay-as-you-go basis, which avoids waste from idle resources. Spot instances are also available to further reduce costs.
Massive computing power must be matched with efficient data storage and reading capabilities to truly achieve efficient execution of large-scale data processing. Alibaba Cloud Object Storage Service (OSS) provides high performance, high availability, and stability to ensure the storage and rapid processing of large amounts of data in large-scale data preprocessing scenarios in Moonshot AI.
In practice, OSS sustains very high concurrency and bandwidth. QPS can reach hundreds of thousands, and internal network read/write bandwidth reaches the Tbit/s level. This allows OSS to handle highly concurrent read/write requests on massive data and keeps data processing running smoothly.
The Moonshot AI team uses the throttling group capability of OSS resource pools to adjust bandwidth at the bucket level. By flexibly allocating bandwidth to different requesters as needed, the team ensures that key services and compute-intensive tasks receive sufficient resources first during high-load periods. The team also configures throttling events so that the task administrator is notified promptly when a preset threshold is reached, which helps keep data processing tasks running stably.
In terms of data security, the Moonshot AI team relies on the versioning feature of OSS to trace changes in file content and avoid data loss. You can configure versioning for a bucket so that existing objects are stored as previous versions when they are overwritten or deleted. If data is accidentally deleted or overwritten, versioning allows you to restore objects in the bucket to a previous version at any time. This ensures data integrity and availability.
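The idea of serving key requesters first can be sketched with a few lines of Python. This is a toy model of priority-ordered allocation, not the OSS throttling API: the function names, the tuple layout, and the numbers are all assumptions made for illustration.

```python
def allocate_bandwidth(total_gbps, requests):
    """Grant bandwidth in priority order (toy model, not an OSS API).

    requests: list of (name, priority, wanted_gbps); a lower priority
    number means a more important requester.
    """
    remaining = total_gbps
    grants = {}
    for name, _priority, wanted in sorted(requests, key=lambda r: r[1]):
        grant = min(wanted, remaining)
        grants[name] = grant
        remaining -= grant
    return grants


def throttled(grants, requests):
    """Return requesters that received less than they asked for."""
    return [name for name, _priority, wanted in requests if grants[name] < wanted]


if __name__ == "__main__":
    reqs = [("adhoc-analysis", 2, 80), ("training-feed", 1, 60)]
    grants = allocate_bandwidth(100, reqs)
    print(grants)
    # A real setup would send a notification here instead of printing.
    print("throttled:", throttled(grants, reqs))
```

In this model the high-priority requester is always served in full before lower-priority ones share what is left, which mirrors the behavior described above for high-load periods.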
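The overwrite, delete, and restore behavior described above can be modeled with a small in-memory class. This is only a conceptual sketch of versioning semantics: `VersionedBucket` and its methods are hypothetical names for this example and are not part of any OSS SDK.

```python
class VersionedBucket:
    """Toy in-memory model of a bucket with versioning enabled."""

    def __init__(self):
        # key -> list of versions, oldest first; None marks a deletion.
        self._versions = {}

    def put(self, key, data):
        # Overwriting appends a new version and keeps the old ones.
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        # Deleting adds a delete marker instead of erasing history.
        self._versions.setdefault(key, []).append(None)

    def get(self, key):
        history = self._versions.get(key, [])
        if not history or history[-1] is None:
            raise KeyError(key)
        return history[-1]

    def restore_previous(self, key):
        """Make the latest earlier non-deleted version current again."""
        history = self._versions.get(key, [])
        for data in reversed(history[:-1]):
            if data is not None:
                history.append(data)
                return data
        raise KeyError(key)


if __name__ == "__main__":
    bucket = VersionedBucket()
    bucket.put("corpus/part-0001.jsonl", b"v1")
    bucket.put("corpus/part-0001.jsonl", b"v2")
    bucket.delete("corpus/part-0001.jsonl")      # accidental deletion
    bucket.restore_previous("corpus/part-0001.jsonl")
    print(bucket.get("corpus/part-0001.jsonl"))
```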
In practical application scenarios, you can create a RayCluster by submitting a RayJob to process datasets. In this process, performance monitoring of RayCluster is of great importance. Native RayCluster provides dashboards for viewing metrics. However, you must manually deploy the open-source Prometheus and Grafana. This approach makes it difficult to ensure the stability of components and lacks the support of professional O&M personnel.
The Ray on ACK solution is deeply integrated with the monitoring capabilities of Managed Service for Prometheus and customizes a monitoring dashboard for RayCluster. Users only need to install the ack-prometheus component and corresponding ACK Pod Monitor and Service Monitor resources to quickly collect the metric data of RayCluster and display it visually through the dedicated RayCluster monitoring dashboard.
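For orientation, a Pod Monitor resource for Ray worker pods might look roughly like the dictionary built below. The sketch assumes the Prometheus Operator `PodMonitor` schema, the `ray.io/node-type` pod label applied by KubeRay, and a container port named `metrics`; the resource name and namespace are placeholders, and the managed Prometheus service on ACK may require additional labels or annotations that are not shown here.

```python
import json


def ray_worker_pod_monitor(namespace):
    """Return a PodMonitor manifest for Ray worker pods (illustrative sketch)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": "ray-workers-monitor", "namespace": namespace},
        "spec": {
            # Select worker pods by the label KubeRay is assumed to set.
            "selector": {"matchLabels": {"ray.io/node-type": "worker"}},
            # Scrape the port named "metrics" on each selected pod.
            "podMetricsEndpoints": [{"port": "metrics"}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(ray_worker_pod_monitor("default"), indent=2))
```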
By virtue of the tight integration of cloud products, the Ray on ACK solution provides users with highly available RayCluster monitoring capabilities, greatly simplifying the O&M process and enhancing the stability and reliability of the production environment.
Both Spark and Ray provide feature-rich Web UIs for monitoring and displaying the execution status of jobs. Users can easily view the running status of jobs through the Web UI, including key information such as running and completed jobs as well as submission time, execution time, and progress details of each job.
ACK marketplace provides ack-spark and ray-history-server components that support HDFS, OSS, and OSS-HDFS as log storage backends. The Moonshot AI team writes event logs to Object Storage Service (OSS) for various jobs. Then, the team can configure the same OSS path in the History Server to parse the logs and present them in the web UI.
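The key point is that the job and the History Server must point at the same log location. A minimal sketch of that pairing, using the standard Spark properties `spark.eventLog.enabled`, `spark.eventLog.dir`, and `spark.history.fs.logDirectory`, is shown below. The helper names and the bucket path are assumptions for illustration, and the OSS file system dependencies and credentials that a real job also needs are deliberately left out.

```python
def event_log_conf(oss_dir):
    """Spark job properties that write event logs to an OSS path."""
    if not oss_dir.startswith("oss://"):
        raise ValueError("expected an oss:// path, got %r" % oss_dir)
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": oss_dir,
    }


def history_server_conf(oss_dir):
    """History Server property that reads event logs from the same path."""
    return {"spark.history.fs.logDirectory": oss_dir}


if __name__ == "__main__":
    log_dir = "oss://example-bucket/spark/event-logs"  # placeholder path
    job_conf = event_log_conf(log_dir)
    server_conf = history_server_conf(log_dir)
    # Both sides must reference the same directory for jobs to show up.
    assert job_conf["spark.eventLog.dir"] == server_conf["spark.history.fs.logDirectory"]
    print(job_conf)
    print(server_conf)
```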
When you run a Spark job in an ACK cluster, a large number of logs are generated and distributed across different pods, which complicates log management. The Moonshot AI team uses the one-stop log collection, processing, query, and analysis capabilities of Alibaba Cloud Simple Log Service (SLS) to manage Spark logs efficiently.
Spark job logs include event logs, engine logs, and business logs. Event logs are written to OSS through an SDK so that Spark History Server can parse and render them. Engine logs and business logs are written to both the console and specified files for subsequent query and analysis.
Spark jobs use both ECS and ACS resources whose log collection methods are different. You can create the AliyunConfig resources provided by Simple Log Service to unify the log collection configurations of ECS and ACS pods, which further improves the convenience and efficiency of log management.
Through this series of solutions, Moonshot AI greatly simplifies the log collection and analysis process, enabling developers to focus on optimizing business logic without spending a lot of time on log management. The unified log collection configuration improves the system stability and reliability of Moonshot AI, providing a solid foundation for the preprocessing of massive data.
With Alibaba Cloud's fully managed container solution, Moonshot AI has improved its data processing capabilities, resource utilization efficiency, and O&M management:
● Improved stability: Based on the Ray and Spark components fully managed by ACK clusters, the stability of processing terabytes of data for a single task reaches 99.95%. The time efficiency of fault location is improved by 60%, ensuring the continuity of data supply for model training.
● Greatly enhanced elasticity efficiency: The hybrid architecture of ECS and ACS shortens the resource supply time for short-term tasks to seconds, increases resource utilization for unexpected tasks by three times, and reduces overall computing costs by 45%.
● Refined resource management: Based on the resource pool throttling capability and priority-based scheduling policy of OSS, the SLA compliance rate of high-priority tasks is increased to over 99%, and the resource conflict rate is decreased by 95%.
● End-to-end observability: The cloud-native monitoring system of Prometheus and SLS allows you to visualize end-to-end metrics. The response speed of anomaly detection is increased by 50%, and the log analysis efficiency is improved by 40%.
The container solution of Alibaba Cloud ACK + ACS has brought a qualitative leap for our large model data preprocessing. With greatly improved stability and efficient resource elasticity, costs are remarkably reduced, and O&M management becomes much easier. We look forward to further in-depth cooperation with Alibaba Cloud to jointly explore more efficient and low-cost data processing solutions and promote the continuous progress of large model technology.
-- James Wang, Technical Expert of Moonshot AI