Moonshot AI is an enterprise dedicated to the R&D and application of artificial intelligence technologies. Its core product, Kimi, builds on a self-developed large language model to provide an efficient and accurate AI assistant. With capabilities such as multi-round conversation, long-context comprehension, and cross-domain inference, Kimi is widely used for AI-powered search, data analysis, and content generation, and is popular with professionals and enterprises.
As a startup company specializing in foundation models, Moonshot AI requires cost-effective, elastic, and flexible CPU and GPU computing power to accelerate the training iteration of large models and to meet the business requirements of model data preprocessing.
Data is a key input to large language model training and plays a pivotal role in model performance, so high-quality data is crucial for foundation model developers like Moonshot AI. Model data preprocessing must cleanse large volumes of multimodal data, including text, image, audio, and video. In Moonshot AI's existing architecture, data preprocessing tasks ran on self-managed Ray and Spark frameworks. After a period of verification and operation, the following pain points and challenges emerged:
1. Stability challenges in large-scale clusters: Large-scale Spark/Ray clusters are unstable, making it difficult to reliably execute TB-level data preprocessing tasks.
2. Inefficient resource elasticity: Computing resources cannot be scaled quickly on demand to handle bursty tasks and fluctuating data volumes, which results in either wasted computing resources or task delays. For example, cluster resources may be over-allocated when processing small-scale data, yet cannot be scaled up in time when a large-scale task arrives, which seriously compromises processing timeliness. The Moonshot AI team therefore wants on-demand, extreme elasticity for short-duration tasks to better adapt to dynamically changing workloads.
3. Lack of observability systems and flexible scheduling mechanisms: The teams that consume data resources span different domains, so resources must be allocated and task priorities configured to keep high-priority tasks running stably. However, without task-level monitoring, it is difficult to observe task progress and resource usage in real time. Scheduling policies are therefore adjusted late, which further degrades the overall operating efficiency of the cluster.
To address the critical challenges of Moonshot AI in large-scale cluster stability, resource elasticity efficiency, observability, and flexible scheduling, Alibaba Cloud proposes a solution centered on Alibaba Cloud Container Service for Kubernetes (ACK), featuring deep optimizations for Ray and Spark tasks.
As a fully managed, O&M-free service, ACK resolves the stability issues of large-scale clusters. ACK also draws on its extreme elasticity and diverse computing power to scale tasks automatically and on demand. In addition, with the observability tools and flexible scheduling mechanism integrated into ACK, Moonshot AI can monitor tasks and dynamically allocate resources, which enables flexible priority-based scheduling across different business teams. This solution has brought significant efficiency improvements to Moonshot AI and substantially reduced costs.
ACK provides managed KubeRay components that are optimized for stability based on the open-source community versions, eliminating the need to install and maintain open-source KubeRay on your own. You only need to focus on combining CRDs such as RayCluster and RayJob with your own business or workflow engines, which greatly reduces maintenance effort. KubeRay can be quickly deployed in ACK clusters and is widely used by data scientists, ML engineers, platform engineers, and developers.
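As a rough illustration of what working with the RayJob CRD looks like, the following minimal sketch builds a RayJob manifest as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The helper name build_ray_job, the job name, the image tag, and the entrypoint script are hypothetical placeholders, not Moonshot AI's actual configuration. The field layout follows the open-source KubeRay ray.io/v1 API as we understand it and should be checked against the KubeRay version installed in your cluster.

```python
import json


def build_ray_job(name: str, image: str, entrypoint: str, workers: int) -> dict:
    """Return a RayJob manifest as a plain dict (hypothetical helper)."""

    def pod_template(container_name: str) -> dict:
        # Minimal pod template; real jobs would also set resource requests.
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": entrypoint,
            # Delete the RayCluster once the job finishes to release resources.
            "shutdownAfterJobFinishes": True,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": pod_template("ray-head"),
                },
                "workerGroupSpecs": [
                    {
                        "groupName": "workers",
                        "replicas": workers,
                        "minReplicas": workers,
                        "maxReplicas": workers,
                        "rayStartParams": {},
                        "template": pod_template("ray-worker"),
                    }
                ],
            },
        },
    }


# All values below are illustrative placeholders.
ray_job = build_ray_job(
    name="text-dedup",
    image="rayproject/ray:2.9.0",
    entrypoint="python /app/dedup.py",
    workers=4,
)
print(json.dumps(ray_job, indent=2))
```

Generating the manifest in code makes it straightforward for a workflow engine to template the entrypoint and worker count for each preprocessing run.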
Following the common operator pattern in Kubernetes, Spark Operator defines two CRDs: SparkApplication and ScheduledSparkApplication. Users can submit Spark jobs by writing YAML manifests instead of manually constructing lengthy spark-submit commands. As a result, Spark Operator is widely used in various data preprocessing tasks.
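To make the contrast with spark-submit concrete, here is a minimal sketch that assembles a SparkApplication manifest in Python and prints it as JSON, which kubectl accepts in place of YAML. The helper build_spark_application, the image name, the script path, the resource sizes, and the service account name are assumptions for illustration only. The field names follow the open-source Spark Operator v1beta2 API.

```python
import json


def build_spark_application(name: str, image: str, main_file: str, executors: int) -> dict:
    """Return a SparkApplication manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",
            # The driver needs a service account that is allowed to create executor pods.
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"instances": executors, "cores": 2, "memory": "4g"},
        },
    }


# All values below are illustrative placeholders.
spark_app = build_spark_application(
    name="clean-text",
    image="example.registry/spark:3.5.0",
    main_file="local:///opt/jobs/clean_text.py",
    executors=10,
)
print(json.dumps(spark_app, indent=2))
```

The whole job definition lives in one declarative object, so it can be reviewed, diffed, and stored like any other Kubernetes resource.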

Alibaba Cloud Container Service for Kubernetes (ACK) offers the ack-spark-operator component to meet your needs for Spark on container clusters. You can quickly build a Spark computing cluster without the need to maintain Spark Operator. In addition, the ack-spark-operator component offers several advantages over the open-source spark-operator:
● It integrates with other components of Alibaba Cloud, such as the ack-kube-queue job queuing feature.
● It uses webhooks to supplement Kubernetes features that native Spark does not support, including tolerations and nodeSelector for the driver and executor pods, and volume mounting.
● It integrates with workflow orchestration frameworks such as Airflow and Argo.
● It lets you manage versions of Spark job YAML files with Git and similar systems.
As data development efforts continue, the scope of data processing and the number of team members grow rapidly, and challenges such as inefficient data exploration and a lack of interactivity in development have gradually emerged. To address these issues, Alibaba Cloud provides the ability to deploy Zeppelin in ACK, enabling continuous and efficient data processing at Moonshot AI.
Apache Zeppelin is a notebook tool for interactive big data analytics and visualization. It can be used to access, discover, transform, analyze, and visualize data. Its front end provides rich visualization libraries, and its back end integrates common interpreters, such as Spark, Flink, JDBC, Markdown, and Shell, as plug-ins. This allows data analysts to easily develop and explore data with SQL statements in Zeppelin notebooks.

When dealing with large-scale data, it is crucial to allocate computing resources reasonably. In the daily Spark jobs of the Moonshot AI team, computing resources cover ECS fixed instances, ACS elastic resources, and ACS BestEffort computing power, which can be flexibly selected according to requirements.
Alibaba Cloud Container Compute Service (ACS) provides a fundamental pod runtime environment for Kubernetes. You can dynamically schedule the driver and executor pods in Spark jobs to ACS to implement serverless Spark execution. Meanwhile, each ACS container instance is completely isolated based on the lightweight virtualization sandbox technology to ensure that the container instances do not interfere with each other.
In specific task scenarios, Moonshot AI widely adopts ACS BestEffort computing power, which not only ensures the efficient execution of tasks but also significantly saves computing costs. BestEffort is a flexible resource scheduling policy that is suitable for short-duration jobs and stateless applications with high scalability and fault tolerance.
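The sketch below shows one way such scheduling could be expressed: it adds pod labels to the driver and executor sections of a SparkApplication-style dictionary so that executors request BestEffort computing power while the driver stays on the default QoS. The label keys and values (alibabacloud.com/acs, alibabacloud.com/compute-class, alibabacloud.com/compute-qos) reflect our understanding of the ACS documentation and are assumptions to verify before use. The helper schedule_on_acs and the sample application are hypothetical.

```python
import copy

# Assumed ACS label keys; verify against the current ACS documentation.
ACS_LABEL = "alibabacloud.com/acs"
COMPUTE_CLASS_LABEL = "alibabacloud.com/compute-class"
COMPUTE_QOS_LABEL = "alibabacloud.com/compute-qos"


def schedule_on_acs(app: dict, executor_qos: str = "best-effort") -> dict:
    """Return a copy of the app whose driver and executor pods carry ACS labels."""
    patched = copy.deepcopy(app)
    # Keep the driver on the default QoS: losing it fails the whole job,
    # while a lost executor only causes its tasks to be rescheduled.
    for role, qos in (("driver", "default"), ("executor", executor_qos)):
        labels = patched["spec"].setdefault(role, {}).setdefault("labels", {})
        labels.update(
            {
                ACS_LABEL: "true",
                COMPUTE_CLASS_LABEL: "general-purpose",
                COMPUTE_QOS_LABEL: qos,
            }
        )
    return patched


# Minimal stand-in for a SparkApplication spec (illustrative values).
app = {"spec": {"driver": {"cores": 1}, "executor": {"instances": 10}}}
acs_app = schedule_on_acs(app)
print(acs_app["spec"]["executor"]["labels"])
```

Placing only the fault-tolerant executors on BestEffort capacity matches the characteristics described above: short-lived, stateless work that tolerates being rescheduled.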
Deploying Spark jobs on ACS delivers the following significant benefits:
● Ultra-large capacity: You can create more than 50,000 pods in an ACK cluster without additional configuration or cluster capacity planning.
● Scaling within seconds: You can create thousands of pods in a very short period to deliver massive computing power with guaranteed low latency of pod creation during peak hours.
● Cost saving: Pods are created on demand and billed on a pay-as-you-go basis, which avoids paying for idle resources. Spot instances are also available to further reduce costs.
Massive computing power must be matched with efficient data storage and retrieval capabilities to truly achieve efficient execution of large-scale data processing. Alibaba Cloud Object Storage Service (OSS), with its high performance, high availability, and stability, ensures the storage and rapid processing of large amounts of data in large-scale data preprocessing scenarios in Moonshot AI.
In practical tasks, OSS delivers outstanding concurrency and bandwidth. QPS can reach hundreds of thousands, and read/write bandwidth over the internal network reaches the Tbit/s level. This effectively handles highly concurrent read/write requests for massive data and ensures smooth data processing.
The Moonshot AI team uses the throttling group capability of OSS resource pools to adjust bandwidth at the bucket level. Bandwidth is dynamically allocated across different requesters as needed, so that key services and compute-intensive tasks receive sufficient resources first during high-load periods. In addition, with throttling events configured, a notification is promptly sent to the task administrator when the preset threshold is reached, which helps keep data processing tasks running stably.
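The allocation idea can be illustrated with a small, self-contained model. This is not the OSS API or its actual algorithm. It is a simplified sketch that assumes strict priority ordering, and the requester names, bandwidth figures, and 90% alert threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Requester:
    name: str
    priority: int  # lower number means more important
    demand_gbps: float


def allocate_bandwidth(requesters, capacity_gbps):
    """Grant bandwidth in priority order until the pool capacity runs out."""
    remaining = capacity_gbps
    grants = {}
    for requester in sorted(requesters, key=lambda r: r.priority):
        granted = min(requester.demand_gbps, remaining)
        grants[requester.name] = granted
        remaining -= granted
    return grants


def throttling_alert(grants, capacity_gbps, threshold=0.9):
    """Return True when granted bandwidth reaches the alert threshold."""
    return sum(grants.values()) >= threshold * capacity_gbps


# Hypothetical requesters sharing a 100 Gbit/s pool.
requesters = [
    Requester("ad-hoc", priority=2, demand_gbps=20.0),
    Requester("model-training", priority=0, demand_gbps=60.0),
    Requester("preprocessing", priority=1, demand_gbps=50.0),
]
grants = allocate_bandwidth(requesters, capacity_gbps=100.0)
alert = throttling_alert(grants, capacity_gbps=100.0)
print(grants, alert)
```

In this toy run the highest-priority requester receives its full demand, the next one receives what is left, and the alert fires because the pool is saturated.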
For data security, the Moonshot AI team relies on OSS versioning to trace file content changes and prevent data loss. When versioning is configured for a bucket, existing objects in the bucket are stored as previous versions when they are overwritten or deleted. If data is accidentally deleted or overwritten, you can restore objects in the bucket to a previous version at any time. This ensures data integrity and availability.
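The semantics described above can be sketched with a toy in-memory model. It is not the OSS SDK: real OSS identifies versions by version IDs rather than list indexes, and the class VersionedBucket and the object key below are hypothetical. The model only shows why an overwrite or delete does not destroy earlier content.

```python
class VersionedBucket:
    """Toy in-memory model of a bucket with versioning enabled."""

    def __init__(self):
        # key -> list of versions, oldest first; None represents a delete marker.
        self._versions = {}

    def put(self, key, data):
        # Overwriting appends a new version; older versions are kept.
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        # Deleting adds a delete marker instead of erasing history.
        self._versions.setdefault(key, []).append(None)

    def get(self, key):
        history = self._versions.get(key, [])
        if not history or history[-1] is None:
            raise KeyError(key)
        return history[-1]

    def history(self, key):
        return list(self._versions.get(key, []))

    def restore(self, key, index):
        """Make an earlier version current again by copying it on top."""
        data = self._versions[key][index]
        if data is None:
            raise ValueError("cannot restore a delete marker")
        self.put(key, data)


bucket = VersionedBucket()
bucket.put("corpus/part-0001.jsonl", b"v1")
bucket.put("corpus/part-0001.jsonl", b"v2")  # overwrite
bucket.delete("corpus/part-0001.jsonl")      # accidental delete
bucket.restore("corpus/part-0001.jsonl", 0)  # bring back the first version
print(bucket.get("corpus/part-0001.jsonl"))
```

Because restoring is itself recorded as a new version, the full change history stays traceable.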
In practical application scenarios, you can create a RayCluster by submitting a RayJob to process datasets. Monitoring RayCluster performance is important in this process. A native RayCluster provides dashboards for viewing metrics, but they require you to manually deploy open-source Prometheus and Grafana. With this approach, it is difficult to ensure component stability, and no professional O&M support is available.
The Ray on ACK solution is deeply integrated with Managed Service for Prometheus and provides a monitoring dashboard customized for RayCluster. Users only need to install the ack-prometheus component and create the corresponding ACK Pod Monitor and Service Monitor resources to collect RayCluster metrics and view them in the dedicated RayCluster monitoring dashboard.
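For readers unfamiliar with Pod Monitor resources, the following sketch builds one for the head pod of a RayCluster. It assumes the Prometheus Operator PodMonitor schema (monitoring.coreos.com/v1), the ray.io/cluster and ray.io/node-type pod labels that KubeRay applies, and a container port named metrics. The resource that ACK actually expects may differ, so treat the helper build_ray_pod_monitor and every value here as illustrative.

```python
import json


def build_ray_pod_monitor(cluster_name: str, namespace: str) -> dict:
    """Return a PodMonitor manifest that scrapes a RayCluster head pod (hypothetical helper)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": f"{cluster_name}-head-monitor", "namespace": namespace},
        "spec": {
            # Select only the head pod of the named cluster.
            "selector": {
                "matchLabels": {
                    "ray.io/cluster": cluster_name,
                    "ray.io/node-type": "head",
                }
            },
            # Assumes the pod exposes a port named "metrics".
            "podMetricsEndpoints": [
                {"port": "metrics", "path": "/metrics", "interval": "30s"}
            ],
        },
    }


pod_monitor = build_ray_pod_monitor("text-dedup-cluster", "default")
print(json.dumps(pod_monitor, indent=2))
```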

By virtue of the tight integration with cloud services, the Ray on ACK solution provides users with highly available RayCluster monitoring capabilities, greatly simplifying the operation and maintenance procedures and enhancing the stability and reliability of the production environment.
Both Spark and Ray provide feature-rich Web UIs for monitoring and displaying the execution status of jobs. Users can easily view the running status of jobs through the Web UI, including key information such as running and completed jobs as well as submission time, execution duration, and progress details of each job.
ACK marketplace offers ack-spark and ray-history-server components that support HDFS, OSS, and OSS-HDFS as log storage backends. The Moonshot AI team writes event logs to Object Storage Service (OSS) across various jobs. Then, the team can configure the same OSS path in History Server to parse the logs and present them in the web UI.
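The key point is that jobs and the History Server must agree on one log location. A minimal sketch of that pairing uses the standard Spark properties spark.eventLog.enabled, spark.eventLog.dir, and spark.history.fs.logDirectory. The bucket path and helper names are hypothetical, and the sketch assumes an OSS-compatible Hadoop file system connector and credentials are already configured.

```python
def event_log_settings(log_dir: str):
    """Return (job_conf, history_conf) that point at the same event log directory."""
    job_conf = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
    }
    history_conf = {"spark.history.fs.logDirectory": log_dir}
    return job_conf, history_conf


def to_submit_args(conf: dict) -> list:
    """Render a conf dict as spark-submit style --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


LOG_DIR = "oss://example-bucket/spark/event-logs"  # hypothetical path
job_conf, history_conf = event_log_settings(LOG_DIR)
print(to_submit_args(job_conf))
print(history_conf)
```

Deriving both configurations from a single constant avoids the common mistake of jobs writing to one path while the History Server reads from another.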

When you run a Spark job in an ACK cluster, a large number of logs are generated and scattered across different pods, which complicates log management. The Moonshot AI team uses Alibaba Cloud Simple Log Service (SLS) and its end-to-end log collection, processing, query, and analysis capabilities to efficiently manage Spark logs.
Spark job logs include event logs, engine logs, and business logs. Event logs are written to OSS by using a specific SDK so that Spark History Server can parse and render them. Engine logs and business logs are output to both the console and specified files for subsequent query and analysis.
Spark jobs use both ECS and ACS resources whose log collection methods are different. You can create the AliyunConfig resources provided by Simple Log Service to unify the log collection configurations for ECS and ACS pods, which further improves the convenience and efficiency of log management.
Through a series of solutions, Moonshot AI greatly simplifies the log collection and analysis process, enabling developers to focus on the optimization of business logic rather than spending excessive time on log management. The unified log collection configuration improves the system stability and reliability of Moonshot AI, establishing a solid foundation for the preprocessing of massive data.
Through Alibaba Cloud's fully managed containerization solution, Moonshot AI has comprehensively upgraded its data processing capabilities, resource utilization efficiency, and O&M management:
● Improved stability: With the Ray and Spark components fully managed by ACK, single tasks that process terabytes of data run with 99.95% stability. Fault location efficiency is improved by 60%, ensuring the continuity of data supply for model training.
● Significant elasticity improvement: The hybrid architecture (ECS + ACS) shortens the resource supply time for short-duration tasks to seconds, increases resource utilization for bursty tasks threefold, and reduces overall computing costs by 45%.
● Refined resource management: Based on the resource pool throttling capability and priority-based scheduling policy of OSS, the SLA compliance rate of high-priority tasks exceeds 99%, and the resource contention rate is reduced by 95%.
● End-to-end observability: The cloud-native monitoring system (Prometheus + SLS) visualizes end-to-end metrics, improves the response speed of anomaly detection by 50%, and improves log analysis efficiency by 40%.
The containerization solution of Alibaba Cloud ACK + ACS has brought a qualitative leap for our large model data preprocessing. With greatly improved stability and efficient resource elasticity, costs are remarkably reduced, and O&M management becomes much easier. We look forward to further in-depth cooperation with Alibaba Cloud to jointly explore more efficient and low-cost data processing solutions and promote the continuous progress of large model technology.
-- James Wang, Technical Expert of Moonshot AI