
The Spark on ACK Practice of Hago

This article introduces Hago's practice of adopting Spark on ACK and its migration process.

By Xiang Hua

Hago is a star product released by JOYY for multi-person interaction and socializing. Hago provides a variety of services, such as interactive games, multi-person voice calls, live streaming, and 3D virtual image interaction, and is committed to creating an efficient, diverse, and immersive social entertainment experience for users. It has an extensive user base in Southeast Asia, the Middle East, and South America.

At the technical level, Hago has developed its own audio and video technology to deliver more stable, efficient, and high-quality digital services, including 3D hyperrealistic models, live-streamer model production, virtual-human voice and expression-driven services, natural-sounding text-to-speech (TTS), and mature virtual live streaming capabilities.

For a long time, Hago has been running big data tasks in its own IDC to support various products. In 2022, Hago decided to migrate its big data services to the cloud and adopted Spark on ACK for running them. This article will mainly focus on the migration process.

Problems Encountered in the IDC

Initially, Hago's Spark tasks were executed in Hadoop clusters within the IDC. During this time, Hago encountered several challenges:

  1. Resource limitation: When resources were insufficient, tasks would accumulate and queue up.
  2. Timeliness of scaling out: Scaling out the cluster involved complex logic, and for holidays and events, scaling had to be prepared one week in advance.
  3. Resource utilization rate: Spark tasks had distinct peaks and troughs, resulting in significant resource wastage during the troughs.

To address these challenges, Hago decided to migrate its big data business to the cloud in a cloud-native manner.

Implementation

Spark on Kubernetes has been officially supported since Spark 3.1. As a managed Kubernetes service, Alibaba Cloud Container Service for Kubernetes (ACK) offers higher performance and greater stability, making it an ideal base for running Spark on Alibaba Cloud. To achieve better elasticity, Hago opted for ACK Serverless clusters.

ACK Serverless


In an ACK Serverless cluster, you can deploy container applications directly without the need to purchase nodes. You don't have to worry about node maintenance and capacity planning for the cluster. Moreover, you will be charged based on the CPU and memory resources configured for the application. Thanks to the comprehensive Kubernetes compatibility and the simplified usage of ACK Serverless clusters, you can focus on your applications rather than managing the underlying infrastructure.

In addition, pods in ACK Serverless clusters run in a secure and isolated container runtime environment based on Alibaba Cloud Elastic Container Instances (ECI). Each pod container instance is completely isolated by lightweight virtualization security sandbox technology, ensuring that the instances do not affect each other.

ACK Serverless clusters show their elasticity advantage in scenarios with large-scale business spikes and batch task scheduling, such as Spark: they can deliver thousands of pods within 30 seconds.
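As a hedged illustration of what this looks like from the Spark side, the sketch below starts a client-mode Spark session against a Kubernetes API server so that executors are scheduled as pods. The API server endpoint, container image, and namespace are placeholders, not values from Hago's setup.

```scala
import org.apache.spark.sql.SparkSession

object SparkOnK8sDemo {
  def main(args: Array[String]): Unit = {
    // Client mode against a Kubernetes API server: the driver runs here,
    // and executors are scheduled as pods in the cluster.
    val spark = SparkSession.builder()
      .appName("spark-on-ack-demo")
      .master("k8s://https://<api-server-endpoint>:6443")                   // placeholder API server
      .config("spark.kubernetes.container.image", "<registry>/spark:3.1.1") // placeholder image
      .config("spark.kubernetes.namespace", "spark-jobs")                   // hypothetical namespace
      .config("spark.executor.instances", "4")
      .getOrCreate()

    // A trivial job to confirm that executor pods come up and do work.
    println(spark.range(1000000L).selectExpr("sum(id)").first())
    spark.stop()
  }
}
```

In production, the same settings are usually passed to spark-submit in cluster mode rather than hard-coded in the driver.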

However, a few issues needed to be resolved before Spark could run on ACK Serverless:

Storage and Computing Separation

As mentioned earlier, Spark tasks do not need to compute around the clock, but the storage must be retained. Building HDFS clusters on virtual machines would require a large amount of resident computing power and result in waste.

Hago therefore chose to separate storage from computing: data is stored in Object Storage Service (OSS) and exposed through an HDFS-compatible interface by the OSS-HDFS service, enabling convenient access from Spark tasks.

For more information, see Overview of the OSS-HDFS service [1].
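For illustration, the sketch below wires a Spark session to an OSS-HDFS bucket through JindoSDK. The class names, configuration keys, and endpoint format follow the JindoSDK documentation as best recalled and should be treated as assumptions to verify; the bucket, path, and column name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object OssHdfsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oss-hdfs-demo")
      // JindoSDK filesystem bindings (assumed key/class names; verify against
      // the JindoSDK version in your Spark image; credentials omitted here):
      .config("spark.hadoop.fs.oss.impl", "com.aliyun.jindodata.oss.JindoOssFileSystem")
      .config("spark.hadoop.fs.AbstractFileSystem.oss.impl", "com.aliyun.jindodata.oss.OSS")
      .config("spark.hadoop.fs.oss.endpoint", "<region>.oss-dls.aliyuncs.com") // placeholder OSS-HDFS endpoint
      .getOrCreate()

    // With the binding in place, oss:// paths behave like HDFS paths.
    val df = spark.read.parquet("oss://<bucket>/warehouse/events/") // hypothetical path
    df.groupBy("event_type").count().show()                         // hypothetical column
  }
}
```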

Selection of the Shuffle Service

Shuffle is a fundamental process in Spark and plays a critical role in the performance of Spark applications. The Spark community provides a default shuffle service [2], but it has some issues:

• Spark Shuffle relies on local storage, but in scenarios where storage and computing are decoupled and ECI is used, local disks are unavailable. This requires purchasing and attaching additional disks, which is neither cost-effective nor efficient.

• Spark implements dynamic allocation based on shuffle tracking, but executor reclamation is inefficient, since executors that still hold shuffle data cannot be released (see the configuration sketch after the next list).

The following scenarios may occur:

• Data spills in shuffle write tasks, resulting in write amplification.

• Connection resets caused by large numbers of small network packets in shuffle read tasks.

• High disk and CPU loads caused by large numbers of small, random I/O requests in shuffle read tasks.

• With thousands of mappers (M) and reducers (N), M × N connections are generated, making it nearly impossible for jobs to run.
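For reference, the shuffle-tracking approach mentioned in the first list uses only stock Spark 3.x settings, sketched below; the executor bounds are arbitrary example values. An executor that still holds tracked shuffle data is kept alive, which is why scale-in lags behind actual compute demand.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleTrackingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-tracking-demo")
      // Built-in dynamic allocation without an external shuffle service.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      // Example bounds only; executors with live shuffle data are kept
      // alive, so scale-in toward minExecutors can be slow.
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      .getOrCreate()

    // A wide transformation that produces shuffle data for tracking.
    spark.range(0L, 10000000L).repartition(200).count()
    spark.stop()
  }
}
```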

The Remote Shuffle Service (RSS) of E-MapReduce (EMR) addresses the issues above in the Spark Shuffle solution and supports dynamic allocation in the ACK environment.

For more information, see EMR Remote Shuffle Service [3].
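As a sketch of what the switch looks like on the Spark side, the configuration below routes shuffle data to a remote shuffle cluster using the open-source Apache Celeborn client that the EMR RSS documentation references. The shuffle-manager class and endpoint key are Celeborn names to verify against your version, and the master address is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object RemoteShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("remote-shuffle-demo")
      // Send shuffle data to a remote shuffle cluster instead of local disks.
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
      .config("spark.celeborn.master.endpoints", "<celeborn-master>:9097") // placeholder address
      .config("spark.shuffle.service.enabled", "false") // no external shuffle service needed
      .getOrCreate()

    // Any wide transformation now shuffles through the remote service.
    spark.range(0L, 10000000L).selectExpr("id % 100 as k")
      .groupBy("k").count().show()
    spark.stop()
  }
}
```

Because shuffle data lives off the executors, idle executors can be reclaimed immediately, which is what makes dynamic allocation effective on ECI, where no local disks are attached.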

Practical Effects

[Figure: general architecture diagram of the practice]

The general architecture of the practice is shown in the diagram above. The results have been ideal:

• Scale-out operations take only 30 seconds, so there is no need to prepare in advance.
• Tasks no longer need to be queued.
• IDC hardware failures are no longer a concern.

References

[1] Overview of the OSS-HDFS service
https://www.alibabacloud.com/help/en/oss/user-guide/overview-1
[2] shuffle service
https://github.com/lynnyuan-arch/spark-on-k8s/blob/master/resource-managers/kubernetes/architecture-docs/external-shuffle-service.md
[3] EMR Remote Shuffle Service
https://www.alibabacloud.com/help/en/emr/emr-on-ecs/user-guide/celeborn#task-2184004
