NetEase Cloud Music: Explore Serverless Solutions for Audio and Video Algorithms

By Liao Xiangli

The previous audio and video technologies of NetEase Cloud Music were mostly applied to the data processing of the music library. Based on the experience of the service of audio and video algorithms, the NetEase Cloud Music Music Library Team and the Audio and Video Algorithm Team collaborated to build a centralized audio and video algorithm processing platform. This article explains how NetEase Cloud Music uses Serverless technology to optimize its entire audio and video processing platform.

This article is split into three parts:

Current Situation: The application of audio and video technology in NetEase Cloud Music and the problems encountered before the introduction of Serverless technology
Selection: The considerations when investigating Serverless solutions
Outlook: The transformations we have made, the final landing effect, and future planning

Current Situation

As a music-based company, audio and video technology is widely used in many business scenarios of NetEase Cloud Music. Here are five common scenarios:

By default, users hear the sound quality of the standardized bitrate that we have previously converted using the audio transcoding algorithm. Some users want better sound quality. Some users want lower sound quality due to limited network traffic plans.
Users can use the music recognition function in the app to identify music in the environment, which uses audio fingerprint extraction and identification technology.
We offer chorus detection for some VIP songs on the platform to give users a better experience. Here, the chorus detection algorithm is used.
We need to display the audio pitch and score in the karaoke function of cloud music. Here, we use the pitch generation algorithm.
We also provide transliteration lyrics in Japanese, Cantonese, and other languages for non-Chinese-speaking users. Here, the automatic Romaji algorithm is used.

As you can see from the scenarios above, audio and video technology is widely used in different scenarios and plays an important role.

Audio and video technology can be divided into three categories: analysis and understanding, processing, and creation and production. Part of these are processed on the end in the form of SDK. More parts aim to provide general audio and video capabilities in the form of services through algorithm engineering and backend cluster deployment management. This part is the focus of our article today.

We need to understand the characteristics of many related audio and video algorithms in the service deployment of audio and video algorithms, such as the deployment environment, execution time, and whether they support concurrent processing. With the increase of our landing algorithms, we have summarized the following rules:

Long Execution Time of the Algorithm: The execution time is often proportional to the duration of the original audio. In many scenarios, the duration range of audio and video is large. Based on this feature, we often use asynchronous mode in the design of the execution unit.
Audio and Video Algorithms Have Multi-Language Features: Cloud music algorithms include C++, Python, and other languages, which will bring huge difficulties to the environment context. In order to solve this problem, we adopt standardized conventions and image delivery methods to decouple various environment preparations. Therefore, whether it can support image deployment will become a key point in our technical selection.
The Demand for Flexibility Is Increasing: The songs on the platform have grown from 5 million when I joined the company to over 60 million today. The gap between increment and stock is getting larger. When we quickly implement an algorithm, we should consider incremental access and stock rapid processing. Therefore, in the system design, the minimum granularity of the execution unit will be separated to facilitate rapid scale-out.

Based on our understanding of engineering and the characteristics of audio and video processing algorithms, the overall architecture of the audio and video processing platform of cloud music is listed below:

We have made a unified design for the common parts of different audio and video algorithm processing, including the visualization, monitoring, quick start, and processing data statistics of algorithm processing. We have designed a unified and configurable management mode for the allocation of resources, so the common parts of the whole system can be abstracted and reused as much as possible.

The most critical aspect of the entire audio and video algorithm processing platform is the interaction and design of the execution unit, which is also the focus of this article. Cloud music solves many connection and deployment efficiency issues through unified connection standards and image delivery. For the use of resources, since we continuously have new algorithms and stock/incremental services, we used internal private cloud host application/recycling and content containerization before migrating to the cloud.

In order to describe the operation process of the cloud music execution unit better, we will refine it in the following figure.

The interaction between the execution unit and other systems is decoupled through message queues. In the execution unit, we adapt to algorithms with a different concurrent performance by controlling the concurrency of message queues and trying to control the main work of the execution unit only for algorithm computing. As such, we can finally achieve the minimum granularity scale-out when the system is scaled out.

In this mode, we have landed more than 60 audio and video algorithms. Over the past year, service-oriented algorithms accounted for half of the 60 audio and video algorithms. These algorithms provide service capabilities to more than 100 of our business scenarios. However, more complex algorithms and business scenarios put higher requirements on our service efficiency, O&M deployment, and elasticity. Before we migrated to the cloud, we used more than 1000 ECS instances and physical machines of different specifications.

Selection

As the complexity of business scenarios and algorithms increases, many methods are used to simplify the connection between internal business scenarios and algorithms. However, more algorithms that mix with stock and incremental processing, the scale of business scenarios with various traffic, and the same type of algorithms may reuse the same algorithm. This makes us spend more time processing machine resources than we spend on development.

This also prompted us to start thinking about more ways to solve the problems we encountered. There are three pain points:

The first pain point is that the difference between stock and increment increases, and the implementation of new algorithms increases. We spend more time coordinating stock and incrementing resources. Secondly, with the increase in algorithm complexity, we need to pay attention to the overall specifications and utilization rate of machines when applying for/purchasing machines. Finally, we hope that the processing of stocks can be accelerated, and there are enough resources when processing it. When processing a large amount of audio and video data, we can compress the time when the stock is inconsistent with the increment. In general, we hope to have large-scale elastic resources, so audio and video algorithm services do not need to pay much attention to machine management.

However, the actual transformation is concerned with the final service capacity but also needs to consider the ROI. Specifically:

Cost: This includes two aspects: the implementation cost of transformation and the cost of computing resources. The former can be evaluated in combination with specific plans to obtain the required investment a person per day. In addition, the flexible scale-out in the future after the transformation is a point we need to consider. The latter can be estimated using the official cost computing model given by the cloud vendor and our execution data. The key to our selection in terms of cost is that future IT costs will not increase significantly if the transformation cost is acceptable.
Support for the Operating Environment: As mentioned earlier, the operating environment is diversified and deployed by image delivery. The team has complete CICD support, which requires future upgrades and deployment transactions. We hope that after the transformation, we do not need to spend too much time and energy on such matters and pay more attention to the algorithm execution.
Elasticity: In addition to the scale of computing resource pools provided by cloud vendors, we focus on the startup speed of elastic computing power, whether it can reserve instances for fixed scenarios, and whether it provides flexible elasticity that meets business requirements to support business development.

These are consistent with the definition of Serverless. You do not need to manage servers to build and run applications with excellent elasticity. Based on the considerations above, we chose the public cloud Function Compute (FC) method. It can visually map our current computing execution process and meet the needs of future attempts to use schemas for algorithm orchestration. Next, I will focus on introducing Function Compute.

Outlook

We quickly trialed FC for a week. However, a complete, highly reliable architecture requires more considerations. The overall external input and output of the system remain unchanged, and the system has traffic control capability. In addition, the system can be downgraded to the private cloud for processing in case of special circumstances to ensure the high reliability of the system. The specific architectural transformation is shown in the following figure:

The focus of the transformation is to make the development environment of NetEase Cloud Music adapt to Function Compute. We focused on the transformation of deployment, monitoring, and hybrid cloud. In terms of deployment, we fully applied the support of Function Compute on CI/CD and image deployment to realize automatic image pulling. In terms of monitoring design, we use the monitoring and alert function on the cloud and convert it into the parameters of our existing internal monitoring system, so the overall development and operation and maintenance processing can maintain consistency. We consider that the code design can be compatible with the implementation of hybrid cloud deployment. Finally, we completed the Serverless transformation of our audio and video processing platform.

The Function Compute (FC) billing policy shows us that three major factors affect the final cost: the memory specifications, the number of times to trigger computing, and the cost of outbound internet traffic. The technical architecture needs us to pay more attention to the first two. The traffic is also costly. This is also a focus for us.

According to the cost characteristics of Function Compute (FC), under the condition that the storage system still uses NetEase's private cloud, in the first stage, we selected the audio and video algorithm with less outbound internet traffic. As for the relatively small amount of outbound internet traffic, let me give an example to extract the characteristics of the audio. For example, if audio goes in and extracts a 256-dimensional array, the obtained result is only a 256-dimensional array, which is much smaller than the audio traffic. Thus, the cost of outbound internet traffic will be small.

In the first stage of introducing Function Compute, the algorithm of the feature extraction class was improved by ten. The algorithm of sparse class can be understood as the algorithm with a low daily utilization rate, which has been greatly saved in cost. In addition, the startup speed of our nodes is optimized through the Function Compute image cache acceleration capability, so all services can be pulled up in seconds. These efforts reduce a large amount of O&M costs in algorithm O&M processing, allowing us to focus more on algorithms and businesses.

The figure on the right side above shows an operation example of one of the cloud music algorithms. As you can see, we have a large range of elastic changes, and Function Compute meets this requirement.

In the future, we hope to use Serverless technology to liberate our investment in O&M and solve the problem of outbound Internet traffic from storage, so audio and video algorithms in more scenarios can be implemented naturally. Secondly, as the complexity of algorithms increases, the use of computing resources becomes more complex. We hope to optimize the computing process using GPU instances. Lastly, there are more real-time audio and video processing scenarios in the cloud music business scenarios. Similarly, it has clear peak and trough fluctuations. We hope to accumulate more experience using Serverless services and help the development of real-time audio and video technology for cloud music.

Author Introduction

Liao Xiangli is the Head of Research and Development of the NetEase Cloud Music Library. He joined NetEase Cloud in 2015.

Community

NetEase Cloud Music: Explore Serverless Solutions for Audio and Video Algorithms

Current Situation

Selection

Outlook

Author Introduction

Read previous post:

Read next post:

Alibaba Cloud Serverless

You may also like

Comments

Alibaba Cloud Serverless

Related Products

Function Compute

ECS(Elastic Compute Service)

Serverless Workflow

Serverless Application Engine