Media AI: Practice of Intelligent Video Producing

This article highlights Hu Yao's presentation from the Apsara Conference 2020 about the technical capabilities and application practices of the Media AI platform.


At the Intelligent Entertainment Industry Practice Sub-Forum of the Apsara Conference 2020, Hu Yao, Senior Algorithm Expert at Alibaba's Digital Entertainment Group, gave a presentation. Starting from the difficulties of producing video content, he shared the technical capabilities and application practices of the Media AI platform. With automated, large-scale, and real-time video production capabilities, the platform aims to improve efficiency on the business side and support content creation on the consumer side. In doing so, it can help achieve structural upgrades in the entertainment industry.

The following content is a summary of his presentation.

The pace of life accelerates year by year, and people increasingly consume content in fragments of spare time. The popularity of mobile devices and faster network bandwidth gave rise to the short video industry. According to the latest data, more than 773 million short videos are consumed every day, and the market size has exceeded 200 billion yuan.


The short video industry is thriving, but it still suffers from a massive amount of low-quality video. Millions of videos are created and released on various platforms every day, yet their quality is polarized. High-quality professionally generated content (PGC) is scarce because it requires professional refinement and extensive preparation. As a result, most short videos are repetitive and of low quality.

Overall, good video content is difficult to produce for two reasons: creation and tools. Alibaba's Entertainment Group launched the Media AI platform this year to assist video producers on content platforms.

Using AI technology, the Media AI platform provides dynamic material extraction, video template production, intelligent editing, intelligent material processing, and interactive effects.

These functions differ from traditional video editing in that they are aimed at video production. For example, few traditional AI processing technologies or products consider holistic segmentation and tracking by scene and by shot. Very few products consider how to produce different, aesthetically pleasing videos from the same footage. By understanding the business characteristics of the entertainment industry, Alibaba's Entertainment Group focuses on video production and editing, paying close attention to footage structure, dynamic material extraction, the production of quick-look short videos, and the production and special effects of template-based short videos.

These capabilities improve operational efficiency on the business side, support the editing of quality content, and enable the automated production of large numbers of trailers. They also support the extraction and production of dynamic materials and the building of massive dynamic material libraries. In addition, Alibaba's Entertainment Group can implement automated, customized capture for distribution platforms. By extracting video templates, the platform can find high-quality short videos across the whole network and produce short videos in similar styles. Alibaba's Entertainment Group owns a large amount of IP content, from which a large amount of secondary consumption content and new content can be derived. The editing capabilities are also available to users for secondary editing, helping them produce better short videos at a lower cost.

Dynamic Material Extraction


Based on video structuring, the Media AI platform deeply analyzes video information at multiple granularities and develops an understanding of semantic events at a conceptual level. Traditional video and image structuring usually targets concrete information, such as people or objects. For example, it can extract static keywords such as "girl" and "group photo", but these do not meet the material requirements of video creation. In video creation and distribution, creators need more conceptual keywords, such as "hug", "kiss", "funny", "sweet", and "war". Structuring video around such dynamic materials is more in line with how content is presented under the current 5G trend. Therefore, Alibaba's Entertainment Group has implemented automatic extraction of creative materials: the platform automatically extracts semantic materials of varying intensity at the frame level, shot level, and conceptual level. These materials can also be used to automatically produce quality highlights. For example, if a creator wants short video clips of an idol, the platform can quickly generate a series of materials for the creator to use in the next stage of production.
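
To make the idea of multi-granularity materials more concrete, here is a minimal Python sketch of how tagged materials could be stored and queried. It is not the Media AI platform's actual data model: the Material class, the find_materials function, the field names, and the 0-to-1 intensity scale are all assumptions made for illustration, with the three levels named after the frame, shot (lens), and conceptual levels described above.

```python
from dataclasses import dataclass

# Hypothetical granularity levels, named after the three levels in the text.
LEVELS = ("frame", "shot", "concept")


@dataclass(frozen=True)
class Material:
    """One extracted piece of footage with a semantic label (illustrative only)."""
    video_id: str
    start: float      # seconds
    end: float        # seconds
    level: str        # one of LEVELS
    tag: str          # e.g. "hug", "funny", "girl"
    intensity: float  # assumed score in [0, 1]


def find_materials(library, tag, level=None, min_intensity=0.0):
    """Return materials matching a tag, strongest first."""
    hits = [
        m for m in library
        if m.tag == tag
        and (level is None or m.level == level)
        and m.intensity >= min_intensity
    ]
    return sorted(hits, key=lambda m: m.intensity, reverse=True)


# Made-up sample library; a real one would be filled by the analysis models.
library = [
    Material("ep01", 12.0, 12.04, "frame", "girl", 0.9),
    Material("ep01", 30.0, 36.5, "shot", "hug", 0.6),
    Material("ep01", 95.0, 140.0, "concept", "sweet", 0.8),
    Material("ep02", 10.0, 18.0, "shot", "hug", 0.85),
]

strong_hugs = find_materials(library, "hug", level="shot", min_intensity=0.7)
```

A creator searching for "hug" would get shot-level clips ranked by intensity, while a conceptual keyword such as "sweet" would return longer, event-level spans.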

Intelligent Material Processing


Based on the intelligent technologies of the Media AI platform, Alibaba's Entertainment Group has built a massive library of HD dynamic materials for the entertainment industry. For example, from a street dance IP, the platform can extract each dancer's postures, actions, and cross-scene materials. Production is automatic and unattended, and it preserves "hair-level" high-definition detail. These technologies can be fully applied to PGC content production and give streamers more interactive tools for live broadcasts. They can also provide powerful editing tools for customers.

Intelligent Production Diagram of Ultra-HD Dynamic Materials

Based on content structuring, Alibaba's Entertainment Group has combined AI with aesthetics. For example, a design requires only a single set of materials. From it, AI and aesthetic rules produce vertical and horizontal materials in different sizes, and in combination with the player, switching between them is seamless. This improves the user experience, reduces operating costs, and accelerates product iteration. This technology is available now.
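
The presentation does not say how vertical and horizontal versions are derived from a single set of materials. One simple approach, shown here purely as an assumption, is to crop the master frame to the target aspect ratio around a focus point (for example, a detected face). The crop_window function and its parameters are hypothetical and are not part of the Media AI platform.

```python
def crop_window(frame_w, frame_h, target_ratio, focus_x, focus_y):
    """Largest crop with width/height close to target_ratio, centered near the focus.

    Returns (left, top, width, height) in pixels, clamped to the frame.
    """
    if frame_w / frame_h > target_ratio:
        # Frame is wider than the target: keep full height, trim the sides.
        crop_h = frame_h
        crop_w = round(frame_h * target_ratio)
    else:
        # Frame is taller than the target: keep full width, trim top and bottom.
        crop_w = frame_w
        crop_h = round(frame_w / target_ratio)

    # Center the window on the focus point, then push it back inside the frame.
    left = min(max(focus_x - crop_w // 2, 0), frame_w - crop_w)
    top = min(max(focus_y - crop_h // 2, 0), frame_h - crop_h)
    return left, top, crop_w, crop_h


# A 1920x1080 horizontal master with the subject right of center.
square = crop_window(1920, 1080, 1.0, focus_x=1500, focus_y=540)
vertical = crop_window(1920, 1080, 9 / 16, focus_x=1500, focus_y=540)
```

A production system would also need to smooth the window over time and weigh composition, which this sketch ignores.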

Intelligent Editing Technology

Story summaries at finer granularity and much higher density can better satisfy users' demand for fragmented consumption. Users who watch a long video often go on to watch IP content derived from it, and preferences for such content differ from person to person. For example, some female viewers prefer "sweet" and "funny" stories, while others prefer "serious" and "sad" plots. This technology can therefore show different quick-look clips to different users, so they can watch more in-depth content in their limited, fragmented time.


Alibaba's Entertainment Group supports quick-look clip editing at durations from 15 seconds to 5 minutes. Clips of different types can also be customized automatically based on user characteristics. Intelligent, automatic segmentation across scenes and shots allows events to be extracted in full. Based on content structure analysis, the platform automatically matches the story to users' emotions to meet their diverse viewing demands. Because short videos of different durations are produced automatically, the production cost is very low, which opens new advertising opportunities within users' fragmented consumption of IP. Take Good Bye, My Princess as an example. From a single episode, production teams can create many different types of story summaries to meet users' different viewing needs and drive better video consumption.
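
As a rough illustration of personalized quick-look editing, the sketch below picks tagged segments of an episode to fit a target duration, weighting each segment by a viewer's preferences. The greedy heuristic, the build_quick_look function, the segment fields, and the sample data are all assumptions; the presentation does not describe the platform's actual selection algorithm.

```python
def build_quick_look(segments, preferences, target_seconds):
    """Pick segments for a summary no longer than target_seconds.

    segments: list of dicts with "start", "end", and "tags" (tag -> intensity).
    preferences: dict mapping tag -> user weight (hypothetical user profile).
    Greedy by preference score per second; a simple heuristic, not optimal.
    """
    def score(seg):
        return sum(preferences.get(t, 0.0) * v for t, v in seg["tags"].items())

    def duration(seg):
        return seg["end"] - seg["start"]

    ranked = sorted(segments, key=lambda s: score(s) / duration(s), reverse=True)

    chosen, used = [], 0.0
    for seg in ranked:
        if score(seg) <= 0:
            continue  # nothing here matches the viewer's tastes
        if used + duration(seg) <= target_seconds:
            chosen.append(seg)
            used += duration(seg)

    # Play the chosen segments in story order.
    return sorted(chosen, key=lambda s: s["start"])


# Made-up episode segments, as an upstream segmentation step might tag them.
episode = [
    {"start": 0, "end": 20, "tags": {"serious": 0.7}},
    {"start": 40, "end": 50, "tags": {"sweet": 0.9}},
    {"start": 90, "end": 110, "tags": {"funny": 0.8, "sweet": 0.2}},
    {"start": 200, "end": 230, "tags": {"sad": 0.9}},
]

sweet_fan = {"sweet": 1.0, "funny": 0.5}
clip = build_quick_look(episode, sweet_fan, target_seconds=30)
```

Changing the preference weights or the target duration yields a different summary from the same episode, which is the behavior the section describes.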

Template Video Production

When watching short videos, users often watch commentaries on movies or TV shows. Commentators bring their own ideas and personal style to the commentary, but the production cost is relatively high. To address this, Alibaba's Entertainment Group has developed a new feature. When a producer enters a scene description or a piece of text, the Media AI platform automatically generates a clip that combines video, text, and audio. Drawing on massive amounts of high-quality IP content, the platform can produce content automatically, at scale, and in real time. According to the presentation, the automatically generated content is comparable in quality to content created by ordinary producers. In addition, when a producer inputs an arbitrary piece of text, whether written by the producer or edited by fans for an idol, the platform automatically processes the text against the original video. It then adds dubbing and subtitles that convey the producer's viewing experience, which is more three-dimensional and intuitive than traditional bullet-screen comments.


A large number of interesting user-uploaded videos on the internet follow recognizable models and templates. Drawing on its deep accumulation of AI technology, Alibaba intelligently extracts these templates for entertainment and creation. Through intelligent semantic analysis, the Media AI platform extracts a shooting template from a video. It then produces more content in a similar style and presents it to users, so they can watch several videos with varied styles and rich content in a short time. The platform can also decompose a video into shooting elements at the scene or frame level, create a template-based shooting script from these elements, and produce a video in a similar style.
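
The template workflow described above (decompose a video into shooting elements, turn them into a script, then produce a similar video) can be sketched as follows. This is an illustrative assumption, not the platform's implementation: extract_template, fill_template, and the "duration", "shot_type", and "mood" fields are hypothetical, and the input is assumed to come from an upstream analysis step.

```python
def extract_template(shots):
    """Reduce an analyzed video to a reusable shooting script.

    shots: list of dicts with "duration" (seconds), "shot_type", and "mood".
    The template keeps the rhythm and style but drops the original footage.
    """
    return [
        {"duration": s["duration"], "shot_type": s["shot_type"], "mood": s["mood"]}
        for s in shots
    ]


def fill_template(template, materials):
    """Fill each template slot with an unused material of the same type and mood."""
    used, script = set(), []
    for slot in template:
        match = next(
            (
                m for m in materials
                if m["id"] not in used
                and m["shot_type"] == slot["shot_type"]
                and m["mood"] == slot["mood"]
                and m["duration"] >= slot["duration"]
            ),
            None,
        )
        if match is None:
            raise LookupError(f"no material fits slot {slot}")
        used.add(match["id"])
        # Trim the material to the slot length to preserve the original rhythm.
        script.append({"material": match["id"], "duration": slot["duration"]})
    return script


# Made-up analysis of a popular video and a small material library.
source_shots = [
    {"duration": 1.5, "shot_type": "close-up", "mood": "sweet"},
    {"duration": 4.0, "shot_type": "wide", "mood": "energetic"},
    {"duration": 1.0, "shot_type": "close-up", "mood": "sweet"},
]
materials = [
    {"id": "m1", "duration": 2.0, "shot_type": "close-up", "mood": "sweet"},
    {"id": "m2", "duration": 6.0, "shot_type": "wide", "mood": "energetic"},
    {"id": "m3", "duration": 1.2, "shot_type": "close-up", "mood": "sweet"},
]

script = fill_template(extract_template(source_shots), materials)
```

The resulting script reuses the source video's alternation of short and long shots with entirely different footage.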

Building on this technology, Alibaba's Entertainment Group has built rich algorithm-driven material libraries, including plot materials, scene materials, character materials, and general-purpose materials. Template editing is supported, and popular shooting templates can be extracted for customers. For example, techniques that interweave long and short shots, and styles such as "an idol is particularly sweet" and "an idol is particularly energetic," can all be captured in templates. This technology overcomes the high cost and long cycles of production, greatly reduces manual work, and can be replicated in batches for pipeline production. By accumulating technologies and products, Alibaba's Entertainment Group can provide material production capabilities, such as AI-aided creation and design. It has also helped business-side operations improve efficiency and has supported customers' creation, helping achieve structural upgrades for the entertainment industry.

Future Prospects

In the future, on the technical side, Alibaba's Entertainment Group hopes to make distribution more efficient and create better products and tools for the industry. On the customer side, it will offer users new experiences with more ways to consume and interact with videos. On the industry side, it hopes to cooperate with more business-side PGC creators and Multi-Channel Networks (MCNs), supporting their creation with its accumulated experience and tools. At the same time, it hopes to learn from the creative experience of PGC creators and MCNs to create a win-win situation.

Alibaba Clouder
