The Birth of a Virtual Idol - Digital Human Industry and Technology Research

The digital humans we know
In recent years, the concept of the digital human has become extremely popular on the Internet, and many well-known virtual digital humans and virtual idols have appeared in the industry, such as Lil Miquela, Luo Tianyi, Hatsune Miku, Luming, Nuan Nuan, and AYAYI. The business model or market value of these digital humans usually comes down to accumulating traffic through operations and then monetizing that traffic. For example:

Brand endorsement (IP, events)
Fan economy (ACG, the anime and game subculture)
Virtual streamers (gaming, live commerce)
This article defines three core elements for the digital humans discussed below:
1. Shape - a human or anthropomorphic appearance, with specific looks and other character traits.
2. Movement - human-like behavior, with the ability to express itself through language, facial expressions, and body movement.
3. God (spirit) - a human-like mind that can perceive the external environment and can communicate and interact with people.
These three elements build on one another, and each adds to the "completeness" of the virtual digital human.
Market situation
In recent years, virtual digital humans have built up markets of various sizes in industries such as e-commerce, finance, film and television, and games. For example, the market size of the virtual idol industry in China was 3.46 billion yuan in 2020 and was expected to reach 6.22 billion yuan in 2021.
The growth of the market also reflects the development of the technology: production costs fall year by year, appearances become more realistic, and spoken communication becomes more natural. Since virtual digital humans first emerged, the industry has gone through three important periods:
Start-up period: the market is just beginning to grow, the technology is unsettled, and the entry threshold is high.
Development period: the number of competitors increases, the technology gradually takes shape, and the entry threshold falls.
Platform period: the market turns into a red ocean, platforms mature, and the landscape becomes a few leaders plus niche players.
Solutions
In today's platform period, vendors provide solutions at different layers:

Basic layer: provides the basic software and hardware support for virtual digital humans. The hardware includes display devices, optical devices, sensors, chips, and so on; the basic software includes modeling software and rendering engines. Only a few top technology companies have excellent capabilities in both software and hardware.

Platform layer: includes software and hardware systems, production technology service platforms, and AI capability platforms, which provide the technical capabilities for producing and developing virtual characters. Many companies offer platform services, selling services and technology to other companies.

Application layer: in addition to the end enterprise users, some companies or teams with excellent marketing and operation capabilities also bring good ideas to this industry.

What we are doing

The virtual character group was established under the interactive graphics direction of the Alibaba Front-end Committee. It consists of the following teams: Da Taobao Interactive Team, DAMO Academy Smart Digital Human Team, Youku Digital Human Production and Broadcasting Team, Koala Interactive & content shopping guide team, Ant Digital, and financial content community team. Together they share and research technologies and applications in the field of virtual digital humans. The business covers three main scenarios: games, videos, and live broadcasts.

Games: Virtual digital humans are practically standard in the game industry, since many games need to create characters. Some of these games let players customize their character's appearance, a feature also known as "pinching" a character.

Taobao Life: a game in the Taobao app in which you create an avatar. It includes features such as face pinching, dress-up, makeup, photo taking, shopping, and home.
Raising koalas: a game in the Koala Overseas Shopping app in which you raise a koala, with dress-up, feeding, and other gameplay.
Both use Web-based technical solutions and rely on self-developed engines to render the characters along with their expressions and actions.
Video: Short videos of virtual digital humans can give users an excellent sensory experience and can also bring incremental benefits to the business. When making such a video, motion capture, intelligent recognition, and a director system can be used to bring the virtual digital human to life.
Live broadcast: The combination of live broadcast and virtual digital humans is still at an early stage of exploration, because two phenomenal industries have to be combined and it is not easy to form a new or incremental business model. The technologies involved include real-time motion capture, algorithm training and synthesis, and cloud rendering and streaming for live broadcast scenarios.

Let's create together

As application scenarios become more specialized and go deeper, technical research also has to cover comprehensive solutions spanning engineering and algorithms, and the focus differs from scenario to scenario. Next, we take the Taobao Life business of the Da Taobao Interactive Team as an example and show how to create a super virtual idol through six themes: art production, rendering style, face pinching, facial expression, the director system, and speech synthesis.
Shape
In this chapter, we will complete the shape of the virtual digital human: "has a human or anthropomorphic appearance, with a specific look and other character traits".

Carved from a Mold - Art Workflow

First we decide the basic physical characteristics of the virtual digital human, such as a body seven heads tall in realistic proportions or five heads tall in cartoon proportions, and whether the character is male, female, or an anthropomorphic animal. Once these basic features are determined, the base model can be produced by 3D artists. The whole process is usually completed in traditional DCC (digital content creation) software. The biggest difference between 3D art and 2D art is that in 2D content production, technology can step in after the art assets are delivered, whereas in 3D content production, technology needs to be directly involved in the production process. The reason is that the art production process for 3D content is relatively long and complex, and frequent cooperation between art and technology is required to ensure delivery quality and efficiency. We call this process the 3D art workflow.
To give a concrete example: when an artist starts to design and carve the mold for a cup, the production line needs to solve "technical problems" such as what material the mold is made of, how to inject material into the mold, and how to release the cup from the mold smoothly. These are usually the responsibility of the technical side, which must agree on the mold specifications with the artist in advance so that the later stages go smoothly and the mold can finally be delivered. Solutions for 3D art workflows are broadly similar to one another, mainly because 3D content production also follows certain industry standards. The differences lie in the details, which are closely tied to the software the artists use and the engine the technical side implements. Taking the art workflow in Taobao Life as an example, the rough steps are:
1. Use Maya to build the untextured base model ("white model") and skeleton, temporarily store the intermediate files in OSS, and provide preview tools.
2. Create textures with Photoshop and upload them to a CDN.
3. Use a customized GLTF Exporter plug-in in Maya to export glTF (including model data, bone data, materials, and texture data).
4. Adjust the look of the self-developed materials in a material editor embedded on the Web side.
5. Load the glTF of the human body through the GLTF Importer in the EVA Figure engine and render it with custom material shaders.
After a period of running-in with the artists in the early stage, an art workflow that fits current needs eventually takes shape and runs stably.
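As a rough illustration of the kind of check a workflow like this could run on an exported file, the sketch below reads a glTF JSON document and counts the sections that correspond to model data, bone data, materials, and textures. The top-level keys (meshes, skins, materials, textures) come from the glTF 2.0 format; the sample document is trimmed and hand-written, and the function names are hypothetical, not part of the Maya plug-in or the EVA Figure engine.

```python
import json

# A tiny hand-written, glTF-like document; a real export would be far larger
# and each mesh would carry primitives, accessors, and buffers.
SAMPLE_GLTF = """
{
  "asset": {"version": "2.0"},
  "meshes": [{"name": "body"}, {"name": "hair"}],
  "skins": [{"joints": [0, 1, 2]}],
  "materials": [{"name": "skin"}],
  "textures": [{"source": 0}],
  "images": [{"uri": "face_base.png"}]
}
"""

# Model data, bone data, materials, and texture data, in glTF terms.
REQUIRED_SECTIONS = ("meshes", "skins", "materials", "textures")


def summarize_gltf(text):
    """Count the entries in each top-level glTF section we care about."""
    doc = json.loads(text)
    return {name: len(doc.get(name, [])) for name in REQUIRED_SECTIONS}


def missing_sections(text):
    """List sections that are absent or empty, so a bad export is caught early."""
    return [name for name, count in summarize_gltf(text).items() if count == 0]


summary = summarize_gltf(SAMPLE_GLTF)
```

A check of this kind is cheap to run before an asset is handed from art to engineering, which is where most of the running-in cost tends to appear.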
Crafted - Face Pinching
After the base model of the character is complete, each of us can build on it to achieve the look we want. Different looks could be produced at the art stage, but the cost is quite high, and because everyone likes a different look, the art would have to be produced or modified many times. So we add a face pinching system on top of the base model, which makes customization easy. Taobao Life provides this face pinching feature, and you can try it out for yourself. Face pinching works by locally modifying the existing model data, so that one base model can yield a different face for every user. How does such a system change the base model? A model is usually a collection of vertex data, so changing the model usually means changing the vertex data, and there are usually two ways to do that:
Bone skinning
Vertices are "transformed" by an "external force", which is simply a set of mathematical formulas covering three transformations: translation, rotation, and scaling. This external force can be supplied by "bones", which can be understood much like the bones of the human body: when the joints of a finger move, the shape of the hand changes. For the face pinching feature in Taobao Life, we preset about 20 bones for the face, which can change the head circumference, eyeballs, eye corners, eye sockets, cheekbones, face shape, and so on.
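To make the three transformations concrete, here is a minimal 2D sketch of a single bone that scales, rotates, and translates the vertices bound to it. The face data, the bone pivot, and the function names are made up for illustration; a real engine works in 3D with matrices and is not shown here.

```python
import math


def bone_transform(vertex, pivot, scale=1.0, angle_deg=0.0, offset=(0.0, 0.0)):
    """Scale, then rotate, then translate a 2D vertex relative to the bone pivot."""
    x = (vertex[0] - pivot[0]) * scale
    y = (vertex[1] - pivot[1]) * scale
    a = math.radians(angle_deg)
    xr = x * math.cos(a) - y * math.sin(a)
    yr = x * math.sin(a) + y * math.cos(a)
    return (xr + pivot[0] + offset[0], yr + pivot[1] + offset[1])


def pinch(vertices, bound_indices, pivot, **trs):
    """Apply one bone's transform to the vertices bound to it; leave the rest alone."""
    out = list(vertices)
    for i in bound_indices:
        out[i] = bone_transform(vertices[i], pivot, **trs)
    return out


# Hypothetical face: left cheek, right cheek, forehead.
face = [(-1.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

# A "face width" bone at the face centre that widens both cheeks by 20%.
wider = pinch(face, bound_indices=[0, 1], pivot=(0.0, 0.0), scale=1.2)
```

In this sketch the slider a user drags in a face pinching screen would simply map to the scale, angle, or offset of one such bone.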
Blend Shape
The vertex transformation produced by bones is quite coarse and cannot handle something like customizing the mouth shape, because such a seemingly simple feature actually involves tens of thousands of vertices in the model, each transformed according to a different rule. We therefore set up a deformer for this kind of vertex transformation, generally called a "Morph Target" or "Blend Shape" in the industry. The principle is to prepare a reference position for each vertex, provide a maximum position after the "extreme change", and then apply a "weight" so that the vertex can sit at the reference position, at the extreme position, or anywhere in between. However, because the face needs local deformation in many places and a single deformation may involve tens of thousands of vertices, the cost of real-time computation is not small.
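The weighting described above is commonly written as result = base + weight * (target - base), summed over all active targets. The sketch below applies that formula to a two-vertex mesh; the target name "mouth_open" and the vertex data are invented for illustration.

```python
def blend(base, targets, weights):
    """Return base + sum over targets of weight * (target - base), per vertex."""
    result = [list(v) for v in base]
    for name, w in weights.items():
        target = targets[name]
        for i, (b, t) in enumerate(zip(base, target)):
            for k in range(len(b)):
                result[i][k] += w * (t[k] - b[k])
    return [tuple(v) for v in result]


# Reference positions of two lip vertices (hypothetical data).
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

# Extreme positions of the same vertices for one blend shape.
targets = {"mouth_open": [(0.0, -1.0, 0.0), (1.0, -0.5, 0.0)]}

# Weight 0 keeps the reference pose, weight 1 reaches the extreme pose.
half_open = blend(base, targets, {"mouth_open": 0.5})
```

Because the loop touches every vertex of every active target, the cost grows with both the vertex count and the number of blend shapes, which is the real-time cost mentioned above.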
Comparing the two techniques, bone skinning is simple and efficient but not flexible enough, while vertex deformation is freer but expensive to produce and compute. In actual development, deciding where to use bones and where to use vertex deformation is therefore a trade-off between "effect" and "efficiency" that has to be examined case by case and adjusted repeatedly. In Taobao Life, a small part of the face pinching feature uses bone skinning and most of it uses blend shapes, which is the result of long-term running-in and accumulated experience.
Fashion Outfits - Dress Up and Beauty
Once we have the base body and face, we need to dress the character in stylish clothes and apply beautiful makeup. For humans, getting dressed and putting on makeup are "actions". In the virtual world, we likewise need to do these two important things, "wearing" and "painting", well.
Dress up
In real life, clothes worn on the body either sit close to the skin or leave a certain gap. This is actually very hard to reproduce in the virtual world, because skin and clothes are both meshes: when clothes are put on the body, two sets of meshes are effectively "collided" together, which leads to the following two problems:
How to make the clothes move with the body. The body has bones, and a layer of "skin" is wrapped around them. In the same way, clothes are just another layer of "skin" wrapped around the same bones. In Taobao Life, the body and the clothes use the same skeleton template, and the two sets of skeleton data are kept in sync in real time during rendering.
How to keep the body mesh from poking through the clothes. Sharing one skeleton is an ingenious way to make clothes "worn" on the body, but it is also prone to problems. For example, if a piece of clothing is strongly concave somewhere, the skin of the body easily shows through the clothes, commonly known as "clipping". Because carefully adjusting every item is too costly, we use a shortcut: we "cut" the human body into parts and mark which parts each piece of clothing covers, and when a piece of clothing is rendered, the body mesh of the covered parts is simply hidden.
With these two techniques, clothes can be mass-produced by following the specifications and workflow agreed with the artists, and changing clothes just means loading a different model, with no need to handle each item specially.
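The "cut and hide" shortcut can be sketched as simple set arithmetic: each garment carries a list of the body parts it covers, and the renderer draws only the body parts that no worn garment covers. The part names and garment tags below are hypothetical; the actual segmentation used in Taobao Life is not described in this article.

```python
# Hypothetical segmentation of the body mesh into separately drawable parts.
BODY_PARTS = {"head", "torso", "upper_arms", "forearms", "hands",
              "thighs", "calves", "feet"}

# Hypothetical coverage tags that would be agreed with the artist per garment.
COVERAGE = {
    "tshirt": {"torso", "upper_arms"},
    "jeans": {"thighs", "calves"},
    "sneakers": {"feet"},
}


def visible_body_parts(outfit):
    """Return the body parts that still need to be rendered for this outfit."""
    hidden = set()
    for garment in outfit:
        hidden |= COVERAGE[garment]
    return BODY_PARTS - hidden


to_draw = visible_body_parts(["tshirt", "jeans"])
```

Hidden parts cannot poke through the clothes because they are never drawn, and skipping them also saves some rendering work.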
Beauty
Makeup demands a great deal of fine detail, so the most convenient approach is to use textures. The face of the base model already has a base texture, which can be thought of as "no makeup"; to draw different makeup looks on the bare face, we use "dynamic texture compositing". The process has two steps.
1. Render to texture: create a render target, render the base texture into it, and then render the makeup texture into the same render target. Note that when the artist draws the makeup texture, its UVs must correspond one-to-one with those of the base texture.
2. Use the rendered texture: render the composited texture onto the model.
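The first step can be sketched on the CPU as ordinary alpha blending, assuming the makeup layers are RGBA images laid out on the same UVs as the base texture. In the engine this would happen on the GPU with a render target; the plain nested lists, the colors, and the function names below are stand-ins for illustration only.

```python
def blend_pixel(base_rgb, layer_rgba):
    """Standard "over" blend of one RGBA makeup texel onto one RGB base texel."""
    alpha = layer_rgba[3]
    return tuple(b * (1.0 - alpha) + l * alpha
                 for b, l in zip(base_rgb, layer_rgba[:3]))


def render_to_texture(base, layers):
    """Composite makeup layers, in order, onto a copy of the base texture."""
    target = [list(row) for row in base]  # stands in for the render target
    for layer in layers:
        for y, row in enumerate(layer):
            for x, texel in enumerate(row):
                target[y][x] = blend_pixel(target[y][x], texel)
    return target


SKIN = (0.8, 0.6, 0.5)
bare_face = [[SKIN, SKIN]]  # a 2x1 "no makeup" base texture

# A half-transparent red layer that paints only the first texel.
lipstick = [[(1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 0.0, 0.0)]]

made_up = render_to_texture(bare_face, [lipstick])
```

Because both textures share the same UV layout, texel (x, y) of the makeup layer always lands on texel (x, y) of the base, which is why the one-to-one UV requirement matters.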
2D or 3D - Rendering Style
Finally, it is time to choose a style. Some people like a realistic style, some like a cartoon style, some like a punk style, and some like the "pure desire" style. These styles all depend on rendering. Whenever we talk about rendering, we come to the graphics rendering pipeline, which can be combined and adjusted according to different needs. For example, the simplest rendering pipeline is: load model ---> vertex shader ---> rasterization ---> fragment shader. The fragment shading step draws the material and texture and thereby achieves the desired character style. Material rendering usually falls into two categories:
PBR
The full name is Physically Based Rendering. Because it is based on physics, its rendering results are very close to the real world, so this type of material is what makes a character look realistic or hyper-realistic. The technique rests on 8 core theories and several important lighting models, which are not listed here. Interested readers can consult the PBR-related chapters of "Real-Time Rendering" or the SIGGRAPH course series "Physically Based Shading in Theory and Practice". In Taobao Life, for example, subsurface scattering is approximated by sampling colors from a gradient map according to the part of the body, which gives the face a rosy, translucent look.
NPR
The full name is Non-Photorealistic Rendering. One of its main applications is the very popular anime style, especially the Japanese cartoon style. Unlike PBR, NPR does not pursue physical simulation; instead it draws inspiration from oil paintings, sketches, and cartoon animation. Commonly used techniques include character outlines, cel shading, rim lighting, and hair highlights. These special material renderings are also covered in professional papers and examples, which you can look up yourself.
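Two of these techniques are small enough to sketch directly. The first function quantizes ordinary diffuse lighting into a few flat bands, which is one common way to get cel shading; the second brightens surfaces that face away from the camera, which is one common form of rim lighting. Both assume unit-length vectors, and the band count and exponent are illustrative values, not settings taken from Taobao Life.

```python
def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def toon_shade(normal, light_dir, bands=3):
    """Quantize Lambert diffuse lighting (N.L) into a few flat brightness bands."""
    n_dot_l = max(0.0, dot(normal, light_dir))
    level = min(int(n_dot_l * bands), bands - 1)
    return level / (bands - 1)


def rim_light(normal, view_dir, power=2.0):
    """Brighten surfaces whose normal is nearly perpendicular to the view direction."""
    n_dot_v = max(0.0, dot(normal, view_dir))
    return (1.0 - n_dot_v) ** power


facing_light = toon_shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
grazing_light = toon_shade((0.0, 0.0, 1.0), (0.0, 0.8660254, 0.5))
silhouette_rim = rim_light((1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

In a real renderer the same arithmetic would run per fragment in the fragment shader, often with the bands read from a small ramp texture so that artists can tune them.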
Movement
In this chapter, we will complete the movement of the virtual digital human: "has behavior similar to that of human beings, and has the ability to express itself with language, facial expressions, and body".
Expressions and Actions
One of the keys to a convincing virtual digital human is realistic, delicate expressions and movements. In real people, expressions and movements come from the interaction of bones and muscles. In the virtual world, we use digital techniques to simulate those bones and muscles. The "Shape" chapter above already described how bone skinning and blend shapes change the vertices of the face. This chapter uses the same two techniques and adds animation, making the vertices "move" to produce the corresponding expressions and actions.
Hand-keyed animation
For facial expressions, vertex animation (Morph Target Animation) is one of the main techniques. In Taobao Life's face pinching feature, bone skinning determines the size of the face and the position of the facial features, while blend shapes deform the facial features themselves, including the cheeks and forehead. As many as 50 blend shapes are used for expression animation, a setup very close to the blend shapes used by Apple's Animoji (although many micro-expressions remain very hard to achieve).

For body movements, skeletal animation is one of the main techniques. Skeletal animation involves two kinds of data: the skeleton and the skin. First, the vertices of the mesh are bound to the bones to produce the skin; each vertex can be influenced by several bones with different weights. Animation is then generated by changing the orientation and position of the bones, and the skin moves with them.
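The weighted influence of several bones on one vertex is commonly computed with linear blend skinning: the vertex is moved by each bone separately, and the results are averaged by weight. Below is a minimal 2D sketch with a two-bone arm. For simplicity it assumes each bone pose is already given in world space as a pivot and a rotation angle, and the bone names and weights are invented.

```python
import math


def rotate_about(point, pivot, angle_deg):
    """Rotate a 2D point around a pivot by the given angle."""
    a = math.radians(angle_deg)
    x, y = point[0] - pivot[0], point[1] - pivot[1]
    return (pivot[0] + x * math.cos(a) - y * math.sin(a),
            pivot[1] + x * math.sin(a) + y * math.cos(a))


def skin_vertex(vertex, influences, bone_poses):
    """Linear blend skinning: weighted sum of the vertex as moved by each bone."""
    x = y = 0.0
    for bone, weight in influences.items():
        pivot, angle = bone_poses[bone]
        px, py = rotate_about(vertex, pivot, angle)
        x += weight * px
        y += weight * py
    return (x, y)


# Arm pose: the upper arm stays put, the forearm bends 90 degrees at the elbow (1, 0).
poses = {"upper_arm": ((0.0, 0.0), 0.0), "forearm": ((1.0, 0.0), 90.0)}

# A wrist vertex follows the forearm only.
wrist = skin_vertex((2.0, 0.0), {"forearm": 1.0}, poses)

# A vertex near the elbow is shared half and half by both bones.
near_elbow = skin_vertex((1.2, 0.0), {"upper_arm": 0.5, "forearm": 0.5}, poses)
```

The shared weights near the elbow are what keep the skin from tearing at the joint when only one of the two bones moves.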
Motion Capture
Producing animation is relatively expensive, because both techniques above are usually keyframe animation. Suppose a person lifts an elbow: the arm and wrist move along with it. Achieving such an animation takes a large number of keyframes and is extremely laborious. To reduce production cost effectively, we turn to motion capture. Motion capture technology is usually divided into 2 major directions and 4 categories, which can be described by a four-quadrant diagram:
The AR interactive games, AR masks, and other features that we build with mobile phone cameras are actually technologies in the quadrant of optical recognition + wearable devices.
Choreography - Director System
Let's compare the cost and flexibility of these methods. Hand-keyed animation undoubtedly has the highest production cost, along with the best flexibility and results, and it requires experienced riggers and animators. Motion capture requires a set of professional motion capture equipment and a venue that can house the equipment and accommodate the required actions, and producing one action can take anywhere from a few hours to a few days. For operational needs such as large numbers of dance scenes, the most suitable solution is to arrange multiple finished movements, freely or intelligently, into a script. This is like filming, where the director prepares the script in advance and the actors only need to perform it, so it is also called the "director system".
Chaining actions raises one problem: how to transition from action 1 to action 2. This calls for animation blending. The basic principle is to interpolate between keyframes, taking the current state of action 1 as the start point and the specified state of action 2 as the end point. Simple cases can use linear interpolation; more complex ones can use Bezier curve interpolation and so on. There are many techniques and solutions for animation blending, suited to different scenarios and needs, which you can look up yourself; Unity and Unreal, for example, provide many different blending solutions.
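A minimal sketch of both interpolation styles follows, assuming a pose is just a dictionary of joint angles in degrees. The easing function is a one-dimensional cubic Bezier with control values 0, p1, p2, 1 (with the defaults it reduces to the familiar smoothstep curve). Production engines usually interpolate rotations with quaternions, which this sketch leaves out, and the joint name is invented.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def bezier_ease(t, p1=0.0, p2=1.0):
    """1D cubic Bezier through control values 0, p1, p2, 1, evaluated at t."""
    u = 1.0 - t
    return 3 * u * u * t * p1 + 3 * u * t * t * p2 + t ** 3


def blend_pose(pose_a, pose_b, t, ease=None):
    """Interpolate every joint from pose_a (t = 0) to pose_b (t = 1)."""
    k = ease(t) if ease else t
    return {joint: lerp(pose_a[joint], pose_b[joint], k) for joint in pose_a}


end_of_action_1 = {"elbow": 0.0}
start_of_action_2 = {"elbow": 90.0}

linear = blend_pose(end_of_action_1, start_of_action_2, 0.25)
eased = blend_pose(end_of_action_1, start_of_action_2, 0.25, ease=bezier_ease)
```

A quarter of the way through the transition, the linear blend has already turned the elbow 22.5 degrees, while the eased blend has turned it about 14 degrees, so the eased motion starts and ends more gently.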
The director system provides the ability to combine movements freely. For example, if you want to hold a virtual concert, the performance can be put together through the director system.
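One simple way to picture such a script is as an ordered list of finished clips with durations, plus a short cross-fade at the end of each clip. The sketch below answers one question: at time t, which clips are playing and with what blend weights? The clip names, the durations, and the fixed half-second fade are assumptions made for illustration, not the actual Taobao Life director system.

```python
def active_clips(script, t, fade=0.5):
    """Return [(clip, weight)] playing at time t for a list of (clip, duration)."""
    start = 0.0
    for i, (clip, duration) in enumerate(script):
        end = start + duration
        is_last = i == len(script) - 1
        if t < end or is_last:
            remaining = end - t
            if not is_last and remaining < fade:
                # Inside the cross-fade window: hand over to the next clip.
                w = remaining / fade
                return [(clip, w), (script[i + 1][0], 1.0 - w)]
            return [(clip, 1.0)]
        start = end
    return []


# A hypothetical concert script: wave for 2 s, spin for 3 s, then bow for 1 s.
concert = [("wave", 2.0), ("spin", 3.0), ("bow", 1.0)]

mid_wave = active_clips(concert, 1.0)
handover = active_clips(concert, 1.75)
after_end = active_clips(concert, 10.0)
```

The weights returned during a handover are the kind of value an animation blending step would consume, so a scheduler like this and the blending step fit together naturally.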
God
In this chapter, we will explore the "God" (spirit) of the virtual digital human: "has a human-like mind, has the ability to recognize the external environment, and can communicate and interact with people". Our research on "God" is still at a very early stage. On the one hand, it needs the support of big data; on the other hand, it is a considerable distance from front-end work. To make virtual digital humans more real, "God" will be the focus of future research.
Character Expression - Natural Speech Synthesis
The language ability of a virtual digital human relies on speech synthesis technology such as TTS (Text To Speech). Alibaba DAMO Academy has a very complete TTS engine that lets virtual digital humans speak. However, this is only speaking: anyone can tell that such speech is flat and lifeless, without "emotion", and cannot convey the different tones that go with different personalities and moods. Some excellent results can be seen in the industry: YAMAHA's singing synthesis system "VOCALOID" (used by Hatsune Miku and Luo Tianyi), Google's deep-learning-based end-to-end speech synthesis system "Tacotron", iFLYTEK's speech synthesis system, and others. DAMO Academy is also continuing to research a speech synthesis system that is closer to natural expression, generating emotional tones in different styles by giving virtual digital humans personalities and applying deep learning to big data.

So far, we have completed the debut of a super virtual idol with existing technology.
The digital humans we aspire to
Taobao Life is a digital human built on Web technology. After two years of technical polishing and upgrading, we have also run into the limits of Web digital human technology. In terms of performance, there is an unavoidable gap between Web applications and native applications. WebGL (version 1.0 is based on OpenGL ES 2.0, version 2.0 on OpenGL ES 3.0), the main graphics interface of the Web, cannot match Vulkan, DirectX, or Metal in capability or performance. In addition, odd software and hardware compatibility issues still appear across different mobile devices. Together, these problems form a ceiling that is hard for digital humans on the Web to break through.
By comparison, the digital human technology seen elsewhere in the industry, with hyper-realistic rendering, micro-expressions, muscle simulation, physical materials, ray tracing, and more, is beyond what we can match with Web technology. At the same time, Alibaba's virtual digital human technology has only just started, and its basic software and hardware, middle-platform technology, and big data support all got a late start, which has brought us many difficulties and obstacles.
Faced with these difficulties and gaps, we will work in several directions to advance virtual digital human technology in Taobao Life.
The first is optimization based on Web technology: building on a serverless rendering cloud service that combines EVA Figure (a virtual character rendering engine) with Puppeteer, and with the help of newer technologies such as WebGPU and WASM, the rendering effect and quality of virtual digital humans can be improved. We are also actively working with the Alibaba Cloud cloud service team and the Da Taobao Node architecture team to build a cloud rendering pipeline based on Web technology. It is planned for non-real-time rendering tasks, such as producing full-body photos, short videos, and action frames of Taobao Life user avatars. These outputs can be used by Taobao Life or by other businesses.
The second is the upgrade of business capabilities. In an environment where business and technology nourish each other, business capabilities are continuously consolidated into platform services, contributing experience bit by bit to Alibaba's virtual digital human technology, and some solutions are commercialized and offered to the public through the cloud.
Guided by industry bellwethers such as the metaverse, hyperrealism, XR/6G, and brain-computer interfaces, we can imagine the possibilities of future Web digital human technology.
At the end of the article, I would like to thank the members of the virtual character group in the interactive graphics direction of the Alibaba Front-end Committee for their excellent work, which made this article possible. You are also welcome to keep following the results of the virtual character group and each of its teams.
