EMO generates high-quality, dynamic portrait videos from portrait images and audio files containing human voices. The service consists of two independent models: EMO-detect for compliance detection of portrait images and EMO for portrait video generation.
This document applies only to the China (Beijing) region. To use the models, you must use an API key from the China (Beijing) region.
Model overview
Model introduction
EMO-detect is an image detection model used to check whether an input image meets the specifications of the EMO model.
EMO is a portrait video generation model that generates dynamic portrait videos from portrait images and audio files containing human voices.
Performance showcase
Input: a portrait image and an audio file containing a human voice | Output: a dynamic portrait video
Example 1: portrait video generated with the Active action style ("style_level": "active")
Example 2: portrait video generated with the Normal action style ("style_level": "normal")
Example 3: portrait video generated with the Calm action style ("style_level": "calm")
The preceding examples were generated by the Tongyi App, which integrates EMO.
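The action style shown in the examples is selected with the style_level parameter. The short sketch below only illustrates the three documented values; how the field is nested inside a full request body is covered in the call sketch later in this document and in the EMO video generation reference.

```python
# The three action styles documented in the showcase above.
# Only "style_level" and its values come from this document; the rest of
# this snippet is illustrative scaffolding.
STYLE_LEVELS = ("active", "normal", "calm")

def style_parameters(style: str) -> dict:
    """Return the parameters fragment that selects an EMO action style."""
    if style not in STYLE_LEVELS:
        raise ValueError(f"style_level must be one of {STYLE_LEVELS}")
    return {"style_level": style}
```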
Billing and rate limiting
Mode | Model | Unit price | QPS limit for the task submission API | Number of concurrent tasks
Model call | emo-detect-v1 | Pay-as-you-go: $0.000574/image | 5 | No limit for synchronous APIs
Model call | emo-v1 | Pay-as-you-go | | 1 (only one job runs at a time; other submitted jobs wait in the queue)
Prerequisites
You have activated the service and obtained an API key. For more information, see Preparations: Configure API Key.
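As a convenience for the examples later in this document, the minimal sketch below loads an API key and builds the HTTP headers reused by the request sketches. The DASHSCOPE_API_KEY environment variable name and the Bearer authorization scheme are assumptions; follow the linked preparation guide for the authoritative setup.

```python
import os

# Assumption: the China (Beijing) region API key from the preparation step
# is stored in the DASHSCOPE_API_KEY environment variable.
API_KEY = os.environ["DASHSCOPE_API_KEY"]

# Headers reused by the EMO-detect and EMO request sketches below.
# The Bearer scheme is an assumption; confirm it in the API reference.
COMMON_HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```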
Model call
The EMO series models are available on a pay-as-you-go basis.
To call the models, follow these steps:
Call the EMO-detect model to confirm that the input portrait image meets the specifications. For more information, see EMO image detection.
Call the EMO model with the original portrait image, the image-area parameters returned by the detection step, and an audio file that contains a clear human voice to generate a dynamic portrait video (see the sketch below). For more information, see EMO video generation.
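To make the two-step flow concrete, here is a minimal end-to-end sketch in Python. It is not the authoritative client: the endpoint URLs, the request and response field names (image_url, audio_url, the detection output passed as image-area parameters, video_url, task_status), and the asynchronous submit-then-poll pattern with the X-DashScope-Async header are assumptions modeled on the platform's general task APIs. Use the EMO image detection and EMO video generation references for the exact request and response formats.

```python
import os
import time

import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]  # assumption: key stored in this env var
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Placeholder endpoint paths for illustration only; use the URLs given in the
# "EMO image detection" and "EMO video generation" API references.
DETECT_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/face-detect"
EMO_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis"
TASK_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}"


def detect_portrait(image_url: str) -> dict:
    """Step 1: ask emo-detect-v1 whether the portrait meets the EMO specifications.

    Returns the detection output, which includes the image-area parameters that
    EMO expects (exact field names are listed in the EMO image detection reference).
    """
    body = {"model": "emo-detect-v1", "input": {"image_url": image_url}}
    resp = requests.post(DETECT_URL, headers=HEADERS, json=body, timeout=60)
    resp.raise_for_status()
    return resp.json()["output"]


def generate_portrait_video(image_url: str, audio_url: str,
                            face_area: dict, style_level: str = "normal") -> str:
    """Step 2: submit an emo-v1 video generation task and poll until it finishes."""
    body = {
        "model": "emo-v1",
        "input": {
            "image_url": image_url,   # the original portrait image
            "audio_url": audio_url,   # audio file with a clear human voice
            **face_area,              # image-area parameters from step 1
        },
        "parameters": {"style_level": style_level},  # "active", "normal", or "calm"
    }
    # Video generation runs as an asynchronous task: submit first, then poll.
    submit = requests.post(
        EMO_URL,
        headers={**HEADERS, "X-DashScope-Async": "enable"},
        json=body,
        timeout=60,
    )
    submit.raise_for_status()
    task_id = submit.json()["output"]["task_id"]

    while True:
        task = requests.get(TASK_URL.format(task_id=task_id), headers=HEADERS, timeout=60)
        task.raise_for_status()
        output = task.json()["output"]
        if output["task_status"] == "SUCCEEDED":
            return output["video_url"]  # placeholder result field name
        if output["task_status"] in ("FAILED", "CANCELED"):
            raise RuntimeError(f"EMO task {task_id} ended with {output['task_status']}")
        time.sleep(10)  # poll at a modest interval
```

In this sketch the detection output is passed through to the video call unchanged; in practice, extract only the image-area fields that the EMO video generation reference lists as required.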


