EMO generates high-quality, dynamic portrait videos from portrait images and audio files containing human voices. The service consists of two independent models: EMO-detect for compliance detection of portrait images and EMO for portrait video generation.
This document applies only to the China (Beijing) region. To use the models, you must use an API key from the China (Beijing) region.
Model overview
Model introduction
EMO-detect is an image detection model used to check whether an input image meets the specifications of the EMO model.
EMO is a portrait video generation model that generates dynamic portrait videos from portrait images and audio files containing human voices.
Performance showcase
Input: a portrait image and an audio file containing a human voice | Output: a dynamic portrait video
Example 1: portrait video generated with the Active action style ("style_level": "active")
Example 2: portrait video generated with the Normal action style ("style_level": "normal")
Example 3: portrait video generated with the Calm action style ("style_level": "calm")
The preceding examples were generated by the Tongyi App, which integrates EMO.
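The action style shown in the examples is selected with the style_level parameter. The short sketch below only illustrates the three documented values; how the field is nested inside a full request body is covered in the call sketch later in this document and in the EMO video generation reference.

```python
# The three action styles documented in the showcase above.
# Only "style_level" and its values come from this document; the rest of
# this snippet is illustrative scaffolding.
STYLE_LEVELS = ("active", "normal", "calm")

def style_parameters(style: str) -> dict:
    """Return the parameters fragment that selects an EMO action style."""
    if style not in STYLE_LEVELS:
        raise ValueError(f"style_level must be one of {STYLE_LEVELS}")
    return {"style_level": style}
```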
Billing and rate limiting
Mode | Model | Unit price | QPS limit for the task submission API | Number of concurrent tasks
Model call | emo-detect-v1 | Pay-as-you-go: $0.000574/image | 5 | No limit for synchronous APIs
Model call | emo-v1 | Pay-as-you-go | | 1 (only one job runs at a time; other submitted jobs wait in the queue)
Prerequisites
You have activated the service and obtained an API key. For more information, see Preparations: Configure API Key.
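As a convenience for the examples later in this document, the minimal sketch below loads an API key and builds the HTTP headers reused by the request sketches. The DASHSCOPE_API_KEY environment variable name and the Bearer authorization scheme are assumptions; follow the linked preparation guide for the authoritative setup.

```python
import os

# Assumption: the China (Beijing) region API key from the preparation step
# is stored in the DASHSCOPE_API_KEY environment variable.
API_KEY = os.environ["DASHSCOPE_API_KEY"]

# Headers reused by the EMO-detect and EMO request sketches below.
# The Bearer scheme is an assumption; confirm it in the API reference.
COMMON_HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```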
Model call
The EMO series models are available on a pay-as-you-go basis.
To call the models, follow these steps:
Call the EMO-detect model to confirm that the input portrait image meets the specifications. For more information, see EMO image detection.
Call the EMO model with the original portrait image, the image-area parameters returned by the detection step, and an audio file that contains a clear human voice to generate a dynamic portrait video (see the sketch below). For more information, see EMO video generation.
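To make the two-step flow concrete, here is a minimal end-to-end sketch in Python. It is not the authoritative client: the endpoint URLs, the request and response field names (image_url, audio_url, the detection output passed as image-area parameters, video_url, task_status), and the asynchronous submit-then-poll pattern with the X-DashScope-Async header are assumptions modeled on the platform's general task APIs. Use the EMO image detection and EMO video generation references for the exact request and response formats.

```python
import os
import time

import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]  # assumption: key stored in this env var
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Placeholder endpoint paths for illustration only; use the URLs given in the
# "EMO image detection" and "EMO video generation" API references.
DETECT_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/face-detect"
EMO_URL = "https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis"
TASK_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/{task_id}"


def detect_portrait(image_url: str) -> dict:
    """Step 1: ask emo-detect-v1 whether the portrait meets the EMO specifications.

    Returns the detection output, which includes the image-area parameters that
    EMO expects (exact field names are listed in the EMO image detection reference).
    """
    body = {"model": "emo-detect-v1", "input": {"image_url": image_url}}
    resp = requests.post(DETECT_URL, headers=HEADERS, json=body, timeout=60)
    resp.raise_for_status()
    return resp.json()["output"]


def generate_portrait_video(image_url: str, audio_url: str,
                            face_area: dict, style_level: str = "normal") -> str:
    """Step 2: submit an emo-v1 video generation task and poll until it finishes."""
    body = {
        "model": "emo-v1",
        "input": {
            "image_url": image_url,   # the original portrait image
            "audio_url": audio_url,   # audio file with a clear human voice
            **face_area,              # image-area parameters from step 1
        },
        "parameters": {"style_level": style_level},  # "active", "normal", or "calm"
    }
    # Video generation runs as an asynchronous task: submit first, then poll.
    submit = requests.post(
        EMO_URL,
        headers={**HEADERS, "X-DashScope-Async": "enable"},
        json=body,
        timeout=60,
    )
    submit.raise_for_status()
    task_id = submit.json()["output"]["task_id"]

    while True:
        task = requests.get(TASK_URL.format(task_id=task_id), headers=HEADERS, timeout=60)
        task.raise_for_status()
        output = task.json()["output"]
        if output["task_status"] == "SUCCEEDED":
            return output["video_url"]  # placeholder result field name
        if output["task_status"] in ("FAILED", "CANCELED"):
            raise RuntimeError(f"EMO task {task_id} ended with {output['task_status']}")
        time.sleep(10)  # poll at a modest interval
```

In this sketch the detection output is passed through to the video call unchanged; in practice, extract only the image-area fields that the EMO video generation reference lists as required.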


