
Intelligent Speech Interaction:Timestamp feature

Last Updated:Sep 09, 2020

The speech synthesis service generates a timestamp for each word in a sentence, which indicates the point in time at which the word occurs in the audio stream. The timestamp feature of speech synthesis is also known as phoneme boundary detection. Timestamps can be used for virtual speakers and video subtitles.

Notice

This feature is available only for speakers that support phoneme boundary detection.

Request parameters

To enable the timestamp feature, set the enable_subtitle request parameter to true when you initiate a request from the client.

For example, if you use the SDK for Java, you can use the following configuration:

// Enable the timestamp feature so that the server returns timestamps
// for the text that is sent. This feature is disabled by default.
synthesizer.addCustomedParam("enable_subtitle", true);

Server response

If you set the enable_subtitle parameter to true in a request, the server returns a MetaInfo event that contains the timestamps for the text that is sent.
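The MetaInfo event can be told apart from other server events by the name field in the response header, as shown in the sample output later in this topic. The following sketch detects the event in a raw response string. The class name is illustrative and not part of the SDK, and the regex only serves this illustration; a production client should parse the response with a real JSON library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: check whether a raw server response is a MetaInfo event
// by reading the "name" field in the response header.
public class MetaInfoDetector {
    private static final Pattern NAME = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]+)\"");

    public static boolean isMetaInfoEvent(String rawJson) {
        Matcher m = NAME.matcher(rawJson);
        return m.find() && "MetaInfo".equals(m.group(1));
    }
}
```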

| Parameter | Type | Description |
| --- | --- | --- |
| subtitles | List | The information about the timestamps. |

The following table describes the parameters contained in subtitles.

| Parameter | Type | Description |
| --- | --- | --- |
| text | String | The word in the text that is sent. |
| begin_time | Integer | The start timestamp of the word in the synthesized audio data, in milliseconds. |
| end_time | Integer | The end timestamp of the word in the synthesized audio data, in milliseconds. |
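Because begin_time and end_time are millisecond offsets into the synthesized audio, you can compute the duration of each word and render the offsets in a subtitle format such as SRT. The following sketch models one entry of the subtitles list; the class and method names are illustrative, not part of the SDK.

```java
// Models one entry of the "subtitles" list returned in the MetaInfo event.
public class SubtitleEntry {
    public final String text;
    public final int beginTime; // begin_time, in milliseconds
    public final int endTime;   // end_time, in milliseconds

    public SubtitleEntry(String text, int beginTime, int endTime) {
        this.text = text;
        this.beginTime = beginTime;
        this.endTime = endTime;
    }

    // Duration of the word in the synthesized audio, in milliseconds.
    public int durationMillis() {
        return endTime - beginTime;
    }

    // Formats a millisecond offset as an SRT-style timestamp (HH:MM:SS,mmm).
    public static String toSrtTimestamp(int millis) {
        int h = millis / 3_600_000;
        int m = (millis % 3_600_000) / 60_000;
        int s = (millis % 60_000) / 1_000;
        int ms = millis % 1_000;
        return String.format("%02d:%02d:%02d,%03d", h, m, s, ms);
    }
}
```

For example, the first entry in the sample output below starts at `00:00:00,130` and lasts 130 milliseconds.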

Sample output

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechSynthesizer",
        "name": "MetaInfo",
        "status": 20000000,
        "status_message": "GATEWAY|SUCCESS|Success."
    },
    "payload": {
        "subtitles": [
            {
                "text": "xx",
                "begin_time": 130,
                "end_time": 260
            },
            {
                "text": "xx",
                "begin_time": 260,
                "end_time": 370
            }
        ]
    }
}

Note

  • The speech synthesis service returns subtitles based on how the original text is read aloud. Therefore, do not display the text returned by the timestamp feature as on-screen subtitles. Display the original text instead.

  • If you use this feature to generate video subtitles, you can obtain the start and end timestamps of each sentence from the returned response.
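The notes above can be sketched as follows: given the per-word entries of one sentence in the order they are returned, the sentence starts at the begin_time of the first word and ends at the end_time of the last word. The helper below is illustrative; each word span is represented as a two-element array of {begin_time, end_time} in milliseconds.

```java
import java.util.List;

// Derives the start and end timestamps of a whole sentence from the
// per-word subtitle entries returned in the MetaInfo event.
public class SentenceTiming {
    // Returns {begin_time of the first word, end_time of the last word},
    // both in milliseconds. Assumes the entries are in reading order.
    public static int[] sentenceSpan(List<int[]> wordSpans) {
        if (wordSpans.isEmpty()) {
            throw new IllegalArgumentException("no subtitle entries");
        }
        int begin = wordSpans.get(0)[0];
        int end = wordSpans.get(wordSpans.size() - 1)[1];
        return new int[] { begin, end };
    }
}
```

Applied to the two entries in the sample output above, the sentence spans 130 ms to 370 ms.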