Intelligent Speech Interaction: SDK for Java

Last Updated: Sep 18, 2023

The real-time speech recognition service provides an SDK for Java. This topic describes how to download and install the SDK and provides sample code that shows how to use it.

Precautions

  • Before you use the SDK, make sure that you understand how the SDK works. For more information, see Overview.

  • As of version 2.1.0, the nls-sdk-long-asr SDK is renamed nls-sdk-transcriber. If you use the nls-sdk-long-asr SDK and need to upgrade, remove the nls-sdk-long-asr dependency, add the nls-sdk-transcriber dependency, and add callbacks as prompted.

Download and installation

Download the latest version of the SDK from the Maven repository, and download the nls-sdk-java-demo package.

<dependency>
    <groupId>com.alibaba.nls</groupId>
    <artifactId>nls-sdk-transcriber</artifactId>
    <version>2.1.6</version>
</dependency>

Decompress the .zip demo package. Run the mvn package command in the directory that contains the pom.xml file. An executable JAR package named nls-example-transcriber-2.0.0-jar-with-dependencies.jar is generated in the target directory. Copy the JAR package to the destination server. You can use the JAR package for quick service validation and stress testing.

Service validation:

Run the following command and set the parameters as prompted.

After the command runs, the logs/nls.log file is generated in the directory where you ran the command.

java -cp nls-example-transcriber-2.0.0-jar-with-dependencies.jar com.alibaba.nls.client.SpeechTranscriberDemo

Stress testing:

Run the following command and set parameters as prompted.

Set the service URL to wss://nls-gateway.cn-shanghai.aliyuncs.com/ws/v1. Provide .pcm audio files with a sampling rate of 16,000 Hz. Set the maximum number of concurrent calls based on your purchased resources.

java -jar nls-example-transcriber-2.0.0-jar-with-dependencies.jar

Note

Charges are incurred if you make more than two concurrent calls to perform stress testing.
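The demo JAR package manages concurrency itself. If you write your own stress test instead, one way to keep the number of concurrent calls at or below a chosen limit is a fixed-size thread pool. The following is a minimal sketch under that assumption: BoundedStressSketch and runBounded are hypothetical names, and the Thread.sleep call is only a placeholder for one recognition call. It does not use the SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: caps the number of simulated calls that run at the same time.
class BoundedStressSketch {

    /** Runs taskCount simulated calls with at most maxConcurrent in flight and returns the observed peak. */
    static int runBounded(int taskCount, int maxConcurrent) {
        // A fixed-size pool never runs more than maxConcurrent tasks at once.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(20); // placeholder for one recognition call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                    }
                }));
            }
            // Wait for every simulated call to finish.
            for (Future<?> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + runBounded(6, 2));
    }
}
```

With a limit of 2, the reported peak never exceeds 2, no matter how many calls are queued.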

Key objects

  • NlsClient: the speech processing client. You can use this client to process short sentence recognition, real-time speech recognition, and speech synthesis tasks. This object is thread-safe. You can globally create one NlsClient object.

  • SpeechTranscriber: the real-time speech recognition object. You can use this object to set request parameters, send a request, and send audio data. This object is not thread-safe.

  • SpeechTranscriberListener: the recognition result listener, which listens to recognition results. This object is not thread-safe.

For more information, see Java API overview.

Important

Notes on SDK calls:

  • Creating an NlsClient object consumes time and resources because the object is built on Netty, but a created NlsClient object can be reused. We recommend that you create the NlsClient object and shut it down in line with the lifecycle of your project.

  • The SpeechTranscriber object cannot be reused. You must create a SpeechTranscriber object for each recognition task. For example, to process N audio files, you must create N SpeechTranscriber objects to complete N recognition tasks.

  • One SpeechTranscriberListener object corresponds to one SpeechTranscriber object. You cannot use one SpeechTranscriberListener object for multiple SpeechTranscriber objects. Otherwise, you may fail to distinguish recognition tasks.

  • The SDK for Java depends on Netty. If your application also depends on Netty, make sure that the version of Netty is 4.1.17.Final or later.
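The lifecycle rules in the preceding notes can be pictured with a small sketch. The classes below (SharedClient, TaskListener, SingleUseTask) are hypothetical stand-ins written for this illustration. They are not the SDK classes and only mimic the rules stated above: one long-lived client, one new task object per audio file, and one listener per task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins that mirror the lifecycle rules. They do not call the SDK.
class LifecycleSketch {

    /** Stand-in for the expensive, reusable client. */
    static final class SharedClient {
        private volatile boolean open = true;
        boolean isOpen() { return open; }
        void shutdown() { open = false; }
    }

    /** Stand-in for a listener that belongs to exactly one task. */
    static final class TaskListener {
        final String taskName;
        final List<String> results = new ArrayList<>();
        TaskListener(String taskName) { this.taskName = taskName; }
        void onResult(String text) { results.add(taskName + ": " + text); }
    }

    /** Stand-in for a recognition task object that cannot be reused. */
    static final class SingleUseTask implements AutoCloseable {
        private final SharedClient client;
        private final TaskListener listener;
        private boolean used;

        SingleUseTask(SharedClient client, TaskListener listener) {
            this.client = client;
            this.listener = listener;
        }

        void run(String audioName) {
            if (used) {
                throw new IllegalStateException("create a new task object for each recognition task");
            }
            if (!client.isOpen()) {
                throw new IllegalStateException("client is already shut down");
            }
            used = true;
            listener.onResult("recognized " + audioName);
        }

        @Override
        public void close() { /* release per-task resources here */ }
    }

    /** Processes N files with one client, N tasks, and N listeners. */
    static List<String> processAll(List<String> files) {
        SharedClient client = new SharedClient(); // created once
        List<String> all = new ArrayList<>();
        try {
            for (String file : files) {
                TaskListener listener = new TaskListener(file); // one listener per task
                try (SingleUseTask task = new SingleUseTask(client, listener)) {
                    task.run(file);
                }
                all.addAll(listener.results);
            }
        } finally {
            client.shutdown(); // shut down once, at the end of the lifecycle
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a.pcm", "b.pcm", "c.pcm")));
    }
}
```

Because each listener is tied to a single task, every result can be attributed to the file that produced it.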

Sample code

Note

  • Download the nls-sample-16k.wav file.

    The demo uses an audio file with a sampling rate of 16,000 Hz. To obtain an accurate recognition result, set the model to the universal model for the project to which the appkey is bound in the Intelligent Speech Interaction console. In actual use, select a model based on the audio sampling rate. For more information about model settings, see Manage projects.

  • The demo uses the default Internet access URL built in the SDK to access the real-time speech recognition service. To use an Elastic Compute Service (ECS) instance in the China (Shanghai) region to access this service in an internal network, you must set the URL for internal access when you create the NlsClient object.

    client = new NlsClient("ws://nls-gateway.cn-shanghai-internal.aliyuncs.com/ws/v1", accessToken);

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import com.alibaba.nls.client.AccessToken;
import com.alibaba.nls.client.protocol.InputFormatEnum;
import com.alibaba.nls.client.protocol.NlsClient;
import com.alibaba.nls.client.protocol.SampleRateEnum;
import com.alibaba.nls.client.protocol.asr.SpeechTranscriber;
import com.alibaba.nls.client.protocol.asr.SpeechTranscriberListener;
import com.alibaba.nls.client.protocol.asr.SpeechTranscriberResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * The sample code demonstrates how to perform the following operations:
 * Call the API of the real-time speech recognition service.
 * Dynamically obtain a token.
 * Use local files to simulate the sending of real-time streams.
 * Calculate time consumed for recognition.
 */
public class SpeechTranscriberDemo {
    private String appKey;
    private NlsClient client;
    private static final Logger logger = LoggerFactory.getLogger(SpeechTranscriberDemo.class);

    public SpeechTranscriberDemo(String appKey, String id, String secret, String url) {
        this.appKey = appKey;
        // Globally create an NlsClient object. The default endpoint is the Internet access URL of the real-time speech recognition service.
        // Obtain a token. You must obtain another token before the current token expires. You can call the accessToken.getExpireTime() method to query the expiration time of a token.
        AccessToken accessToken = new AccessToken(id, secret);
        try {
            accessToken.apply();
            System.out.println("get token, expire time: " + accessToken.getExpireTime());
            if (url.isEmpty()) {
                client = new NlsClient(accessToken.getToken());
            } else {
                client = new NlsClient(url, accessToken.getToken());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static SpeechTranscriberListener getTranscriberListener() {
        SpeechTranscriberListener listener = new SpeechTranscriberListener() {
            // Return intermediate results. This message is returned only if the setEnableIntermediateResult parameter is set to true.
            @Override
            public void onTranscriptionResultChange(SpeechTranscriberResponse response) {
                System.out.println("task_id: " + response.getTaskId() +
                    ", name: " + response.getName() +
                    // The status code. The code 20000000 indicates that the request is successful.
                    ", status: " + response.getStatus() +
                    // The sequence number of the sentence, which starts from 1.
                    ", index: " + response.getTransSentenceIndex() +
                    // The recognition result of the sentence.
                    ", result: " + response.getTransSentenceText() +
                    // The duration of the processed audio data. Unit: milliseconds.
                    ", time: " + response.getTransSentenceTime());
            }

            @Override
            public void onTranscriberStart(SpeechTranscriberResponse response) {
                // task_id is the unique identifier that indicates the interaction between the caller and the server. If an error occurs, you can submit a ticket and provide the task ID to Alibaba Cloud to facilitate troubleshooting.
                System.out.println("task_id: " + response.getTaskId() + ", name: " + response.getName() + ", status: " + response.getStatus());
            }

            @Override
            public void onSentenceBegin(SpeechTranscriberResponse response) {
                System.out.println("task_id: " + response.getTaskId() + ", name: " + response.getName() + ", status: " + response.getStatus());

            }

            // Recognize a complete sentence. The server can detect the beginning and end of a sentence. When the server detects the end of the sentence, it returns this message.
            @Override
            public void onSentenceEnd(SpeechTranscriberResponse response) {
                System.out.println("task_id: " + response.getTaskId() +
                    ", name: " + response.getName() +
                    // The status code. The code 20000000 indicates that the request is successful.
                    ", status: " + response.getStatus() +
                    // The sequence number of the sentence, which starts from 1.
                    ", index: " + response.getTransSentenceIndex() +
                    // The recognition result of the sentence.
                    ", result: " + response.getTransSentenceText() +
                    // The confidence level.
                    ", confidence: " + response.getConfidence() +
                    // The time when the server detects the beginning of the sentence.
                    ", begin_time: " + response.getSentenceBeginTime() +
                    // The duration of the processed audio data. Unit: milliseconds.
                    ", time: " + response.getTransSentenceTime());
            }

            // Indicate that the recognition is completed.
            @Override
            public void onTranscriptionComplete(SpeechTranscriberResponse response) {
                System.out.println("task_id: " + response.getTaskId() + ", name: " + response.getName() + ", status: " + response.getStatus());
            }

            @Override
            public void onFail(SpeechTranscriberResponse response) {
                // task_id is the unique identifier that indicates the interaction between the caller and the server. If an error occurs, you can submit a ticket and provide the task ID to Alibaba Cloud to facilitate troubleshooting.
                System.out.println("task_id: " + response.getTaskId() +  ", status: " + response.getStatus() + ", status_text: " + response.getStatusText());
            }
        };

        return listener;
    }

    // Calculate the equivalent audio duration, in milliseconds, from the size of the binary data in bytes.
    // The sampling rate must be 8,000 Hz or 16,000 Hz.
    public static int getSleepDelta(int dataSize, int sampleRate) {
        // 16-bit samples occupy 2 bytes each.
        int sampleBytes = 2;
        // Use a single audio channel.
        int soundChannel = 1;
        return (dataSize * 1000) / (sampleRate * sampleBytes * soundChannel);
    }

    public void process(String filepath) {
        SpeechTranscriber transcriber = null;
        try {
            // Create an object and establish a connection.
            transcriber = new SpeechTranscriber(client, getTranscriberListener());
            transcriber.setAppKey(appKey);
            // Specify the audio coding format.
            transcriber.setFormat(InputFormatEnum.PCM);
            // Specify the audio sampling rate.
            transcriber.setSampleRate(SampleRateEnum.SAMPLE_RATE_16K);
            // Specify whether to return intermediate results.
            transcriber.setEnableIntermediateResult(false);
            // Specify whether to add punctuation marks to the recognition result.
            transcriber.setEnablePunctuation(true);
            // Specify whether to enable inverse text normalization (ITN). A value of true indicates that Chinese numerals are converted to Arabic numerals.
            transcriber.setEnableITN(false);

            // Specify the threshold for detecting the end of a sentence. Unit: milliseconds. Valid values: 200 to 2000. Default value: 800.
            //transcriber.addCustomedParam("max_sentence_silence", 600);
            // Specify whether to enable semantic sentence detection to determine the end of a sentence.
            //transcriber.addCustomedParam("enable_semantic_sentence_detection",false);
            // Specify whether to enable disfluency detection.
            //transcriber.addCustomedParam("disfluency",true);
            // Specify whether to return information about words.
            //transcriber.addCustomedParam("enable_words",true);
            // Specify the threshold for recognizing audio streams as noise. Valid values: -1 to 1.
            // The closer the value is to -1, the more likely an audio stream is recognized as normal speech. That is, noise is more likely to be recognized as normal speech and processed by the service.
            // The closer the value is to 1, the more likely an audio stream is recognized as noise. That is, normal speech is more likely to be recognized as noise and ignored by the service.
            // This is an advanced parameter. Adjust it with caution, and test the service after each adjustment.
            //transcriber.addCustomedParam("speech_noise_threshold",0.3);
            // Specify the ID of the custom model after training.
            // transcriber.addCustomedParam("customization_id","ID of the custom model");
            // Specify the vocabulary ID of custom hotwords after training.
            // transcriber.addCustomedParam("vocabulary_id","vocabulary ID of custom hotwords");
            // Specify whether to ignore the recognition timeout issue of a single sentence.
            transcriber.addCustomedParam("enable_ignore_sentence_timeout",false);
            // Specify whether to enable post-processing for VAD.
            //transcriber.addCustomedParam("enable_vad_unify_post",false);

            // Serialize the preceding parameter settings to JSON and send them to the server for confirmation.
            transcriber.start();

            File file = new File(filepath);
            // Use try-with-resources so that the stream is closed even if sending fails.
            try (FileInputStream fis = new FileInputStream(file)) {
                byte[] b = new byte[3200];
                int len;
                while ((len = fis.read(b)) > 0) {
                    logger.info("send data pack length: " + len);
                    transcriber.send(b, len);
                    // In this example, a local file is read to simulate a real-time speech stream. Because the file is read much faster than real time, sleep between sends.
                    // To recognize real-time speech, you do not need to sleep. If the audio sampling rate is 8,000 Hz, set the second parameter to 8000.
                    int deltaSleep = getSleepDelta(len, 16000);
                    Thread.sleep(deltaSleep);
                }
            }

            // Notify the server that audio data has been sent. Wait until the server completes processing.
            long now = System.currentTimeMillis();
            logger.info("ASR wait for complete");
            transcriber.stop();
            logger.info("ASR latency : " + (System.currentTimeMillis() - now) + " ms");
        } catch (Exception e) {
            System.err.println(e.getMessage());
        } finally {
            if (null != transcriber) {
                transcriber.close();
            }
        }
    }

    public void shutdown() {
        client.shutdown();
    }

    public static void main(String[] args) throws Exception {
        String appKey = null; // Enter the appkey.
        String id = null; // Enter the AccessKey ID.
        String secret = null; // Enter the AccessKey secret.
        String url = ""; // Default value: wss://nls-gateway.cn-shanghai.aliyuncs.com/ws/v1.

        if (args.length == 3) {
            appKey   = args[0];
            id       = args[1];
            secret   = args[2];
        } else if (args.length == 4) {
            appKey   = args[0];
            id       = args[1];
            secret   = args[2];
            url      = args[3];
        } else {
            System.err.println("run error, need params(url is optional): " + "<app-key> <AccessKeyId> <AccessKeySecret> [url]");
            System.exit(-1);
        }
        // In this example, local files are used to simulate the sending of real-time streams. In actual use, you can collect or receive real-time speech data streams and send them to the ASR server.
        String filepath = "nls-sample-16k.wav";
        SpeechTranscriberDemo demo = new SpeechTranscriberDemo(appKey, id, secret, url);
        demo.process(filepath);
        demo.shutdown();
    }
}
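
The sample reads the file in 3,200-byte chunks and sleeps after each send. The arithmetic behind these numbers can be sketched as follows, assuming 16-bit (2-byte) mono PCM as in the sample. PacingSketch, chunkBytes, and audioMillis are hypothetical names used only for this illustration and are not part of the SDK.

```java
// Hypothetical sketch of the chunk-size and pacing arithmetic for 16-bit mono PCM.
class PacingSketch {
    static final int BYTES_PER_SAMPLE = 2; // 16-bit samples
    static final int CHANNELS = 1;         // mono

    /** Number of bytes that hold the given duration of audio. */
    static int chunkBytes(int sampleRate, int millis) {
        return sampleRate * BYTES_PER_SAMPLE * CHANNELS * millis / 1000;
    }

    /** Duration, in milliseconds, of the given number of audio bytes. */
    static int audioMillis(int dataSize, int sampleRate) {
        long bytesPerSecond = (long) sampleRate * BYTES_PER_SAMPLE * CHANNELS;
        return (int) (dataSize * 1000L / bytesPerSecond);
    }

    public static void main(String[] args) {
        System.out.println(chunkBytes(16000, 100));  // 3200
        System.out.println(chunkBytes(8000, 100));   // 1600
        System.out.println(audioMillis(3200, 16000)); // 100
        System.out.println(audioMillis(3200, 8000));  // 200
    }
}
```

At 16,000 Hz, a 3,200-byte chunk therefore holds 100 ms of audio, which is the sleep duration that the sample uses between sends. At 8,000 Hz, the same chunk holds 200 ms.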