Overview

To keep pace with the rapid development of intelligent hardware and customers' urgent need for speech interaction, Alibaba Cloud developed the Natural User Interaction (NUI) SDK, which supports comprehensive speech processing. The NUI SDK integrates the core speech processing algorithms that run on mobile clients and cloud servers. It provides far-field speech signal processing, wake-up word recognition, speech recognition, semantic understanding, and speech synthesis, and it exposes easy-to-use interfaces so that you can get started with speech processing quickly.

Notice

The prebuilt version of the NUI SDK runs only on the Linux operating system on the ATS3605(D) chip designed by Actions. The NUI SDK can process three input voice channels concurrently: two recording channels and one reference channel. The algorithms take effect only on specific device models. More chips will be supported in the future.
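The three-channel input can be pictured with a short sketch. It assumes, purely for illustration, that audio arrives as interleaved 16-bit little-endian PCM with the two recording channels first and the reference channel last. The source does not specify the sample format or the channel order, and split_channels is a hypothetical helper, not part of the NUI SDK.

```python
import struct

NUM_CHANNELS = 3  # two recording channels + one reference channel


def split_channels(pcm: bytes, num_channels: int = NUM_CHANNELS):
    """Split interleaved 16-bit little-endian PCM into per-channel sample lists.

    Assumption: samples are interleaved frame by frame (ch0, ch1, ch2, ch0, ...).
    """
    frame_size = 2 * num_channels  # bytes per frame (2 bytes per sample)
    if len(pcm) % frame_size != 0:
        raise ValueError("buffer does not contain a whole number of frames")
    samples = struct.unpack("<%dh" % (len(pcm) // 2), pcm)
    return [list(samples[ch::num_channels]) for ch in range(num_channels)]


# Two frames: (mic1, mic2, ref) = (1, 2, 3) and (4, 5, 6)
buffer = struct.pack("<6h", 1, 2, 3, 4, 5, 6)
mic1, mic2, ref = split_channels(buffer)
```

Under these assumptions, the reference channel would typically carry the device's own playback signal so that echo can be removed from the two recording channels.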

Description

The NUI SDK targets different scenarios from the common SDKs of Intelligent Speech Interaction. The common SDKs are used for short sentence recognition, real-time speech recognition, speech synthesis, and long-text-to-speech synthesis. The NUI SDK is typically used on intelligent hardware devices that require speech interaction in both the near field and the far field, such as intelligent speakers, storytelling machines for children's education, and IoT home appliances. Compared with the common SDKs, the NUI SDK provides more comprehensive speech processing capabilities and an end-to-end (E2E) solution for far-field speech processing.

Features of the NUI SDK

Far-field speech signal processing
In far-field scenarios, intelligent devices are often affected by adverse acoustic factors such as device echoes, human voices, environmental noise, and indoor reverberation. To counter these factors, the NUI SDK uses an audio front-end system to enhance the original audio stream and improve the signal-to-noise ratio (SNR) and intelligibility of the speech signal to be processed. This improves the interaction between the user and the device, or between users.
Wake-up word recognition
The NUI SDK allows you to specify custom wake-up words for the wake-up word recognition model. When the NUI SDK detects a specified wake-up word, it sends a wake-up signal to the client. You can specify multiple wake-up words and command words. The process from recording wake-up words to training the wake-up word recognition model takes about two to three weeks.
VAD
To save computing resources and reduce power consumption on the device side, the NUI SDK uses the built-in voice activity detection (VAD) feature to check whether the received audio data contains human voice. The NUI SDK sends only audio data that contains human voice to the server for speech recognition.

Short sentence recognition
The NUI SDK supports short sentence recognition, which recognizes speech of up to 60 seconds in real time. Short sentence recognition applies to scenarios such as chat conversations, voice command control, in-app voice search, and short voice messages.

Speech synthesis
Speech synthesis uses deep learning technology to convert text into fluent, human-like speech. You can specify the speaker, speed, intonation, and volume of the generated speech.
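The VAD gating described in the feature list above (only audio that contains human voice is sent to the server) can be illustrated with a minimal sketch. The source does not describe the VAD algorithm that the NUI SDK uses, so a fixed RMS energy threshold stands in for it here. The function names and the threshold value are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def frames_with_voice(frames, threshold=500.0):
    """Return only the frames whose RMS energy exceeds the threshold.

    Assumption: a fixed energy threshold stands in for the SDK's VAD model.
    """
    return [f for f in frames if rms(f) > threshold]


silence = [0, 3, -2, 1]
speech = [1200, -900, 1500, -1100]

# Only the loud frame would be forwarded to the server for recognition.
to_upload = frames_with_voice([silence, speech, silence])
```

A real detector has to distinguish speech from loud non-speech noise, which a plain energy threshold cannot do. The sketch only shows where the gate sits: before any audio leaves the device.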

Differences between the NUI SDK and common SDKs

Item	Speech recognition SDK (applicable to short sentence recognition, real-time speech recognition, and recording file recognition)	Speech synthesis SDK (applicable to speech synthesis and long-text-to-speech synthesis)	NUI SDK
Wake-up word recognition with echo elimination	×	×	√
Far-field noise reduction	×	×	√
Command word and shortcut word setting	×	×	√
VAD	×	×	√
Speech recognition	√	√	√
Speech synthesis	√	√	√
Billing rule	The real-time speech recognition and recording file recognition services are billed based on the duration of the processed audio data. The short sentence recognition service is billed based on the number of times that the service is called.	Billed based on the number of times that a specific service is called or the number of processed words.	Billed based on the number of devices on which the NUI SDK is activated.

Endpoints

Access type	Description	URL
External access from the Internet	This endpoint allows you to access an Intelligent Speech Interaction service from any host by using the Internet. By default, the Internet access URL is built in the NUI SDK.	wss://nls-gateway.ap-southeast-1.aliyuncs.com/ws/v1
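When you configure network access on a device, it can help to break the endpoint into its parts. The following sketch uses only the Python standard library to do that. The variable names are illustrative, and the fallback to port 443 reflects the standard default for the wss scheme rather than anything stated in the source.

```python
from urllib.parse import urlsplit

DEFAULT_ENDPOINT = "wss://nls-gateway.ap-southeast-1.aliyuncs.com/ws/v1"

parts = urlsplit(DEFAULT_ENDPOINT)
scheme = parts.scheme      # "wss": WebSocket over TLS
host = parts.hostname      # host that the device must be able to reach
port = parts.port or 443   # the URL has no explicit port; 443 is the wss default
path = parts.path          # WebSocket path on the gateway
```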

Possible states during the interaction process

The NUI SDK can be in one of the following states during the interaction process:

UNINIT: The NUI SDK is not initialized. This is the default state.
STOP: The NUI SDK is paused. After the NUI SDK is initialized, it enters the STOP state.
IDLE: The NUI SDK is waiting to be called. In the IDLE state, the NUI SDK receives audio data and can wake up the client when a wake-up word is detected. The NUI SDK remains in the IDLE state when a wake-up event occurs. You can call the interactive method to force the NUI SDK to enter the INTERACTIVE state.
INTERACTIVE: The NUI SDK is recognizing audio data. In the INTERACTIVE state, the NUI SDK receives and recognizes audio data. When the recognition task is completed or fails, the NUI SDK returns to the IDLE state.

The following table describes the features that each state supports.

Feature	UNINIT	STOP	IDLE	INTERACTIVE
Receives audio data	No	No	Yes	Yes
Supports wake-up word recognition	No	No	Yes	Yes
Supports speech recognition	No	No	No	Yes
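The state descriptions and the table above can be summarized as a small state machine. This is a sketch for reasoning about the life cycle, not the NUI SDK API. The event names are hypothetical labels, and the STOP to IDLE transition (labeled "start" here) is an assumption because the source does not say how the SDK leaves the STOP state.

```python
from enum import Enum


class State(Enum):
    UNINIT = "UNINIT"
    STOP = "STOP"
    IDLE = "IDLE"
    INTERACTIVE = "INTERACTIVE"


# (state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.UNINIT, "initialize"): State.STOP,        # initialization leads to STOP
    (State.STOP, "start"): State.IDLE,               # assumed: not named in the source
    (State.IDLE, "wake_up"): State.IDLE,             # a wake-up event leaves the SDK in IDLE
    (State.IDLE, "interactive"): State.INTERACTIVE,  # forced by the interactive method
    (State.INTERACTIVE, "recognition_done"): State.IDLE,
    (State.INTERACTIVE, "recognition_failed"): State.IDLE,
}

# Feature support per state, copied from the table:
# (receives audio, wake-up word recognition, speech recognition)
FEATURES = {
    State.UNINIT: (False, False, False),
    State.STOP: (False, False, False),
    State.IDLE: (True, True, False),
    State.INTERACTIVE: (True, True, True),
}


def next_state(state, event):
    """Return the state after `event`; raise if the move is not modeled."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


# Walk through one full interaction round.
state = State.UNINIT
for event in ["initialize", "start", "wake_up", "interactive", "recognition_done"]:
    state = next_state(state, event)
```

After this sequence the model is back in the IDLE state, which matches the description that a completed or failed recognition task returns the SDK to IDLE.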