Use an audio editor such as Cool Edit or Adobe Audition to open the audio file and check the format of the speech data. Play the file and inspect the tracks, speech waveform, sound energy, and frequency spectrum. The standard data format for automatic speech recognition (ASR) is 16-bit mono audio sampled at 8 kHz or 16 kHz. The recording file recognition service also supports dual-channel (stereo) audio.
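If an audio editor is not at hand, the same format check can be approximated in a script. The following is a minimal sketch that uses Python's standard `wave` module and assumes the file is an uncompressed PCM WAV file. The function names `describe_wav` and `format_problems` are illustrative and are not part of any Intelligent Speech Interaction SDK.

```python
import wave

# Sampling rates named in this guide as the standard ASR input rates.
SUPPORTED_RATES_HZ = (8000, 16000)


def describe_wav(path):
    """Read the header of a PCM WAV file and return its basic format."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate_hz": wav.getframerate(),
        }


def format_problems(info, allow_two_channels=False):
    """List every way the format deviates from 16-bit, 8/16 kHz, mono.

    Pass allow_two_channels=True for recording file recognition,
    which also accepts two-channel audio.
    """
    problems = []
    if info["bits_per_sample"] != 16:
        problems.append(
            "sample width is %d-bit, expected 16-bit" % info["bits_per_sample"]
        )
    if info["sample_rate_hz"] not in SUPPORTED_RATES_HZ:
        problems.append(
            "sampling rate is %d Hz, expected 8000 or 16000 Hz"
            % info["sample_rate_hz"]
        )
    max_channels = 2 if allow_two_channels else 1
    if info["channels"] > max_channels:
        problems.append(
            "file has %d channels, expected at most %d"
            % (info["channels"], max_channels)
        )
    return problems
```

For example, `format_problems(describe_wav("call.wav"))` returns an empty list when the hypothetical file `call.wav` is 16-bit mono audio sampled at 8 kHz or 16 kHz.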
Check whether the project configured in the Intelligent Speech Interaction console uses a model that supports the audio sampling rate and scenario of the speech data.
Play the audio file and listen carefully.
First, check for noise. If noise exists, determine whether it is made by humans, such as the distant voice of a person other than the speaker, or made by other objects such as knocks on a table, the noise when a door opens, or car horns.
Second, focus on the speech itself. Check whether the speaker pronounces words clearly enough to be recognized, swallows any sounds, speaks excessively fast, has a strong accent, or uses a dialect.
View the speech waveform, sound energy, and frequency spectrum. For the recording file recognition service, you also need to view the tracks.
- First, check whether the amplitude of the speech waveform is too large or too small. The following three figures use speech data sampled at 8 kHz as an example. The following figure shows the normal speech waveform in green and the frequency band of the speech data in the frequency domain in red.
The following figure shows that the amplitude of the speech waveform is too small, where the sound energy is too low.
The following figure shows that the amplitude of the speech waveform is too large and exceeds the linear range of the system. In this case, limiting is required.
- Second, check whether the frequency band of the speech data sampled at 8 kHz or 16 kHz is complete in the frequency domain. Multiply the highest frequency that the band reaches by 2 to obtain the effective audio sampling rate of the speech data in kHz. The following figure uses speech data nominally sampled at 8 kHz as an example. However, the actual frequency band only reaches 3 kHz, which corresponds to an effective sampling rate of 6 kHz (3 kHz multiplied by 2) instead of 8 kHz. The frequency content between 3 kHz and 4 kHz is lost.
- Third, for the recording file recognition service, you also need to check whether the speech data is recorded in the same track or different tracks. For example, in a customer service scenario, if the voice of the customer is recorded in the same track as that of the agent, their voices may overlap. Therefore, their voices need to be recorded separately in two tracks to avoid overlapping.
Check whether hotwords or custom models are used.
- First, check whether categorized hotwords and extended hotwords are used. You need to limit the weight of extended hotwords.
- Second, check whether a custom model is used to optimize the speech recognition rate, and whether poorly recognized sentences are repeated more times in the training corpus to train and optimize the custom model.
Note: Speech recognition cannot achieve 100% accuracy or eliminate all bad cases.
Select a project whose model supports the audio sampling rate and scenario of your speech data.
If the speech itself is defective, for example, the speaker swallows sounds or pronounces words unclearly, the error is not considered an ASR error.
- If the speaker uses any dialects or has a strong accent, the error may be caused by insufficient ASR training data.
- If you need to recognize large amounts of speech data with a strong accent but no dialect, contact Alibaba Cloud Intelligent Speech Interaction engineers for help.
If noise made by humans is recognized as speech by mistake, the error is difficult to resolve because the noise model always gives priority to human voices for ASR.
If noise made by other objects is recognized as speech by mistake, you can collect more noise samples and provide them to Alibaba Cloud Intelligent Speech Interaction engineers to optimize the noise model.
If the amplitude of the speech waveform is too small and the sound energy is too low, the noise model may treat the speech data as noise, so nothing is recognized. In this case, we recommend that you adjust the recording device or move closer to the device when you speak.
If the amplitude of the speech waveform is too large and the sound energy is too high, the waveform may be clipped, and the resulting distortion can cause recognition errors. In this case, we recommend that you adjust the recording device or move farther away from the device when you speak.
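The two amplitude problems above can also be screened for numerically before you listen to every file. The following is a rough sketch that assumes 16-bit mono PCM WAV input. The function names and both thresholds (`quiet_rms_dbfs` and `max_clipped_ratio`) are illustrative assumptions, not values published for the service, and should be tuned against recordings that are known to be good.

```python
import math
import sys
import wave
from array import array

FULL_SCALE = 32768  # magnitude of the most negative 16-bit sample


def read_mono_16bit(path):
    """Load all samples of a 16-bit mono PCM WAV file as signed integers."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples


def to_dbfs(value):
    """Convert a linear sample magnitude to decibels relative to full scale."""
    return -math.inf if value == 0 else 20 * math.log10(value / FULL_SCALE)


def level_report(samples):
    """Summarize peak level, RMS level, and the share of full-scale samples."""
    if len(samples) == 0:
        raise ValueError("no samples")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    clipped = sum(1 for s in samples if abs(s) >= 32767)
    return {
        "peak_dbfs": to_dbfs(peak),
        "rms_dbfs": to_dbfs(rms),
        "clipped_ratio": clipped / len(samples),
    }


def classify(report, quiet_rms_dbfs=-40.0, max_clipped_ratio=0.001):
    """Label a recording as too loud, too quiet, or ok.

    Both thresholds are assumptions chosen for illustration only.
    """
    if report["clipped_ratio"] > max_clipped_ratio:
        return "too loud"
    if report["rms_dbfs"] < quiet_rms_dbfs:
        return "too quiet"
    return "ok"
```

A typical call is `classify(level_report(read_mono_16bit("call.wav")))`, where `call.wav` is a hypothetical file name. Long silences lower the RMS level of the whole file, so a file flagged as too quiet still deserves a look at its waveform.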
If the frequency band of the speech data is incomplete in the frequency domain, the speech data may be incorrectly recognized. The standard training data of the ASR model is sampled at 8 kHz or 16 kHz with a complete frequency band. We recommend that you check whether your speech data is sampled at 8 kHz or 16 kHz with a complete frequency band. In addition, we recommend that you use custom models to optimize the speech recognition rate.
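The rule that the highest occupied frequency multiplied by 2 gives the effective sampling rate can be sketched in code. The example below computes a plain discrete Fourier transform of one short frame by using only the standard library, then reports the highest frequency whose energy lies within `floor_db` of the strongest frequency. The function names, the `floor_db` value, and the `min_ratio` value are assumptions made for illustration. A practical check would average many voiced frames and would normally use an FFT library.

```python
import cmath
import math


def frame_spectrum(frame, sample_rate_hz):
    """Return (frequency_hz, energy) pairs from a naive DFT of one frame."""
    n = len(frame)
    if n == 0:
        raise ValueError("empty frame")
    # A periodic Hann window reduces leakage between frequency bins.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
        for i, s in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append((k * sample_rate_hz / n, abs(acc) ** 2))
    return spectrum


def highest_active_frequency_hz(frame, sample_rate_hz, floor_db=-50.0):
    """Highest frequency whose energy is within floor_db of the peak."""
    spectrum = frame_spectrum(frame, sample_rate_hz)
    peak = max(energy for _, energy in spectrum)
    if peak == 0:
        return 0.0
    threshold = peak * 10 ** (floor_db / 10)
    return max(freq for freq, energy in spectrum if energy >= threshold)


def effective_sample_rate_hz(frame, sample_rate_hz, floor_db=-50.0):
    """Apply the 'multiply by 2' rule to the highest active frequency."""
    return 2 * highest_active_frequency_hz(frame, sample_rate_hz, floor_db)


def band_is_complete(frame, sample_rate_hz, min_ratio=0.9):
    """True if the effective rate reaches min_ratio of the nominal rate."""
    return effective_sample_rate_hz(frame, sample_rate_hz) >= (
        min_ratio * sample_rate_hz
    )
```

For a frame from a file that is nominally sampled at 8 kHz but contains nothing above 3 kHz, `effective_sample_rate_hz` returns roughly 6000 and `band_is_complete` returns `False`. Frames of a few hundred samples keep the naive transform fast enough for spot checks.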
If hotwords are used, limit the weight of hotwords. Otherwise, the speech data may be truncated.
To resolve general speech recognition errors, you can create and train custom models to optimize the speech recognition rate. Specifically, you can repeat poorly recognized sentences (not single words) more times in the training corpus to increase their weight in the language model.
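The repetition technique can be illustrated with a small helper that assembles a training corpus. This is only a sketch: it assumes that the corpus is plain text with one sentence per line, and the function name `build_corpus` and the default repeat count are hypothetical. Check the corpus format and size limits that the console actually accepts before you upload a file.

```python
def build_corpus(base_sentences, boosted_sentences, repeats=5):
    """Build corpus text in which boosted sentences appear several times.

    base_sentences are written once each. boosted_sentences, the whole
    sentences that were recognized poorly, are written `repeats` times
    each to raise their weight. One sentence per line is an assumed format.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    lines = [s.strip() for s in base_sentences if s.strip()]
    for sentence in boosted_sentences:
        sentence = sentence.strip()
        if sentence:
            lines.extend([sentence] * repeats)
    return "\n".join(lines) + "\n"
```

For example, `build_corpus(["turn on the light"], ["check my account balance"], repeats=3)` produces four lines, three of which are the poorly recognized sentence.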
For the recording file recognition service, if the speech data is recorded in the same track, different human voices may overlap and cannot be recognized correctly. This error is not an ASR error. We recommend that you record different human voices separately in different tracks.
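A two-channel file does not guarantee that the speakers are actually separated, because some recorders write the same mixed signal to both channels. The sketch below, which assumes 16-bit two-channel PCM WAV input, splits the interleaved samples and measures how much the channels differ. The function names and the `tolerance` value are illustrative assumptions. The check only detects a duplicated mono mix and cannot detect partial crosstalk between channels.

```python
import sys
import wave
from array import array


def read_stereo_16bit(path):
    """Load interleaved samples [L0, R0, L1, R1, ...] from a stereo WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 2:
            raise ValueError("expected 16-bit two-channel PCM")
        samples = array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples


def split_channels(interleaved):
    """Split interleaved stereo samples into (left, right)."""
    if len(interleaved) % 2:
        raise ValueError("odd number of samples in stereo data")
    return interleaved[0::2], interleaved[1::2]


def channel_difference_ratio(left, right):
    """Energy of (left - right) relative to the total energy of both channels.

    The ratio is 0 for identical channels and close to 1 for unrelated ones.
    """
    diff = sum((l - r) ** 2 for l, r in zip(left, right))
    total = sum(l * l for l in left) + sum(r * r for r in right)
    return 0.0 if total == 0 else diff / total


def looks_like_duplicated_mono(left, right, tolerance=0.01):
    """True if both channels carry essentially the same signal."""
    return channel_difference_ratio(left, right) < tolerance
```

If `looks_like_duplicated_mono(*split_channels(read_stereo_16bit("call.wav")))` returns `True` for the hypothetical file `call.wav`, both speakers were mixed before recording, and the recording setup needs to change so that each speaker has a separate track.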
If your error does not match any of the preceding scenarios or still cannot be resolved, open a ticket and provide the following information:
The version of the deployed Alibaba Cloud Intelligent Speech Interaction, such as Intelligent Speech Interaction V1.0 or V2.0.
The service that you call, such as short sentence recognition, real-time speech recognition, or recording file recognition.
Your business scenario.
The audio sampling rate of your speech data, such as 8 kHz or 16 kHz.
Whether hotwords are used.
Whether custom models are used and poorly recognized sentences are repeated more times during custom model training.
Confirm that the preceding information is complete and that custom models are trained and optimized as described in item 6. If the error persists, provide the speech data that cannot be recognized properly, together with the correct transcript and the incorrect recognition result for the data, and briefly describe the speech recognition error.