How to Build a Voiceprint System in Just Three Steps

In this post, we'll show you how you can easily build a voiceprint recognition and retrieval system in just three steps with AnalyticDB.

By Hanchao


Voiceprint retrieval, as its name implies, is the process in which you authenticate or recognize speakers by their voice. A critical step in voiceprint recognition is voice vectorization, which converts the voice of speakers into structured vectors. Alibaba Cloud AnalyticDB for MySQL and AnalyticDB for PostgreSQL have provided a solution for voiceprint authentication and retrieval. With a few simple SQL commands, you can build a set of high-accuracy voiceprint retrieval and authentication services in just three steps.

Voiceprint Recognition Technologies

1. Voiceprint Retrieval Demo

Figure 1. Voiceprint demo system

Figure 1 shows the demo interface of the voiceprint retrieval system in AnalyticDB Vector Edition. To facilitate your experience, we have converted the voice information of 340 people into vectors and stored these vectors in the system. The current demo system consists of two parts. In the first part, which is the retrieval function, you can either import a recorded sound file or record a sound file on site and upload it. Then, you can submit the sound file to the voiceprint database for matching and retrieval. In the second part, which is the registration function, you can register and upload your own voice to the current voiceprint database to facilitate later queries and authentication. We will describe each function separately in the following sections.

Figure 2. Voice query

As shown in figure 2, BAC009S0004W0486.wav, a test audio file that contains the voice of S0004, is uploaded to the voiceprint database for retrieval. S0004 ranks first and appears at the top of the result table.

Figure 3. Voice registration

Figure 3 shows the voiceprint registration system, in which you can register your own voice in the backend voiceprint database for later retrieval. For example, the user Hanchao registers his voice with a clip that is only 7 seconds long. At present, the system supports text-independent registration, so you can register by speaking any words.

Figure 4. Voice recording and retrieval

As shown in figure 4, users can record their voice on site and upload it to the system for retrieval. For example, Hanchao records a 5-second voice clip and submits it to the voiceprint system for retrieval. Hanchao's voice, which was previously registered, ranks first in the result table.

The current voiceprint demo system returns the results of 1:N identification. With this method, you can identify which registered speaker is talking in a conference room by voice. For 1:1 authentication, the demo uses a distance threshold of 550: an input voice is accepted only if its distance to the registered voice is below that value.

2. Overall Design of Application Structure

Figure 5. Voiceprint retrieval database

Figure 5 shows the overall structure of the retrieval system built on the Alibaba Cloud voiceprint database. AnalyticDB (the voiceprint database) is responsible for storing and querying all structured information (user registration ID, user name, and other user information) and unstructured information (vectors generated from voice) throughout the voiceprint retrieval application. During the query process, you can use voiceprint extraction models to convert voice into vectors and query them in AnalyticDB. The system returns the corresponding user information and the L2 vector distance [5]. We will explain how to train and test voice extraction models in the next article.
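The vector distance returned by the query is a Euclidean (L2) distance [5]. As a minimal sketch of what that number means, the following plain-Python helper computes it for two toy vectors. The function here is only an illustration written for this post and is not the database's own implementation; the sample values are made up.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional vectors; real voiceprint vectors are much longer.
query = [1.0, 2.0, 2.0]
stored = [1.0, 0.0, 2.0]
print(l2_distance(query, stored))  # 2.0
```

A smaller distance means the two voices are more similar, which is why the retrieval results are sorted by distance in ascending order.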

3. System Accuracy

The current demo voiceprint system uses the GMM-UBM model to extract i-vectors for retrieval [3]. In addition, we have trained a more accurate deep learning model for voiceprint recognition (x-vector [4]). Furthermore, we can train voiceprint models for specific scenarios, such as phone calls, mobile apps, and noisy environments.

The accuracy of voiceprint recognition (1:N) on datasets that are commonly used in academia (the Aishell.v1 [1] and TIMIT [2] datasets) is more than 99.5%, as listed in Table 1.

Table 1. Accuracy of the results that rank first
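To make the metric concrete, here is a minimal sketch of how a "ranks first" (top-1) accuracy could be computed: for each test clip, compare the speaker returned in first place with the true speaker. The function name and the speaker labels below are made up for illustration and are not the evaluation data behind the figures above.

```python
def top1_accuracy(predicted, actual):
    """Fraction of test clips whose first-ranked speaker is the true speaker."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical results for four test clips (illustrative labels only).
predicted = ["S0002", "S0004", "S0005", "S0004"]
actual = ["S0002", "S0004", "S0005", "S0006"]
print(top1_accuracy(predicted, actual))  # 0.75
```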

Three Steps to Building a Voiceprint System

Step 1: Initialization

The first step is initialization.

The current system has implemented the voice-to-vector conversion function. After you send the voice obtained from the frontend to the Alibaba Cloud service system through a POST request and select the appropriate voiceprint model, the system converts the voice into a corresponding vector.

import json

import numpy as np
import requests

# sound: binary content of the sound file.
# model_id: ID of the voiceprint model.
def get_vector(sound, model_id='i-vector'):
    url = ''  # Endpoint of the voice-to-vector service (left blank here).
    d = {'resource': sound,
         'model_id': model_id}
    r = requests.post(url, data=d)
    js = json.loads(r.text)
    return np.array(js['emb'])

# Read the user file as bytes.
file = 'xxx.wav'
with open(file, 'rb') as f:
    data = f.read()

During initialization, create a corresponding user voiceprint table. In addition, add a vector index to the vector column in the table to accelerate the query process. The current voiceprint model generates 400-dimensional vectors. Therefore, set the index parameter "dim" to 400.

-- Create a user voiceprint table
CREATE TABLE person_voiceprint_detection_table(
  id serial primary key,
  name varchar,
  voiceprint_feature float4[]
);

-- Create a vector index; dim matches the 400-dimensional voiceprint vectors
CREATE INDEX person_voiceprint_detection_table_idx
ON person_voiceprint_detection_table
USING ann(voiceprint_feature)
WITH (dim=400);

Step 2: Registering User's Voice

The second step is registering the user's voice.

During registration, register a user and insert a record into the current system.

-- Register the user "John" in the current system.
-- Use the HTTP service to convert the voiceprint into a corresponding vector.

INSERT INTO person_voiceprint_detection_table(name, voiceprint_feature)
SELECT 'John', array[-0.017,-0.032,...]::float4[];
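The vector in the INSERT statement has to come from the application. One way to bridge the two, sketched below under assumptions, is a small helper that checks the vector length against the 400 dimensions used by the index and renders it as an array literal. The helper name `to_sql_array` and its `expected_dim` parameter are hypothetical and exist only for this illustration; in a real application, passing the values as query parameters through your database driver is usually safer than building SQL text.

```python
import math

def to_sql_array(vector, expected_dim=400):
    """Render a numeric vector as a float4[] array literal (illustrative helper)."""
    values = [float(v) for v in vector]
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinite values")
    return "array[" + ",".join(repr(v) for v in values) + "]::float4[]"

# A 2-dimensional toy vector, so expected_dim is overridden for the demo.
print(to_sql_array([-0.017, -0.032], expected_dim=2))
# array[-0.017,-0.032]::float4[]
```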

Step 3: Retrieving and Authenticating User's Voice

The third step is retrieving and authenticating the user's voice.

Voiceprint authentication for door locks (1:1): The authentication system obtains the user's identity information (user_id) and calculates the distance between the input voice vector and the user's voice vector in the voiceprint database. Generally, a distance threshold (threshold = 550) is set in the system. If the distance between the vectors is larger than the threshold, the authentication fails. If the distance is lower than the threshold, the voiceprint authentication is successful.

-- Voiceprint authentication for door locks (1:1)

SELECT  id,    -- User ID
        name,  -- User name
        l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) AS distance -- Distance between the vectors
FROM person_voiceprint_detection_table -- User voice table
WHERE l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) < 550 -- Distance threshold; a column alias cannot be referenced in WHERE
AND id = user_id; -- Placeholder: replace user_id with the numeric ID of the user to authenticate
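The decision rule behind 1:1 authentication can be sketched in a few lines of Python. This is only an illustration of the threshold comparison, assuming the registered vector for the claimed user has already been fetched; the function name `verify` and the toy 2-dimensional vectors are made up, and only the threshold of 550 comes from the text above.

```python
import math

THRESHOLD = 550.0  # Distance threshold quoted in this post.

def verify(registered, candidate, threshold=THRESHOLD):
    """Return (accepted, distance) for a 1:1 voiceprint check."""
    if len(registered) != len(candidate):
        raise ValueError("vectors must have the same dimension")
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(registered, candidate)))
    return distance < threshold, distance

# Toy vectors chosen so the distances are easy to check by hand.
registered = [0.0, 0.0]
print(verify(registered, [300.0, 400.0]))  # (True, 500.0)
print(verify(registered, [600.0, 800.0]))  # (False, 1000.0)
```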

Voiceprint retrieval for conference (1:N identification): The system identifies the voice of the current speaker and returns the information of the most relevant registered users. If the system returns no results, the current conference speaker is not registered in the voiceprint database.

-- Voiceprint recognition of a conference speaker (1:N identification)

SELECT  id,    -- User ID
        name,  -- User name
        l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) AS distance -- Distance between the vectors
FROM person_voiceprint_detection_table -- User voice table
WHERE l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) < 550 -- Distance threshold; a column alias cannot be referenced in WHERE
ORDER BY voiceprint_feature <-> ARRAY[-0.017,-0.032,...]::float4[] -- Sort by vector distance
LIMIT 1; -- Return the most similar result
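For comparison, the same 1:N logic can be written as a brute-force scan in Python: compute the distance to every registered voice, discard those at or above the threshold, and keep the closest. This is a sketch for understanding only; the `identify` function, the in-memory dictionary, and the toy vectors are assumptions made for this example, and the database answers the query with its vector index rather than a full scan.

```python
import math

def identify(registered, query, threshold=550.0):
    """Return (name, distance) of the closest registered voice, or None."""
    best = None
    for name, vector in registered.items():
        distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(vector, query)))
        if distance < threshold and (best is None or distance < best[1]):
            best = (name, distance)
    return best

# Toy 2-dimensional "voiceprints" keyed by speaker label.
registered = {"S0002": [0.0, 0.0], "S0004": [1000.0, 0.0]}
print(identify(registered, [990.0, 0.0]))      # ('S0004', 10.0)
print(identify(registered, [5000.0, 5000.0]))  # None
```

A result of None corresponds to the case where the query returns no rows, meaning the speaker is not registered.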


[1] Aishell Data set.

[2] TIMIT Data set.

[3] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.

[4] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," Interspeech, 2017, pp. 999-1003.

[5] Anton, Howard (1994), Elementary Linear Algebra (7th ed.), John Wiley & Sons, pp. 170-171, ISBN 978-0-471-58742-2
