This document introduces the voiceprint recognition solution based on AnalyticDB for MySQL and details a use case for detecting sensitive content in recordings of ride-hailing drivers. This solution is applicable to scenarios such as ride-hailing services, multi-person meetings, offline sales, AI voice recorders, and voice assistants. It identifies speakers and inspects spoken content to help enterprises efficiently build intelligent voiceprint retrieval systems.
Background
In the digital era, voice is a key biometric identifier for identity authentication, security control, and intelligent interaction. Voiceprint recognition technology extracts vocal features and converts them into structured vectors to efficiently verify and retrieve speakers.
AnalyticDB for MySQL provides an end-to-end voiceprint recognition solution based on its native vector storage and retrieval capabilities. It supports three core functions: voiceprint comparison, retrieval, and clustering. You can extend the solution with speaker diarization, speech-to-text, and content quality inspection to quickly build high-precision voiceprint retrieval systems.
Limitations
The voiceprint retrieval feature is currently in invitational preview. To use this feature, submit a ticket to technical support to enable it.
Features
Voiceprint comparison
This feature uses a built-in voiceprint model to extract voiceprint features from raw audio and convert them into structured vectors. By calculating the similarity between two voice vectors, it determines whether they belong to the same speaker, enabling 1:1 voiceprint identity verification.
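The 1:1 comparison step can be sketched in a few lines. This is only an illustration: it assumes cosine similarity as the metric (the same metric the SQL retrieval example in this document uses), and the function names and the 0.7 threshold are hypothetical placeholders, not values defined by the service.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_same_speaker(a: Sequence[float], b: Sequence[float],
                    threshold: float = 0.7) -> bool:
    """Decide 1:1 verification by comparing similarity with a threshold.

    The default threshold is a hypothetical placeholder; a real deployment
    would tune it on labeled same-speaker and different-speaker pairs.
    """
    return cosine_similarity(a, b) >= threshold
```

A higher threshold lowers false accepts at the cost of more false rejects, so the value is usually chosen per use case.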
Voiceprint retrieval
This feature uses voiceprint feature vectors and an efficient index to quickly retrieve a target speaker from an established voiceprint library. It supports 1:N voiceprint recognition scenarios and is ideal for efficient identity matching in large-scale voiceprint libraries.
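To make the 1:N idea concrete, the sketch below ranks every enrolled voiceprint against a query vector and keeps the best matches. It is an exact brute-force scan written for illustration only; the service relies on a vector index instead, and the names `search_voiceprints` and `library` are assumptions introduced here.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_voiceprints(query: Sequence[float],
                       library: Dict[str, Sequence[float]],
                       top_k: int = 1) -> List[Tuple[str, float]]:
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = ((name, _cosine(query, vec)) for name, vec in library.items())
    return heapq.nlargest(top_k, scored, key=lambda item: item[1])
```

A brute-force scan costs one similarity computation per enrolled voiceprint, which is why large libraries use an approximate nearest neighbor index rather than this approach.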
Voiceprint clustering
This feature uses unsupervised learning techniques to analyze unlabeled audio data and automatically group it by speaker identity. It effectively handles multi-speaker audio scenarios to enable intelligent grouping and management of audio data.
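One simple way to picture unsupervised grouping is a greedy, threshold-based pass over the embeddings. The sketch below is an illustrative stand-in, not the algorithm the service uses; the function name and the 0.7 threshold are assumptions.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_embeddings(embeddings: Sequence[Sequence[float]],
                       threshold: float = 0.7) -> List[int]:
    """Assign each embedding a cluster label without any speaker labels.

    Each vector joins the most similar existing cluster if the similarity
    reaches the (hypothetical) threshold; otherwise it starts a new cluster.
    """
    sums: List[List[float]] = []  # running sum of member vectors per cluster
    labels: List[int] = []
    for vec in embeddings:
        best, best_sim = -1, threshold
        for idx, total in enumerate(sums):
            sim = _cosine(vec, total)  # a sum has the same direction as a mean
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best == -1:
            sums.append(list(vec))
            labels.append(len(sums) - 1)
        else:
            sums[best] = [t + v for t, v in zip(sums[best], vec)]
            labels.append(best)
    return labels
```

The result of a greedy pass depends on input order, which is one reason production systems tend to prefer methods such as agglomerative or spectral clustering.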
Usage
API
Generate audio embedding - /audio/embedding
Method: POST
Function: Generates an embedding vector for an audio file.
Parameters:
Parameter | Type | Required | Description |
input_audio | string | Yes | URL of the input audio. |
oss_ak | string | No | OSS AccessKey ID. |
oss_sk | string | No | OSS AccessKey Secret. |
oss_token | string | No | OSS STS token. |
start_time | float | No | Start timestamp. |
end_time | float | No | End timestamp. |
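The parameters in this table map directly onto a JSON request body. The following sketch builds, but does not send, such a request with the Python standard library. The helper name `build_embedding_request` is hypothetical, and `http://addr:8100` and the API key are the same placeholders this document uses in its curl samples.

```python
import json
import urllib.request
from typing import Optional


def build_embedding_request(base_url: str, api_key: str, input_audio: str,
                            start_time: Optional[float] = None,
                            end_time: Optional[float] = None
                            ) -> urllib.request.Request:
    """Build a POST request for /audio/embedding without sending it."""
    payload = {"input_audio": input_audio}
    # Optional fields are added only when the caller supplies them.
    if start_time is not None:
        payload["start_time"] = start_time
    if end_time is not None:
        payload["end_time"] = end_time
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/embedding",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the optional OSS credential fields could be added to the payload in the same way.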
Request example:
curl -X POST "http://addr:8100/audio/embedding" \
-H "Authorization: Bearer {api-key}" \
-H "Content-Type: application/json" \
-d '{
"input_audio": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"start_time": 1220,
"end_time": 3200
}'
Response example:
{
"code": 200,
"message": "success",
"result": [1.861976, 0.151182, -0.888397, ...]
}
Enroll a voiceprint - /voice/enroll
Method: POST
Function: Enrolls a new voiceprint sample.
Parameters:
Parameter | Type | Required | Description |
audio_url | string | Yes | URL of the input audio. |
name | string | Yes | Name of the voiceprint. |
Request example:
curl -X POST "http://addr:8100/voice/enroll" \
-H "Authorization: Bearer {api-key}" \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"name": "test1"
}'
Response example:
{"code": 200, "message": "Voiceprint enrollment successful", "result": true}
Query a voiceprint - /voice/query
Method: GET or POST
Function: Queries enrolled voiceprint records.
Parameters:
Parameter | Type | Required | Description |
name | string | No | Name of the voiceprint. |
id | integer | No | ID of the voiceprint. |
Request example:
curl "http://addr:8100/voice/query" \
-H "Authorization: Bearer {api-key}"
Response example:
{
"code": 200,
"message": "Found 1 voiceprint records",
"result": [{"id": 1968033551534260224, "name": "test1", "location": null}]
}
Delete a voiceprint - /voice/delete
Method: DELETE or POST
Function: Deletes a specified voiceprint record.
Parameters:
Parameter | Type | Required | Description |
name | string | Conditional | Name of the voiceprint. You must specify either this parameter or id. |
id | integer | Conditional | ID of the voiceprint. You must specify either this parameter or name. |
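A client can enforce the conditional rule before calling the endpoint. The sketch below assumes the rule means that at least one of name and id must be present; the helper name `build_delete_payload` is hypothetical.

```python
from typing import Optional


def build_delete_payload(name: Optional[str] = None,
                         voiceprint_id: Optional[int] = None) -> dict:
    """Return a JSON-ready body for /voice/delete.

    Raises ValueError when neither identifier is given, on the assumption
    that the service requires at least one of them.
    """
    if name is None and voiceprint_id is None:
        raise ValueError("specify name or id")
    payload = {}
    if name is not None:
        payload["name"] = name
    if voiceprint_id is not None:
        payload["id"] = voiceprint_id
    return payload
```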
Request example:
curl -X POST "http://addr:8100/voice/delete" \
-H "Authorization: Bearer {api-key}" \
-H "Content-Type: application/json" \
-d '{
"name": "test1"
}'
Response example:
{"code": 200, "message": "Voiceprint deletion successful", "result": true}
Voiceprint retrieval - /voice/search
Method: POST
Function: Finds the best-matching voiceprint for a given audio file.
Parameters:
Parameter | Type | Required | Description |
audio_url | string | Yes | URL of the audio file to search. |
names | array | No | List of voiceprint names to include in the search. |
top_k | integer | No | Number of top results to return. Default value: 1. |
Request example:
curl -X POST "http://addr:8100/voice/search" \
-H "Authorization: Bearer {api-key}" \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"top_k": 1
}'
Response example:
{
"code": 200,
"message": "Found 1 matching voiceprints",
"result": [{"name": "test1", "similarity": 0.99999994}]
}
ASR (speech-to-text) - /bailian/funasr/asr
Method: POST
Function: Transcribes audio to text using Alibaba Cloud Model Studio FunASR.
Parameters:
Parameter | Type | Required | Description |
source_url | string | Yes | Path of the audio file in OSS. |
model_name | string | No | The model name. The default value is |
diarization | bool | No | Specifies whether to enable speaker diarization. The default value is |
speaker_count | int | No | Number of speakers. If this parameter is not specified, the system automatically detects the number of speakers. |
output_type | string | No | The output format. |
lang | string | No | Language setting. For supported languages, see the |
diarization_mode | string | No | The supported aggregation dimensions for speaker diarization results are |
Request example:
curl -X POST "http://addr:8100/bailian/funasr/asr" \
-H "Authorization: Bearer {api-key}" \
-H "Content-Type: application/json" \
-d '{
"source_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"output_type": "json",
"diarization_mode": "sentence",
"model_name": "fun-asr-mtl",
"lang": "zh"
}'
Response example:
{
"code": 200,
"message": "success",
"result": {
"file_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"properties": {
"audio_format": "pcm_s16le",
"channels": [0],
"original_sampling_rate": 16000,
"original_duration_in_milliseconds": 3834
},
"transcripts": [
{
"channel_id": 0,
"speakers": [
{
"speaker_id": 0,
"sentences": [
{
"sentence_id": 1,
"start_time": 100,
"end_time": 3300,
"text": "Hello world, this is the Alibaba Speech Lab."
}
]
}
]
}
]
}
}
SQL
AnalyticDB for MySQL includes AI functions that let you perform voiceprint retrieval directly in SQL.
ai_audio_embed
Generates a voiceprint embedding vector for an audio file. For more information, see ai_audio_embed.
ai_audio_embed(text)
ai_audio_embed(model_name, text)
ai_audio_embed(model_name, text, options)
ai_audio_transcribe
Transcribes an audio file into text. For more information, see ai_audio_transcribe.
ai_audio_transcribe(url)
ai_audio_transcribe(model_name, url)
ai_audio_transcribe(model_name, url, options)
Examples
Create a voiceprint library table
CREATE DATABASE ai;
CREATE TABLE IF NOT EXISTS
ai.voiceTest (
id bigint NOT NULL AUTO_INCREMENT,
name varchar NOT NULL,
voiceprint_feature ARRAY<float>(512) ENCODE = 'no' COMPRESSION = 'no',
ANN INDEX idx_voiceprint_feature (voiceprint_feature),
PRIMARY KEY (id)
) INDEX_ALL = 'Y' STORAGE_POLICY = 'HOT' ENGINE = 'XUANWU'
TABLE_PROPERTIES = '{"format":"columnstore"}'
DISTRIBUTE BY HASH (id);
Enroll a voiceprint
INSERT INTO ai.voiceTest (name, voiceprint_feature)
SELECT '{name}', ai_audio_embed('{audio_file}');
Query a voiceprint
SELECT id, name FROM ai.voiceTest WHERE name = '{name}';
Delete a voiceprint
DELETE FROM ai.voiceTest WHERE name = '{name}';
Perform voiceprint retrieval
SELECT name, similarity
FROM (
SELECT
name,
cosine_similarity(
voiceprint_feature,
ai_audio_embed('audio_embedding', '{audio_file}',
'{''start_time'':1000, ''end_time'':3000}')
) AS similarity
FROM ai.voiceTest
) t
ORDER BY similarity DESC
LIMIT 3;
Use case: Driver monitoring and content detection
Background
A ride-hailing company wanted to use voice recognition technology to analyze in-vehicle audio recordings. Their goal was to accurately extract the driver's voice from multi-person conversations and identify any policy violations in the driver's speech.
Using the voiceprint recognition solution from AnalyticDB for MySQL, the company built an end-to-end system. This system covers key steps such as speaker diarization, noise reduction, Automatic Speech Recognition (ASR), automatic voiceprint library construction, voiceprint retrieval, and text content quality inspection.
Solution workflow
Audio enhancement: Preprocesses the raw audio to reduce background noise and enhance human voices.
Speaker diarization: Uses speaker recognition technology to separate voices and attribute each segment to its speaker.
Audio segmentation: Splits the original audio into independent clips for each speaker, based on the speaker diarization results, so that each clip can be processed and analyzed separately.
Voiceprint recognition and speech-to-text: Extracts spoken content from each audio segment using voiceprint recognition and ASR.
Voiceprint retrieval: Matches audio segments to a driver's identity by searching the voiceprint library.
Content quality inspection: Integrates the speaker's identity with the transcribed text, and then uses a large language model (LLM) to analyze the content for policy violations.
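The last steps of the workflow can be sketched as plain data handling. The example assumes an input shaped like the `result` object in the /bailian/funasr/asr response shown earlier, and a speaker-to-name mapping that would come from voiceprint retrieval. All function names are hypothetical, and the keyword match is only a stand-in for the LLM-based inspection.

```python
from typing import Dict, Iterable, List


def flatten_sentences(asr_result: dict) -> List[dict]:
    """Flatten a diarized ASR result into one row per sentence, by time."""
    rows = []
    for transcript in asr_result.get("transcripts", []):
        for speaker in transcript.get("speakers", []):
            for sentence in speaker.get("sentences", []):
                rows.append({
                    "speaker_id": speaker["speaker_id"],
                    "start_time": sentence["start_time"],
                    "end_time": sentence["end_time"],
                    "text": sentence["text"],
                })
    return sorted(rows, key=lambda row: row["start_time"])


def driver_sentences(rows: Iterable[dict], speaker_names: Dict[int, str],
                     driver_name: str) -> List[dict]:
    """Keep only sentences whose speaker was matched to the driver.

    speaker_names maps a diarization speaker_id to the name returned by
    voiceprint retrieval for that speaker's audio clip.
    """
    return [r for r in rows if speaker_names.get(r["speaker_id"]) == driver_name]


def flag_violations(rows: Iterable[dict], banned_terms: Iterable[str]) -> List[dict]:
    """Placeholder check: flag sentences containing any banned term.

    A real system would send the text to an LLM instead of matching keywords.
    """
    terms = [term.lower() for term in banned_terms]
    return [r for r in rows if any(t in r["text"].lower() for t in terms)]
```

Keeping the start and end times on each row lets a reviewer jump back to the exact audio clip behind a flagged sentence.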