Voiceprint retrieval solution based on AnalyticDB for MySQL

This document introduces the voiceprint recognition solution based on AnalyticDB for MySQL and details a use case for detecting sensitive content in recordings of ride-hailing drivers. This solution is applicable to scenarios such as ride-hailing services, multi-person meetings, offline sales, AI voice recorders, and voice assistants. It identifies speakers and inspects spoken content to help enterprises efficiently build intelligent voiceprint retrieval systems.

Background

In the digital era, voice is a key biometric identifier for identity authentication, security control, and intelligent interaction. Voiceprint recognition technology extracts vocal features and converts them into structured vectors to efficiently verify and retrieve speakers.

AnalyticDB for MySQL provides an end-to-end voiceprint recognition solution based on its native vector storage and retrieval capabilities. It supports three core functions: voiceprint comparison, retrieval, and clustering. You can extend the solution with speaker diarization, speech-to-text, and content quality inspection to quickly build high-precision voiceprint retrieval systems.

Limitations

The voiceprint retrieval feature is currently in invitational preview. To use this feature, submit a ticket to technical support to enable it.

Features

Voiceprint comparison

This feature uses a built-in voiceprint model to extract voiceprint features from raw audio and convert them into structured vectors. By calculating the similarity between two voice vectors, it determines whether they belong to the same speaker, enabling 1:1 voiceprint identity verification.
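
As a minimal sketch of the 1:1 comparison step, the following Python code computes the cosine similarity between two embedding vectors and applies a decision threshold. The function names and the 0.7 threshold are illustrative assumptions, not values documented for the built-in model. Tune the threshold on your own data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_same_speaker(a: Sequence[float], b: Sequence[float],
                    threshold: float = 0.7) -> bool:
    """Decide whether two voiceprints match.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return cosine_similarity(a, b) >= threshold
```

A higher threshold reduces false accepts at the cost of more false rejects, so the value is usually chosen from a labeled validation set.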

Voiceprint retrieval

This feature uses voiceprint feature vectors and an efficient index to quickly retrieve a target speaker from an established voiceprint library. It supports 1:N voiceprint recognition scenarios and is ideal for efficient identity matching in large-scale voiceprint libraries.
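
To illustrate what a 1:N search returns, the following Python sketch ranks an in-memory voiceprint library against a query vector. It uses a brute-force scan in place of the vector index that AnalyticDB for MySQL provides, and the names `search_top_k` and `library` are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def search_top_k(query: Sequence[float],
                 library: Dict[str, Sequence[float]],
                 top_k: int = 1) -> List[Tuple[str, float]]:
    """Return the top_k (name, similarity) pairs, most similar first."""
    q = _normalize(query)
    scored = (
        (name, sum(x * y for x, y in zip(q, _normalize(vec))))
        for name, vec in library.items()
    )
    return heapq.nlargest(top_k, scored, key=lambda item: item[1])
```

A full scan costs time proportional to the library size, which is why large libraries rely on an index instead.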

Voiceprint clustering

This feature uses unsupervised learning techniques to analyze unlabeled audio data and automatically group it by speaker identity. It effectively handles multi-speaker audio scenarios to enable intelligent grouping and management of audio data.
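
The idea of grouping unlabeled embeddings by speaker can be sketched with a simple greedy procedure: each embedding joins the most similar existing cluster if the similarity reaches a threshold, and otherwise starts a new cluster. This is a stand-in chosen for brevity, not the algorithm used by the feature, and the 0.8 threshold is an assumption.

```python
import math
from typing import List, Sequence


def _cos(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cluster_by_speaker(embeddings: List[Sequence[float]],
                       threshold: float = 0.8) -> List[int]:
    """Greedy clustering: return one cluster label per embedding."""
    centroids: List[List[float]] = []  # running mean vector of each cluster
    counts: List[int] = []             # number of members in each cluster
    labels: List[int] = []
    for emb in embeddings:
        best, best_sim = -1, threshold
        for idx, centroid in enumerate(centroids):
            sim = _cos(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best == -1:
            # No cluster is similar enough, so start a new one.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the chosen cluster.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels
```

The result depends on input order, which is one reason production systems typically prefer more robust clustering methods.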

Usage

API

Generate audio embedding - `/audio/embedding`

Method: POST
Function: Generates an embedding vector for an audio file.

Parameters:

Parameter	Type	Required	Description
input_audio	string	Yes	URL of the input audio.
oss_ak	string	No	OSS AccessKey ID.
oss_sk	string	No	OSS AccessKey Secret.
oss_token	string	No	OSS STS token.
start_time	float	No	Start timestamp of the audio segment to embed.
end_time	float	No	End timestamp of the audio segment to embed.

Request example:

curl -X POST "http://addr:8100/audio/embedding" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "input_audio": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "start_time": 1220,
    "end_time": 3200
  }'

Response example:

{
  "code": 200,
  "message": "success",
  "result": [1.861976, 0.151182, -0.888397, ...]
}
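
The same call can be prepared from Python with the standard library. The following sketch builds the request and parses a response body of the shape shown above. The helper names are hypothetical, `addr` and the API key remain placeholders, and treating any `code` other than 200 as a failure is an assumption, because only the success response is documented here.

```python
import json
import urllib.request
from typing import List, Optional


def build_embedding_request(base_url: str, api_key: str, input_audio: str,
                            start_time: Optional[float] = None,
                            end_time: Optional[float] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for /audio/embedding."""
    payload = {"input_audio": input_audio}
    # Optional parameters are included only when the caller sets them.
    if start_time is not None:
        payload["start_time"] = start_time
    if end_time is not None:
        payload["end_time"] = end_time
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/embedding",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embedding_response(body: str) -> List[float]:
    """Return the embedding vector from a JSON response body."""
    doc = json.loads(body)
    if doc.get("code") != 200:  # assumption: non-200 codes indicate failure
        raise RuntimeError(f"embedding request failed: {doc.get('message')}")
    return doc["result"]
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the response body before parsing it.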

Enroll a voiceprint - `/voice/enroll`

Method: POST
Function: Enrolls a new voiceprint sample.

Parameters:

Parameter	Type	Required	Description
audio_url	string	Yes	URL of the input audio.
name	string	Yes	Name of the voiceprint.

Request example:

curl -X POST "http://addr:8100/voice/enroll" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "name": "test1"
  }'

Response example:

{"code": 200, "message": "Voiceprint enrollment successful", "result": true}

Query a voiceprint - `/voice/query`

Method: GET or POST
Function: Queries enrolled voiceprint records.

Parameters:

Parameter	Type	Required	Description
name	string	No	Name of the voiceprint.
id	integer	No	ID of the voiceprint.

Request example:

curl "http://addr:8100/voice/query" \
  -H "Authorization: Bearer {api-key}"

Response example:

{
  "code": 200,
  "message": "Found 1 voiceprint records",
  "result": [{"id": 1968033551534260224, "name": "test1", "location": null}]
}

Delete a voiceprint - `/voice/delete`

Method: DELETE or POST
Function: Deletes a specified voiceprint record.

Parameters:

Parameter	Type	Required	Description
name	string	Conditional	Name of the voiceprint. You must specify either this parameter or `id`.
id	integer	Conditional	ID of the voiceprint. You must specify either this parameter or `name`.

Request example:

curl -X POST "http://addr:8100/voice/delete" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test1"
  }'

Response example:

{"code": 200, "message": "Voiceprint deletion successful", "result": true}

Voiceprint retrieval - `/voice/search`

Method: POST
Function: Finds the enrolled voiceprints that best match a given audio file.

Parameters:

Parameter	Type	Required	Description
audio_url	string	Yes	URL of the audio file to search.
names	array	No	List of voiceprint names to include in the search.
top_k	integer	No	Number of top results to return. Default value: 1.

Request example:

curl -X POST "http://addr:8100/voice/search" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "top_k": 1
  }'

Response example:

{
  "code": 200,
  "message": "Found 1 matching voiceprints",
  "result": [{"name": "test1", "similarity": 0.99999994}]
}
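
Because the search returns the closest voiceprints even when no enrolled speaker is actually present, callers usually apply a minimum similarity before accepting a match. The following Python sketch does this for a response of the shape shown above. The function name and the 0.8 cutoff are assumptions for illustration.

```python
import json
from typing import Optional


def best_match(response_body: str, min_similarity: float = 0.8) -> Optional[str]:
    """Return the best-matching voiceprint name, or None if no hit is close enough.

    min_similarity is a hypothetical cutoff, not a documented default.
    """
    doc = json.loads(response_body)
    hits = doc.get("result") or []
    if not hits:
        return None
    top = max(hits, key=lambda hit: hit["similarity"])
    return top["name"] if top["similarity"] >= min_similarity else None
```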

ASR (speech-to-text) - `/bailian/funasr/asr`

Method: POST
Function: Transcribes audio to text using Alibaba Cloud Model Studio FunASR.

Parameters:

Parameter	Type	Required	Description
source_url	string	Yes	Path of the audio file in OSS.
model_name	string	No	The model name. The default value is `fun-asr`. For a list of available models, see the Alibaba Cloud Model Studio FunASR Model list.
diarization	bool	No	Specifies whether to enable speaker diarization. The default value is `true`.
speaker_count	int	No	Number of speakers. If this parameter is not specified, the system automatically detects the number of speakers.
output_type	string	No	The output format. `url` returns a temporary URL for the result file, and `json` returns structured data. The default value is `url`.
lang	string	No	Language setting. For supported languages, see the `language_hints` parameter in the Alibaba Cloud Model Studio FunASR request parameters documentation.
diarization_mode	string	No	Aggregation level for speaker diarization results. Valid values: `word`, `sentence`, and `speaker`.

Request example:

curl -X POST "http://addr:8100/bailian/funasr/asr" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "source_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "output_type": "json",
    "diarization_mode": "sentence",
    "model_name": "fun-asr-mtl",
    "lang": "zh"
  }'

Response example:

{
  "code": 200,
  "message": "success",
  "result": {
    "file_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties": {
      "audio_format": "pcm_s16le",
      "channels": [0],
      "original_sampling_rate": 16000,
      "original_duration_in_milliseconds": 3834
    },
    "transcripts": [
      {
        "channel_id": 0,
        "speakers": [
          {
            "speaker_id": 0,
            "sentences": [
              {
                "sentence_id": 1,
                "start_time": 100,
                "end_time": 3300,
                "text": "Hello world, this is the Alibaba Speech Lab."
              }
            ]
          }
        ]
      }
    ]
  }
}
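
For downstream processing, the nested `transcripts` structure is often easier to handle as a flat, time-ordered list. The following Python sketch flattens a `result` object of the shape shown above. It assumes the sentence-level layout from this example, and other `diarization_mode` values may produce a different structure. The function name is hypothetical.

```python
from typing import Any, Dict, List


def flatten_transcript(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten channels, speakers, and sentences into time-ordered rows."""
    rows: List[Dict[str, Any]] = []
    for channel in result.get("transcripts", []):
        for speaker in channel.get("speakers", []):
            for sentence in speaker.get("sentences", []):
                rows.append({
                    "channel_id": channel["channel_id"],
                    "speaker_id": speaker["speaker_id"],
                    "start_time": sentence["start_time"],
                    "end_time": sentence["end_time"],
                    "text": sentence["text"],
                })
    # Order by start time so the rows read like a conversation.
    rows.sort(key=lambda row: row["start_time"])
    return rows
```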

SQL

AnalyticDB for MySQL includes AI functions that let you perform voiceprint retrieval directly in SQL.

ai_audio_embed

Generates a voiceprint embedding vector for an audio file. For more information, see ai_audio_embed.

ai_audio_embed(text)
ai_audio_embed(model_name, text)
ai_audio_embed(model_name, text, options)

ai_audio_transcribe

Transcribes an audio file into text. For more information, see ai_audio_transcribe.

ai_audio_transcribe(url)
ai_audio_transcribe(model_name, url)
ai_audio_transcribe(model_name, url, options)

Examples

Create a voiceprint library table

CREATE DATABASE IF NOT EXISTS ai;

CREATE TABLE IF NOT EXISTS
  ai.voiceTest (
    id bigint NOT NULL AUTO_INCREMENT,
    name varchar NOT NULL,
    voiceprint_feature ARRAY<float>(512) ENCODE = 'no' COMPRESSION = 'no',
    ANN INDEX idx_voiceprint_feature (voiceprint_feature),
    PRIMARY KEY (id)
  ) INDEX_ALL = 'Y' STORAGE_POLICY = 'HOT' ENGINE = 'XUANWU'
    TABLE_PROPERTIES = '{"format":"columnstore"}'
    DISTRIBUTE BY HASH (id);

Enroll a voiceprint

INSERT INTO ai.voiceTest (name, voiceprint_feature)
SELECT '{name}', ai_audio_embed('{audio_file}');

Query a voiceprint

SELECT id, name FROM ai.voiceTest WHERE name = '{name}';

Delete a voiceprint

DELETE FROM ai.voiceTest WHERE name = '{name}';

Perform voiceprint retrieval

SELECT name, similarity
FROM (
    SELECT
        name,
        cosine_similarity(
            voiceprint_feature,
            ai_audio_embed('audio_embedding', '{audio_file}',
              '{''start_time'':1000, ''end_time'':3000}')
        ) AS similarity
    FROM ai.voiceTest
) t
ORDER BY similarity DESC
LIMIT 3;
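
When the retrieval query is issued from application code, the audio location should be passed as a bound parameter rather than pasted into the SQL text. The following Python sketch builds the statement and its parameters for a DB-API driver that uses `%s` placeholders. The function name is hypothetical, the sketch omits the optional model name and options arguments, and it assumes the `ai.voiceTest` table from the examples above.

```python
from typing import Tuple

# The audio location is a bound parameter (%s); only the validated
# integer top_k is formatted into the statement text.
RETRIEVAL_SQL = """
SELECT name, similarity
FROM (
    SELECT name,
           cosine_similarity(voiceprint_feature, ai_audio_embed(%s)) AS similarity
    FROM ai.voiceTest
) t
ORDER BY similarity DESC
LIMIT {top_k}
"""


def build_retrieval_query(audio_file: str, top_k: int = 3) -> Tuple[str, tuple]:
    """Return (sql, params) for a top-k voiceprint retrieval query."""
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return RETRIEVAL_SQL.format(top_k=top_k), (audio_file,)
```

With a driver such as PyMySQL, the returned pair would be passed to `cursor.execute(sql, params)`.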

Use case: Driver monitoring and content detection

Background

A ride-hailing company wanted to use voice recognition technology to analyze in-vehicle audio recordings. Their goal was to accurately extract the driver's voice from multi-person conversations and identify any policy violations in the driver's speech.

Using the voiceprint recognition solution from AnalyticDB for MySQL, the company built an end-to-end system. This system covers key steps such as speaker diarization, noise reduction, Automatic Speech Recognition (ASR), automatic voiceprint library construction, voiceprint retrieval, and text content quality inspection.

Solution workflow

Audio enhancement: Preprocesses the raw audio to reduce background noise and enhance human voices.
Speaker diarization: Separates the voices in the recording and attributes each segment to its speaker.
Audio segmentation: Splits the original audio into independent clips for each speaker, based on the speaker diarization results, so that each clip can be processed and analyzed separately.
Voiceprint recognition and speech-to-text: Extracts voiceprint features from each audio clip and transcribes its spoken content using ASR.
Voiceprint retrieval: Matches audio clips to a driver's identity by searching the voiceprint library.
Content quality inspection: Combines the speaker's identity with the transcribed text, and then uses a large language model (LLM) to analyze the content for policy violations.
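
The last two steps of the workflow can be sketched as a small Python pipeline that takes diarized, transcribed segments, keeps the ones attributed to an enrolled driver, and checks their text. All names here are hypothetical. The `identify` and `check_text` callables stand in for the voiceprint search and the LLM-based inspection, which this sketch does not implement.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Segment:
    speaker_id: int
    start_time: int
    end_time: int
    text: str


@dataclass
class Finding:
    driver_name: str
    segment: Segment
    reason: str


def inspect_recording(
    segments: List[Segment],
    identify: Callable[[Segment], Optional[str]],
    check_text: Callable[[str], Optional[str]],
) -> List[Finding]:
    """Return one finding per driver segment whose text is flagged."""
    findings: List[Finding] = []
    for seg in segments:
        driver = identify(seg)  # placeholder for voiceprint retrieval
        if driver is None:
            continue  # speaker is not an enrolled driver, for example a passenger
        reason = check_text(seg.text)  # placeholder for LLM-based inspection
        if reason is not None:
            findings.append(Finding(driver, seg, reason))
    return findings
```

Passing the two steps in as callables keeps the pipeline independent of any particular service, so each step can be replaced or tested separately.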
