
AnalyticDB: Voiceprint retrieval

Last Updated: Apr 01, 2026

This document introduces the voiceprint recognition solution based on AnalyticDB for MySQL and details a use case for detecting sensitive content in recordings of ride-hailing drivers. This solution is applicable to scenarios such as ride-hailing services, multi-person meetings, offline sales, AI voice recorders, and voice assistants. It identifies speakers and inspects spoken content to help enterprises efficiently build intelligent voiceprint retrieval systems.

Background

In the digital era, voice is a key biometric identifier for identity authentication, security control, and intelligent interaction. Voiceprint recognition technology extracts vocal features and converts them into structured vectors to efficiently verify and retrieve speakers.

AnalyticDB for MySQL provides an end-to-end voiceprint recognition solution based on its native vector storage and retrieval capabilities. It supports three core functions: voiceprint comparison, retrieval, and clustering. You can extend this solution with features like speaker diarization, speech-to-text, and content quality inspection to help you quickly build high-precision voiceprint retrieval systems.

Limitations

The voiceprint retrieval feature is currently in invitational preview. To use this feature, submit a ticket to technical support to enable it.

Features

Voiceprint comparison

This feature uses a built-in voiceprint model to extract voiceprint features from raw audio and convert them into structured vectors. By calculating the similarity between two voice vectors, it determines whether they belong to the same speaker, enabling 1:1 voiceprint identity verification.
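
The 1:1 decision described above boils down to a cosine similarity between two embedding vectors compared against a threshold. The sketch below illustrates the idea in plain Python; the 0.8 threshold is a hypothetical starting point, not a documented value, and should be tuned on your own data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(vec_a, vec_b, threshold=0.8):
    """1:1 verification: True if the two voiceprint vectors are similar
    enough to be attributed to the same speaker. The threshold is a
    hypothetical value for illustration."""
    return cosine_similarity(vec_a, vec_b) >= threshold
```

In practice, `vec_a` and `vec_b` would be the 512-dimensional vectors returned by the embedding API or the `ai_audio_embed` SQL function described later in this topic.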

Voiceprint retrieval

This feature uses voiceprint feature vectors and an efficient index to quickly retrieve a target speaker from an established voiceprint library. It supports 1:N voiceprint recognition scenarios and is ideal for efficient identity matching in large-scale voiceprint libraries.

Voiceprint clustering

This feature uses unsupervised learning techniques to analyze unlabeled audio data and automatically group it by speaker identity. It effectively handles multi-speaker audio scenarios to enable intelligent grouping and management of audio data.
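
To make the idea concrete, here is a minimal greedy clustering sketch over voiceprint vectors: each embedding joins the first cluster whose centroid is within a cosine-similarity threshold, or starts a new cluster. This is a simplistic stand-in for the built-in clustering feature, and the 0.8 threshold is a hypothetical value.

```python
import math

def _cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def cluster_voiceprints(vectors, threshold=0.8):
    """Greedy single-pass clustering of voiceprint embeddings.

    Returns a list of clusters, each a list of indices into `vectors`.
    A simplistic illustration of unsupervised speaker grouping, not the
    algorithm used by the managed feature."""
    clusters = []  # each: {"centroid": [...], "members": [indices]}
    for i, v in enumerate(vectors):
        for c in clusters:
            if _cos(c["centroid"], v) >= threshold:
                c["members"].append(i)
                n = len(c["members"])
                # Update the centroid as a running mean of its members.
                c["centroid"] = [(cc * (n - 1) + vv) / n
                                 for cc, vv in zip(c["centroid"], v)]
                break
        else:
            clusters.append({"centroid": list(v), "members": [i]})
    return [c["members"] for c in clusters]
```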

Usage

API

Generate audio embedding - /audio/embedding

  • Method: POST

  • Function: Generates an embedding vector for an audio file.

Parameters:

Parameter | Type | Required | Description
input_audio | string | Yes | URL of the input audio.
oss_ak | string | No | OSS AccessKey ID.
oss_sk | string | No | OSS AccessKey Secret.
oss_token | string | No | OSS STS token.
start_time | float | No | Start timestamp.
end_time | float | No | End timestamp.

Request example:

curl -X POST "http://addr:8100/audio/embedding" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "input_audio": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "start_time": 1220,
    "end_time": 3200
  }'

Response example:

{
  "code": 200,
  "message": "success",
  "result": [1.861976, 0.151182, -0.888397, ...]
}

Enroll a voiceprint - /voice/enroll

  • Method: POST

  • Function: Enrolls a new voiceprint sample.

Parameters:

Parameter | Type | Required | Description
audio_url | string | Yes | URL of the input audio.
name | string | Yes | Name of the voiceprint.

Request example:

curl -X POST "http://addr:8100/voice/enroll" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "name": "test1"
  }'

Response example:

{"code": 200, "message": "Voiceprint enrollment successful", "result": true}

Query a voiceprint - /voice/query

  • Method: GET or POST

  • Function: Queries enrolled voiceprint records.

Parameters:

Parameter | Type | Required | Description
name | string | No | Name of the voiceprint.
id | integer | No | ID of the voiceprint.

Request example:

curl "http://addr:8100/voice/query" \
  -H "Authorization: Bearer {api-key}"

Response example:

{
  "code": 200,
  "message": "Found 1 voiceprint records",
  "result": [{"id": 1968033551534260224, "name": "test1", "location": null}]
}

Delete a voiceprint - /voice/delete

  • Method: DELETE or POST

  • Function: Deletes a specified voiceprint record.

Parameters:

Parameter | Type | Required | Description
name | string | Conditional | Name of the voiceprint. You must specify either this parameter or id.
id | integer | Conditional | ID of the voiceprint. You must specify either this parameter or name.

Request example:

curl -X POST "http://addr:8100/voice/delete" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test1"
  }'

Response example:

{"code": 200, "message": "Voiceprint deletion successful", "result": true}

Voiceprint retrieval - /voice/search

  • Method: POST

  • Function: Finds the best-matching voiceprint for a given audio file.

Parameters:

Parameter | Type | Required | Description
audio_url | string | Yes | URL of the audio file to search.
names | array | No | List of voiceprint names to include in the search.
top_k | integer | No | Number of top results to return. Default value: 1.

Request example:

curl -X POST "http://addr:8100/voice/search" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "top_k": 1
  }'

Response example:

{
  "code": 200,
  "message": "Found 1 matching voiceprints",
  "result": [{"name": "test1", "similarity": 0.99999994}]
}
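
A caller typically accepts a search hit only when its similarity clears a decision threshold. The helper below applies that rule to the "result" array of the response above; the 0.8 threshold is a hypothetical value for illustration, not a documented default.

```python
def best_match(search_result, threshold=0.8):
    """Pick the accepted identity from a /voice/search result list.

    `search_result` is the "result" array from the response. Returns the
    top name if its similarity clears the (hypothetical) threshold,
    otherwise None."""
    if not search_result:
        return None
    top = max(search_result, key=lambda r: r["similarity"])
    return top["name"] if top["similarity"] >= threshold else None
```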

ASR (speech-to-text) - /bailian/funasr/asr

  • Method: POST

  • Function: Transcribes audio to text using Alibaba Cloud Model Studio FunASR.

Parameters:

Parameter | Type | Required | Description
source_url | string | Yes | Path of the audio file in OSS.
model_name | string | No | The model name. Default value: fun-asr. For a list of available models, see the Alibaba Cloud Model Studio FunASR model list.
diarization | bool | No | Specifies whether to enable speaker diarization. Default value: true.
speaker_count | int | No | Number of speakers. If this parameter is not specified, the system automatically detects the number of speakers.
output_type | string | No | The output format. url returns a temporary URL for the result file; json returns structured data. Default value: url.
lang | string | No | Language setting. For supported languages, see the language_hints parameter in the Alibaba Cloud Model Studio FunASR request parameters documentation.
diarization_mode | string | No | Aggregation granularity for speaker diarization results. Valid values: word, sentence, and speaker.

Request example:

curl -X POST "http://addr:8100/bailian/funasr/asr" \
  -H "Authorization: Bearer {api-key}" \
  -H "Content-Type: application/json" \
  -d '{
    "source_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "output_type": "json",
    "diarization_mode": "sentence",
    "model_name": "fun-asr-mtl",
    "lang": "zh"
  }'

Response example:

{
  "code": 200,
  "message": "success",
  "result": {
    "file_url": "https://dashscope.oss-cn-beijing-internal.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties": {
      "audio_format": "pcm_s16le",
      "channels": [0],
      "original_sampling_rate": 16000,
      "original_duration_in_milliseconds": 3834
    },
    "transcripts": [
      {
        "channel_id": 0,
        "speakers": [
          {
            "speaker_id": 0,
            "sentences": [
              {
                "sentence_id": 1,
                "start_time": 100,
                "end_time": 3300,
                "text": "Hello world, this is the Alibaba Speech Lab."
              }
            ]
          }
        ]
      }
    ]
  }
}
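
The nested transcript JSON above is usually flattened before downstream inspection. The sketch below collects each speaker's sentences from a response "result" object of the shape shown:

```python
def speaker_texts(asr_result):
    """Flatten a FunASR-style transcript (the "result" object above)
    into {speaker_id: [sentence_text, ...]}."""
    out = {}
    for transcript in asr_result.get("transcripts", []):
        for speaker in transcript.get("speakers", []):
            sid = speaker["speaker_id"]
            for sentence in speaker.get("sentences", []):
                out.setdefault(sid, []).append(sentence["text"])
    return out
```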

SQL

AnalyticDB for MySQL includes AI functions that let you perform voiceprint retrieval directly in SQL.

ai_audio_embed

Generates a voiceprint embedding vector for an audio file. For more information, see ai_audio_embed.

ai_audio_embed(text)
ai_audio_embed(model_name, text)
ai_audio_embed(model_name, text, options)

ai_audio_transcribe

Transcribes an audio file into text. For more information, see ai_audio_transcribe.

ai_audio_transcribe(url)
ai_audio_transcribe(model_name, url)
ai_audio_transcribe(model_name, url, options)

Examples

Create a voiceprint library table

CREATE DATABASE ai;

CREATE TABLE IF NOT EXISTS
  ai.voiceTest (
    id bigint NOT NULL AUTO_INCREMENT,
    name varchar NOT NULL,
    voiceprint_feature ARRAY<float>(512) ENCODE = 'no' COMPRESSION = 'no',
    ANN INDEX idx_voiceprint_feature (voiceprint_feature),
    PRIMARY KEY (id)
  ) INDEX_ALL = 'Y' STORAGE_POLICY = 'HOT' ENGINE = 'XUANWU'
    TABLE_PROPERTIES = '{"format":"columnstore"}'
    DISTRIBUTE BY HASH (id);

Enroll a voiceprint

INSERT INTO ai.voiceTest (name, voiceprint_feature)
SELECT '{name}', ai_audio_embed('{audio_file}');

Query a voiceprint

SELECT id, name FROM ai.voiceTest WHERE name = '{name}';

Delete a voiceprint

DELETE FROM ai.voiceTest WHERE name = '{name}';

Perform voiceprint retrieval

SELECT name, similarity
FROM (
    SELECT
        name,
        cosine_similarity(
            voiceprint_feature,
            ai_audio_embed('audio_embedding', '{audio_file}',
              '{''start_time'':1000, ''end_time'':3000}')
        ) AS similarity
    FROM ai.voiceTest
) t
ORDER BY similarity DESC
LIMIT 3;

Use case: Driver monitoring and content detection

Background

A ride-hailing company wanted to use voice recognition technology to analyze in-vehicle audio recordings. The goal was to accurately extract the driver's voice from multi-person conversations and identify any policy violations in the driver's speech.

Using the voiceprint recognition solution from AnalyticDB for MySQL, the company built an end-to-end system. This system covers key steps such as speaker diarization, noise reduction, Automatic Speech Recognition (ASR), automatic voiceprint library construction, voiceprint retrieval, and text content quality inspection.

Solution workflow

  1. Audio enhancement: Preprocesses the raw audio to reduce background noise and enhance human voices.

  2. Speaker diarization: Uses speaker recognition technology to separate voices and attribute each segment to its speaker.

  3. Audio segmentation: Splits the original audio into independent clips for each speaker based on the diarization results, which facilitates separate processing and analysis.

  4. Voiceprint recognition and speech-to-text: Extracts voiceprint features from each audio segment and transcribes the spoken content using ASR.

  5. Voiceprint retrieval: Matches each audio segment to a driver's identity by searching the voiceprint library.

  6. Content quality inspection: Combines the speaker's identity with the transcribed text, and then uses a large language model (LLM) to analyze the content for policy violations.
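
The final inspection step can be sketched as follows. This toy version screens the identified driver's sentences against a banned-phrase list; `speaker_texts`, `driver_speaker_id`, and `banned_phrases` are illustrative names, and a production system would call an LLM for the content analysis rather than keyword matching.

```python
def inspect_driver_speech(speaker_texts, driver_speaker_id, banned_phrases):
    """Return the driver's sentences that contain any banned phrase.

    `speaker_texts` maps speaker IDs to lists of transcribed sentences
    (for example, the output of an ASR + diarization step). A keyword
    stand-in for LLM-based content quality inspection."""
    hits = []
    for text in speaker_texts.get(driver_speaker_id, []):
        if any(phrase in text for phrase in banned_phrases):
            hits.append(text)
    return hits
```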