ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge


  • Call for Participation
  • Recent developments in speech signal processing, such as speech recognition and speaker diarization, have inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for these technologies, because it combines free speaking styles with complex acoustic conditions such as overlapping speech, an unknown number of speakers, far-field signals in large conference rooms, noise and reverberation.

    However, the lack of large public real meeting data has been a major obstacle to the advancement of the field. Since meeting transcription involves numerous related processing components, more information has to be carefully collected and labelled, such as speaker identity, speech context and onset/offset time. All of this information requires precise and accurate annotation, which is expensive and time-consuming. Although several relevant datasets have been released, most of them suffer from various limitations, ranging from corpus setup (corpus size, number of speakers, variety of spatial locations relative to the microphone arrays, collection conditions, etc.) to corpus content (recording quality, accented speech, speaking style, etc.). Moreover, almost all publicly available meeting corpora are collected in English, and the differences among languages limit the development of Mandarin meeting transcription.

    Therefore, we release the AliMeeting corpus, which consists of 120 hours of real recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant’s headset microphone. Moreover, we will launch the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) as an ICASSP 2022 Signal Processing Grand Challenge. The challenge consists of two tracks, namely speaker diarization and multi-speaker ASR. We provide a detailed introduction of the dataset, rules, evaluation methods and baseline systems, aiming to further promote reproducible research in this field. For details, please also refer to the paper we have already published: M2MET paper

    We will provide the code of the baseline systems for speech recognition and speaker diarization in the conference scenario as a reference. The goal is to simplify the training and evaluation procedures, so that participants can easily and flexibly experiment with and verify neural network-based methods. GitHub link

    All teams need to submit a system description paper along with the submitted results on the final test set. The organizer will select papers with high system ranking and technical quality and include them in the ICASSP2022 Proceedings.

  • Dataset
  • AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train and Eval sets contain 212 and 8 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train and Eval sets is 456 and 25, respectively, with balanced gender coverage. The Train and Eval sets will be released to the participants at the beginning of the challenge, while the Test set audio will be released at the final challenge scoring stage.

    The dataset is collected in 13 meeting venues, which are categorized into three types: small, medium, and large rooms with sizes ranging from 8 m² to 55 m². Different rooms give us a variety of acoustic properties and layouts. The detailed parameters of each meeting venue will be released together with the Train data. The wall materials of the meeting venues include cement, glass, etc. Other furnishings in the meeting venues include sofa, TV, blackboard, fan, air conditioner, plants, etc. During recording, the participants sit around the microphone array, which is placed on the table, and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise, including but not limited to clicking, keyboard, door opening/closing, fan and bubble noise, occur naturally. For both the Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval sets. An example of the recording venue from the Train set is shown in Fig. 1.

    The number of participants within one meeting session ranges from 2 to 4. To ensure the coverage of different overlap ratios, we select various meeting topics during recording, including medical treatment, education, business, organization management, industrial production and other daily routine meetings. The average speech overlap ratios of the Train and Eval sets are 42.27 % and 34.76 %, respectively. More details of AliMeeting are shown in Table 1. A detailed overlap ratio distribution of meeting sessions with different numbers of speakers in the Train and Eval sets is shown in Table 2.

    We also record the near-field signal of each participant using a headset microphone and ensure that only the participant’s own speech is recorded and transcribed. It is worth noting that the far-field audio recorded by the microphone array and the near-field audio recorded by the headset microphone are synchronized to a common timeline. All transcriptions of the speech data are prepared in TextGrid format for each session, which contains the session duration, speaker information (number of speakers, speaker ID, gender, etc.), the total number of segments for each speaker, the timestamp and transcription of each segment, etc.


  • Speaker Diarization (Track 1):

  • Speaker diarization, also known as speaker segmentation clustering, addresses the “who spoke when” question by logging speaker-specific speech events on multi-speaker audio data. The constrained data set by the organizer includes AliMeeting, AISHELL-4, and CN-Celeb. The Train and Eval data of AliMeeting provide not only the far-field audio of the meeting scene recorded by the microphone array with multiple speakers talking, but also the near-field audio recorded by each participant’s headset microphone. The Test set, which includes 10 hours of meeting data, will be released shortly (according to the timeline) for challenge scoring and ranking. For the Test set, the organizers will only provide the far-field audio recorded by the 8-channel microphone array and the corresponding sentence segmentation timestamps. Note that the organizers will not provide the headset near-field audio or the transcriptions. Participants need to determine the speaker at each time point, and an RTTM file needs to be provided for each session.
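
    As an illustration of the per-session RTTM output mentioned above, the following minimal sketch writes diarization segments as RTTM `SPEAKER` lines. The session name, speaker labels and the channel field value `1` are assumptions made for this example; the exact naming the organizers expect is not specified here.

```python
from typing import List, Tuple

# (start_sec, end_sec, speaker_label); the labels used below are hypothetical
Segment = Tuple[float, float, str]


def to_rttm(session_id: str, segments: List[Segment]) -> str:
    """Render segments as RTTM SPEAKER lines, one line per segment."""
    lines = []
    for start, end, speaker in sorted(segments):
        duration = end - start
        if duration <= 0:
            continue  # skip empty or inverted segments
        # Fields: type, file, channel, onset, duration, ortho, stype,
        # speaker name, confidence, lookahead (unused fields are <NA>).
        lines.append(
            f"SPEAKER {session_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines) + "\n"


# Two overlapping segments from two hypothetical speakers.
rttm_text = to_rttm("session_001", [(0.0, 2.5, "spk1"), (2.0, 4.25, "spk2")])
print(rttm_text, end="")
```

    Overlapping speech is expressed simply by segments of different speakers whose time ranges intersect, as in the two lines produced here.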

    The accuracy of the speaker diarization system in this track is measured by Diarization Error Rate (DER), calculated as the summed duration of three error types, speaker confusion (SC), false alarm (FA) and missed detection (MD), divided by the total duration: DER = (T_SC + T_FA + T_MD) / T_Total × 100%,
    where T_SC, T_FA and T_MD are the durations of the three errors, and T_Total is the total duration.
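
    The DER formula can be checked with a small worked example. The sketch below only applies the arithmetic of the formula; the durations are made-up numbers, and the function name is chosen for this illustration rather than taken from the official scoring tool.

```python
def diarization_error_rate(t_sc: float, t_fa: float, t_md: float,
                           t_total: float) -> float:
    """Return DER in percent from error durations given in seconds."""
    if t_total <= 0:
        raise ValueError("total duration must be positive")
    return (t_sc + t_fa + t_md) / t_total * 100.0


# Hypothetical session: 600 s scored, with 12 s of speaker confusion,
# 6 s of false alarm and 18 s of missed detection.
der = diarization_error_rate(t_sc=12.0, t_fa=6.0, t_md=18.0, t_total=600.0)
print(f"DER = {der:.2f}%")
```

    Here the three error durations sum to 36 s out of 600 s, so the DER is about 6%.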

    The Hungarian algorithm is adopted to establish time alignment between the hypothesis outputs and the reference transcript. In order to mitigate the effect of inconsistent annotations and human error in the reference transcript, the Rich Transcription 2006 evaluation sets a 0.25-second “no score” collar around every boundary of the reference segments. Since this evaluation scheme has been widely adopted in the literature, we follow this setup in the challenge.
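
    To make the matching step concrete, the sketch below pairs hypothesis speaker labels with reference speakers so that the total overlapping time is maximized. It is not the Hungarian algorithm itself: it uses brute-force search over permutations, which reaches the same optimum and is practical for the 2 to 4 speakers per session in this corpus. The segments and label names are hypothetical, and the collar is not modeled.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, speaker_label)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def best_speaker_mapping(ref: List[Segment], hyp: List[Segment]) -> Dict[str, str]:
    """Map hypothesis labels to reference labels, maximizing shared time."""
    ref_spk = sorted({s for _, _, s in ref})
    hyp_spk = sorted({s for _, _, s in hyp})
    # shared[(h, r)]: total time hypothesis speaker h overlaps reference speaker r
    shared = {(h, r): 0.0 for h in hyp_spk for r in ref_spk}
    for r_start, r_end, r in ref:
        for h_start, h_end, h in hyp:
            shared[(h, r)] += overlap(r_start, r_end, h_start, h_end)
    # Pad the shorter side with None so unequal speaker counts are handled.
    n = max(len(ref_spk), len(hyp_spk))
    ref_pad: List[Optional[str]] = ref_spk + [None] * (n - len(ref_spk))
    hyp_pad: List[Optional[str]] = hyp_spk + [None] * (n - len(hyp_spk))
    best_score, best_map = -1.0, {}
    for perm in permutations(ref_pad):
        pairs = [(h, r) for h, r in zip(hyp_pad, perm)
                 if h is not None and r is not None]
        score = sum(shared[p] for p in pairs)
        if score > best_score:
            best_score, best_map = score, dict(pairs)
    return best_map


reference = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hypothesis = [(0.0, 4.5, "x"), (4.5, 10.0, "y")]
mapping = best_speaker_mapping(reference, hypothesis)
print(mapping)
```

    With more speakers, a polynomial-time assignment solver would replace the permutation loop; the objective stays the same.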

  • Multi-Speaker ASR (Track 2):

  • The challenge of multi-speaker ASR is to handle overlapped speech and to recognize the content of multiple speakers. In Track 2, the organizer will only provide the Train and Eval data of AliMeeting and AISHELL-4 as constrained data. The data provided for the final test set (Test) is the same as in Track 1. Participants are required to transcribe each speaker, but are not required to identify a corresponding speaker for each transcript.

    The accuracy of the multi-speaker ASR system in Track 2 is measured by Character Error Rate (CER). The CER compares, for a given hypothesis output, the total number of characters, including spaces, to the minimum number of insertions (Ins), substitutions (Subs) and deletions (Del) of characters that are required to obtain the reference transcript. Specifically, CER is calculated by: CER = (N_Ins + N_Subs + N_Del) / N_Total × 100%, where N_Ins, N_Subs and N_Del are the numbers of characters for the three error types, and N_Total is the total number of characters.
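
    The CER definition corresponds to a character-level edit distance. The sketch below is a minimal illustration, assuming that N_Total is the number of characters in the reference transcript; the example sentences are invented and the function names are not taken from the official scoring script.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER in percent, normalized by the reference length (an assumption)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref) * 100.0


# One substituted character in an 8-character reference.
cer = character_error_rate("今天开会讨论预算", "今天开会讨论运算")
print(f"CER = {cer:.2f}%")
```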

    Considering the permutation invariant training (PIT) problem, we propose two schemes to calculate CER of the overlapping speech.

    First, we sort the reference labels according to the start time of each utterance and join the utterances with a separator token; this is called the utterance-based first-in first-out (FIFO) method.

    The second method is speaker-based: utterances from the same speaker are combined, and then we calculate all possible concatenation patterns.
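
    The two reference-building schemes can be sketched as follows. This is an illustration only: the separator string `<sep>` is a placeholder chosen for this example, the utterances are invented, and how the resulting patterns are scored against the hypothesis is not covered here.

```python
from itertools import permutations
from typing import Dict, List, Tuple

SEP = "<sep>"  # hypothetical separator token, chosen for this sketch

Utterance = Tuple[float, str, str]  # (start_sec, speaker_label, text)


def fifo_reference(utts: List[Utterance]) -> str:
    """Utterance-based FIFO: join all utterances in order of start time."""
    ordered = sorted(utts, key=lambda u: u[0])
    return SEP.join(text for _, _, text in ordered)


def speaker_based_references(utts: List[Utterance]) -> List[str]:
    """Speaker-based: merge each speaker's utterances, then list every
    order in which the per-speaker streams can be concatenated."""
    by_speaker: Dict[str, List[str]] = {}
    for _, speaker, text in sorted(utts, key=lambda u: u[0]):
        by_speaker.setdefault(speaker, []).append(text)
    streams = ["".join(texts) for texts in by_speaker.values()]
    return [SEP.join(order) for order in permutations(streams)]


utterances = [
    (0.0, "A", "你好"),
    (1.0, "B", "早上好"),
    (2.0, "A", "开始吧"),
]
fifo = fifo_reference(utterances)
patterns = speaker_based_references(utterances)
print(fifo)
print(patterns)
```

    The number of speaker-based patterns grows factorially with the number of speakers, which stays small (at most 24) for sessions of 2 to 4 participants.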

  • Sub-track Arrangement

  • For both tracks, we also set up two sub-tracks:

  • ● Sub-track I (Fixed Training Condition):

    Participants can only use the fixed constrained data to build both systems, and the usage of extra data is strictly prohibited. In other words, system building for Track 1 and Track 2 is restricted to AliMeeting, AISHELL-4 and CN-Celeb.

    ● Sub-track II (Open Training Condition):
    Besides the fixed constrained data, participants can use any publicly available, privately recorded, or manually simulated data for system building. However, the participants have to clearly list the data used in the final system description paper. If manually simulated data is used, please describe the data simulation scheme in detail.

  • Timeline (AoE Time)

  • • November 17, 2021: Registration deadline

    • November 19, 2021: Train and Eval data release

    • January 13, 2022: Test data release

    • January 17, 2022: Final results submission deadline

    • January 24, 2022: System description paper submission deadline

    • January 31, 2022: Evaluation result and ranking release

    • February 10, 2022: ICASSP2022 Grand Challenge paper acceptance

    • February 17, 2022: Camera-ready paper submission deadline

  • Prizes
  • The top three teams in Sub-track I of each track will be awarded prizes provided by Alibaba Group:

    • First Prize: 5000 USD

    • Second Prize: 2000 USD

    • Third Prize: 1000 USD

  • Guidelines
  • Potential participants from both academia and industry should send an email to m2met.alimeeting@gmail.com to register for the challenge on or before November 10, meeting the following requirements:
    1. Email subject: [ICASSP2022 M2MeT Challenge Registration] – Team Name - Participating track
    2. Provide team name, affiliation, participating track, team captain as well as members with contacts.

    The organizer will notify the qualified teams to join the challenge via email within 3 working days. The qualified teams must obey the challenge rules, which will be released on the challenge website.

  • Organizers
  • Lei Xie
    AISHELL Foundation, China

    Bin Ma
    Principal Engineer at Alibaba, Singapore

    DeLiang Wang
    Professor, Ohio State University, USA

    Zheng-Hua Tan
    Professor, Aalborg University, Denmark

    Kong Aik Lee
    Senior Scientist, Institute for Infocomm Research, A*STAR, Singapore

    Zhijie Yan
    Director of Speech Lab at Alibaba, China

    Yanmin Qian
    Associate Professor, Shanghai Jiao Tong University, China

    Hui Bu
    CEO, AIShell Inc., China

  • Rules

  • All participants should adhere to the following rules to be eligible for the challenge.

    1. Data augmentation is allowed on the original training dataset, including, but not limited to, adding noise or reverberation, speed perturbation and tone change.

    2. Any non-compliant use of the Test dataset is strictly prohibited, including but not limited to using the Test dataset to fine-tune or train the model.

    3. Multi-system fusion is allowed, but fusing systems with the same structure and different parameters is not encouraged.

    4. If the DER or CER of two systems on the Test dataset is the same, the system with lower computational complexity will be judged as the superior one.

    5. If forced alignment is used to obtain frame-level classification labels, the forced alignment model must be trained on the data allowed by the corresponding track.

    6. Shallow fusion is allowed for end-to-end approaches, e.g., LAS, RNNT and Transformer, but the training data of the shallow fusion language model can only come from the transcripts of the allowed training dataset.

    7. The right of final interpretation belongs to the organizer. In case of special circumstances, the organizer will coordinate the interpretation.