As members of the SVDS research team, we regularly work with a range of speech recognition technologies and have watched the field evolve rapidly in recent years. Until a few years ago, most state-of-the-art speech technology was phonetic-based, built from a pronunciation model, an acoustic model, and a language model. In general, these components are centered on Hidden Markov Models (HMMs) and N-gram models. Going forward, we hope to build on these traditional models and explore newer systems such as Baidu's Deep Speech. Plenty of articles online explain and summarize these basic models, but few elaborate on the differences and trade-offs between the tools that implement them.
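To make the HMM side concrete, here is a minimal, illustrative Viterbi decoder in pure Python. The two states ("silence"/"speech") and all probabilities are invented for this toy; a real recognizer decodes over thousands of context-dependent states with GMM or DNN emission scores.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # V[t][s] = (best probability of ending in state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ("silence", "speech")
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {"silence": {"silence": 0.7, "speech": 0.3},
           "speech":  {"silence": 0.2, "speech": 0.8}}
emit_p = {"silence": {"quiet": 0.9, "loud": 0.1},
          "speech":  {"quiet": 0.2, "loud": 0.8}}

best = viterbi(("quiet", "loud", "loud"), states, start_p, trans_p, emit_p)
# best == ["silence", "speech", "speech"]
```

The same dynamic-programming recursion underlies decoding in all five toolkits discussed below, just at vastly larger scale.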
Therefore, we compared five speech recognition toolkits based on HMM and N-gram models: CMU Sphinx, Kaldi, HTK, Julius, and ISIP. All are leading projects in the open-source world. Unlike commercial speech recognition tools such as Dragon and Cortana, these open-source, free toolkits give developers greater freedom and lower development costs, which is why they have maintained strong vitality in the developer community.
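On the N-gram side, the core idea is just conditional word probabilities estimated from counts. A minimal bigram sketch (toy corpus, maximum-likelihood estimates, no smoothing, unlike the smoothed models these toolkits actually ship):

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram estimates P(w2 | w1) from a token list."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Every token except the last starts exactly one bigram.
    history_counts = Counter(tokens[:-1])
    return {(w1, w2): c / history_counts[w1]
            for (w1, w2), c in pair_counts.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_probs(tokens)
# "the" is followed once by "cat" and once by "mat", so each gets 0.5.
```

During decoding, these language-model probabilities are combined with the acoustic scores from the HMM to rank candidate transcriptions.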
A few caveats up front: most of the following analysis reflects our subjective experience, supplemented by other information found online. This is also not an exhaustive survey of open-source speech recognition tools; we compare only five of the more mainstream ones. Finally, HTK is not strictly open source: its code cannot be repackaged and redistributed, nor can it be used for commercial purposes.
For more information about speech recognition tools, see the following link, which lists almost all speech recognition tools, open source and otherwise.
Depending on your familiarity with different programming languages, you may prefer one tool over another. As shown in the figure above, of the five tools listed here, all support Python except ISIP, which supports only C++. Download links for the different language bindings can be found on each project's official website. Note, however, that the Python bindings may not cover a toolkit's full feature set; some features are available only through other languages. It is also worth noting that CMU Sphinx additionally supports Java, C, and several other languages.
The five projects listed here all originated in academic research.
As the name suggests, CMU Sphinx originated at Carnegie Mellon University, with a development history going back roughly 20 years. It is currently updated in parallel on GitHub and SourceForge. The C and Java versions on GitHub are each said to be maintained by a single administrator, while the SourceForge project has nine administrators and more than a dozen developers.
Kaldi grew out of a 2009 workshop. Its code is currently open source on GitHub, with 121 contributors in total.
HTK began at Cambridge University in 1989, was commercialized for a period, and has since returned to Cambridge. As mentioned above, HTK is not strictly open source, and updates are slow: although its latest version was released in December 2015, the previous release dated from 2009, a gap of about six years.
Julius began in 1997, and its last major release was in September 2016. Its GitHub repository is reportedly maintained by three administrators.
ISIP was the first state-of-the-art open-source speech recognition system, originating at Mississippi State University. It was developed mainly between 1996 and 1999; its last release came in 2011, and the project stopped updating before the GitHub era.
In this part, we examined the mailing lists and community discussions of the five tools above.
Discussion on the CMU Sphinx forum is lively, and replies come quickly. However, its SourceForge and GitHub presences contain many duplicate repositories. Kaldi users, by contrast, have more ways to interact: mailing lists, forums, and the GitHub repository. HTK has a mailing list but no public repository. The forum link on the Julius official site is currently unavailable; more detailed information may be found on its Japanese-language site. ISIP is aimed mainly at educational use, and its mailing list is currently unavailable.
Tutorials and examples:
The CMU Sphinx documentation is easy to read, with simple, clear explanations that stay close to hands-on use.
Kaldi's documentation coverage is also comprehensive, but in our view it is harder to follow. In addition, Kaldi covers both phonetic and deep-learning approaches to speech recognition.
If you are not familiar with speech recognition, you can get a general overview of the field from the official HTK documentation (available after registration). The HTK documentation is also useful for actual product design and usage scenarios.
Julius focuses on Japanese, and its most recent documentation is also in Japanese, but the team is actively working on an English release.
The following links provide some Julius-based speech recognition examples.
Last is ISIP: although it also has some documentation, it is not well organized.
Even if your main purpose in using these open-source tools is to learn how to train a professional speech recognition model, an out-of-the-box pre-trained model is still an advantage that cannot be ignored.
CMU Sphinx ships with many ready-to-use models, including English, French, Spanish, and Italian; see its documentation for details.
Kaldi's instructions for decoding with existing models are buried deep in its documentation and are not easy to find. We did, however, find a contributor-trained model based on the English VoxForge corpus in the egs/voxforge subdirectory, which can be run via a script in its online-data subdirectory. See the Kaldi repository for details.
We did not dig as deeply into model training for the other three packages, but each should include at least some simple, usable pre-trained models and be compatible with VoxForge (a very active crowdsourced speech recognition corpus and trained-model library).
In the future, we will publish more articles on applying CMU Sphinx in practice and on applying neural networks to speech recognition. Stay tuned.