Eyes, JAPAN Blog > An Introduction to Sphinx for Speech Recognition

An Introduction to Sphinx for Speech Recognition


Introduction to Sphinx

CMU Sphinx (Sphinx for short) is the general term for a series of speech recognition systems developed at Carnegie Mellon University. In 2000, Carnegie Mellon's Sphinx team open-sourced several speech recognizer components, including Sphinx 2 and, in 2001, Sphinx 3. The speech decoders come with acoustic models and sample applications. Available resources also include acoustic model training software, language model editing software, and the pronunciation dictionary cmudict.


CMU Sphinx is a leading speech recognition toolkit with a variety of tools for building speech applications. Because it contains several development packages for different tasks and applications, choosing among them can be confusing. The purpose of each package is:

Pocketsphinx — lightweight recognizer library written in C

Sphinxtrain — acoustic model training tools

Sphinxbase — support library required by Pocketsphinx and Sphinxtrain

Sphinx4 — adjustable, modifiable recognizer written in Java

A speech recognizer relies on three types of models: the acoustic model, the language model, and the phonetic dictionary.

The acoustic model contains acoustic properties for each sound unit. There are context-independent models (which describe the most likely feature vectors for each phoneme) and context-dependent models (built from phonemes in their surrounding context).
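As a toy sketch of the "most likely feature vector per phoneme" idea (this is not Sphinx's actual method; real decoders use HMMs over MFCC features with Gaussian mixtures), imagine scoring an incoming feature frame against a per-phoneme template vector. All phonemes and numbers below are made up for illustration:

```python
import math

# Hypothetical per-phoneme "most likely feature vectors" (toy 2-D features;
# real acoustic models use e.g. 39-dimensional MFCCs and statistical models).
acoustic_model = {
    "AH": (1.0, 0.5),
    "S":  (4.0, 3.0),
    "T":  (3.5, 0.2),
}

def closest_phoneme(frame):
    """Return the phoneme whose template vector is nearest to the frame."""
    def dist(p):
        mx, my = acoustic_model[p]
        return math.hypot(frame[0] - mx, frame[1] - my)
    return min(acoustic_model, key=dist)

print(closest_phoneme((3.9, 2.8)))  # a frame close to the "S" template
```

In a real system each phoneme is modeled by a probability distribution rather than a single vector, and sequences of frames are matched, not single frames.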

The language model is used to restrict the word search. It defines which words can follow previously recognized words (remember, matching is a sequential process) and significantly narrows the matching process by pruning improbable words. The most commonly used language models are n-gram models, which contain statistics of word sequences, and finite-state models, which define allowable word sequences through finite-state automata (sometimes with weights). To achieve high accuracy, your language model must constrain the search space effectively, which means it should predict the next word well. A language model usually restricts the vocabulary to the words it contains; this is a problem for recognizing names and other out-of-vocabulary words. To work around it, a language model can contain smaller units such as subwords or even phonemes. Note, however, that the search-space restriction is then usually weaker, and the corresponding recognition accuracy is lower than with a word-based language model.
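The way an n-gram model limits the search can be sketched with bigram counts over a tiny made-up corpus (real models are trained on large text collections and smoothed):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams: how often each word follows another.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def likely_next(word):
    """Candidate next words, most frequent first; a decoder would only
    explore these continuations instead of the whole vocabulary."""
    return [w for w, _ in bigrams[word].most_common()]

print(likely_next("the"))  # ['cat', 'mat'] since 'cat' follows 'the' twice
```

A word that never followed "the" in the corpus is never proposed, which is exactly the "stripping out impossible words" described above, and also why out-of-vocabulary words cannot be recognized.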

The phonetic dictionary contains the mapping from words to phonemes. The mapping is not exhaustive; for example, only two or three pronunciation variants per word are typically recorded. However, it is practical enough in most cases. Dictionaries are not the only way to map words to phones; you can also use pronunciation models learned by machine learning algorithms.
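A phonetic dictionary in cmudict style is just lines of word-to-phoneme mappings, with alternate pronunciations marked by a suffix like "(2)". A minimal parser might look like this (the entries below are simplified examples without cmudict's stress digits):

```python
# A few cmudict-style lines; "(2)" marks an alternate pronunciation.
raw = """\
HELLO HH AH L OW
HELLO(2) HH EH L OW
WORLD W ER L D
"""

lexicon = {}
for line in raw.splitlines():
    word, *phones = line.split()
    word = word.split("(")[0]          # strip the variant marker
    lexicon.setdefault(word, []).append(phones)

print(lexicon["HELLO"])  # both pronunciation variants of HELLO
```

This is the structure the decoder consults to expand each candidate word into the phoneme sequence scored by the acoustic model.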


A Quick Demo in Python

Install the SpeechRecognition module (note that it is imported as speech_recognition) along with PocketSphinx:

pip install SpeechRecognition --user

pip install PocketSphinx --user

(Example on macOS)

Python code

import speech_recognition as sr

# obtain audio from an audio file
r = sr.Recognizer()
harvard = sr.AudioFile(r"/Users/tang/Downloads/LibriSpeech/dev-clean/6295/64301/6295-64301-0000.flac")
with harvard as source:
    audio = r.record(source)

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said \n" + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))


Output:

Sphinx thinks you said

when winter evening as soon as his work was over for the day joseph lock the door of his smithee washed himself well put on clean clothes and taking his violin and set out for test bridge mary was expecting him to tea

