Here’s a comprehensive guide on Speech Recognition in Python, covering each step in detail.
Speech Recognition in Python
Speech recognition, also known as automatic speech recognition (ASR), is the process of converting spoken language into text. In Python, we can do this with the SpeechRecognition library, which wraps engines such as CMU Sphinx and the Google Web Speech API, or with deep learning-based models.
1. Installing Required Libraries
Before we start, we need to install some essential libraries:
pip install SpeechRecognition
pip install pyaudio
pip install numpy
pip install pocketsphinx
- SpeechRecognition: A popular library for handling speech recognition.
- PyAudio: Allows access to the microphone.
- NumPy: Handles raw audio samples as arrays; needed only when you process the audio data yourself.
- PocketSphinx: An offline speech recognition engine.
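Before moving on, it can help to confirm that the packages are importable. The snippet below is a minimal sketch using only the standard library; the `MODULES` table and the `check_installed` helper are names introduced here for illustration. Note that the import name differs from the pip name for SpeechRecognition (`speech_recognition`) and PyAudio (`pyaudio`).

```python
import importlib.util

# pip package name -> module name used in "import ..."
MODULES = {
    "SpeechRecognition": "speech_recognition",
    "PyAudio": "pyaudio",
    "NumPy": "numpy",
    "PocketSphinx": "pocketsphinx",
}


def check_installed(modules=MODULES):
    """Map each package name to True if its module can be found."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


for package, found in check_installed().items():
    print(f"{package}: {'found' if found else 'missing'}")
```

Any package reported as missing can be installed with the matching pip command above.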
2. Understanding Speech Recognition Libraries in Python
There are multiple engines available for speech recognition:
- Google Web Speech API – High accuracy but requires an internet connection.
- CMU Sphinx (Pocketsphinx) – Works offline but has lower accuracy.
- IBM Speech to Text – Requires IBM Watson API credentials.
- Microsoft Azure Speech – Cloud-based API with good accuracy.
- Amazon Transcribe – AWS-based speech recognition service.
- Deep Learning models – Custom models trained using TensorFlow, PyTorch.
3. Implementing Speech Recognition Using SpeechRecognition Library
The SpeechRecognition library provides an easy interface to recognize speech from a microphone or an audio file.
Step 1: Importing Required Libraries
import speech_recognition as sr
Step 2: Initializing the Recognizer
recognizer = sr.Recognizer()
Step 3: Capturing Audio from Microphone
with sr.Microphone() as source:
    print("Listening...")
    recognizer.adjust_for_ambient_noise(source)  # Calibrate for background noise
    audio = recognizer.listen(source)  # Capture audio until the speaker pauses
Step 4: Recognizing Speech
try:
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError:
    print("API unavailable")
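Explanation:
- recognizer.adjust_for_ambient_noise(source): Samples the background noise (one second by default) and calibrates the energy threshold accordingly. It does not filter noise out of the recording.
- recognizer.listen(source): Captures audio from the microphone until a pause is detected.
- recognizer.recognize_google(audio): Sends the audio to the Google Web Speech API and returns the transcribed text.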
4. Speech Recognition from an Audio File
We can also recognize speech from a pre-recorded WAV file.
Step 1: Load the Audio File
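audio_file = "sample.wav"
with sr.AudioFile(audio_file) as source:
    # adjust_for_ambient_noise() would consume the first second of the file,
    # so call it here only if the recording starts with silence.
    audio = recognizer.record(source)  # Read the entire file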
Step 2: Recognize Speech from the Audio File
try:
    text = recognizer.recognize_google(audio)
    print("Transcription:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError:
    print("Could not request results from the service")
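If loading a file fails, the cause is often the file format rather than the recognizer: sr.AudioFile reads PCM WAV, AIFF, and FLAC files, not compressed formats such as MP3. As a quick sanity check, the sketch below uses the standard-library `wave` module to report a WAV file's basic properties; `describe_wav` is a helper name introduced here for illustration.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file.

    wave.open() raises wave.Error if the file is not a WAV file it can read.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

Calling `describe_wav("sample.wav")` on a typical speech recording would show one channel, a sample width of 2 bytes (16-bit audio), and a sample rate such as 16000 or 44100 Hz.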
5. Using Different Speech Recognition Engines
The recognize_google method can be replaced with other services:
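# Argument names differ between SpeechRecognition releases; check the version you have installed.
recognizer.recognize_sphinx(audio)  # Offline recognition using CMU Sphinx
recognizer.recognize_ibm(audio, key="IBM_API_KEY")  # Older releases take username= and password=
recognizer.recognize_azure(audio, key="AZURE_KEY", location="REGION")
recognizer.recognize_amazon(audio, access_key_id="AWS_KEY", secret_access_key="AWS_SECRET", region="REGION")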
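Because every engine is exposed as a method on the same recognizer object, switching engines can be reduced to a lookup. The sketch below tries an online engine first and falls back to the offline one; `ENGINE_METHODS` and `transcribe_with_fallback` are names introduced here, and only the two engines that need no credentials are listed. It catches `Exception` broadly to stay free of imports; in real code you would likely narrow this to sr.UnknownValueError and sr.RequestError.

```python
# Engine label -> method name on sr.Recognizer (credential-free engines only).
ENGINE_METHODS = {
    "google": "recognize_google",
    "sphinx": "recognize_sphinx",
}


def transcribe_with_fallback(recognizer, audio, engines=("google", "sphinx")):
    """Try each engine in order and return (engine, text) for the first success."""
    errors = {}
    for engine in engines:
        method = getattr(recognizer, ENGINE_METHODS[engine])
        try:
            return engine, method(audio)
        except Exception as exc:  # broad on purpose in this sketch
            errors[engine] = exc
    raise RuntimeError(f"All engines failed: {errors}")
```

With a real recognizer, `transcribe_with_fallback(recognizer, audio)` returns a pair such as the engine label and the transcribed text, so the caller can tell which engine produced the result.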
6. Handling Noisy Environments
In a noisy environment, background noise can interfere with recognition. To handle this:
recognizer.energy_threshold = 4000  # Minimum audio energy treated as speech (default is 300)
recognizer.dynamic_energy_threshold = True  # Let the threshold adapt to ambient noise over time
recognizer.pause_threshold = 0.8  # Seconds of silence that mark the end of a phrase
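To build intuition for what energy_threshold means, the sketch below computes the root-mean-square (RMS) energy of a chunk of 16-bit audio and compares it with a threshold. This illustrates the idea only and is not the library's own code; the helpers `rms_energy`, `is_speech`, and `tone` are introduced here, and the sine tones stand in for microphone input.

```python
import array
import math


def rms_energy(pcm_bytes):
    """RMS amplitude of 16-bit signed PCM samples in native byte order."""
    samples = array.array("h", pcm_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(pcm_bytes, energy_threshold=4000):
    """Treat a chunk as speech when its energy exceeds the threshold."""
    return rms_energy(pcm_bytes) > energy_threshold


def tone(amplitude, n=1600, freq=440, rate=16000):
    """Synthetic sine tone as 16-bit PCM bytes (a stand-in for real audio)."""
    return array.array(
        "h",
        (int(amplitude * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)),
    ).tobytes()


quiet = tone(500)     # RMS of about 354, below the threshold
loud = tone(20000)    # RMS of about 14142, above the threshold
print(is_speech(quiet), is_speech(loud))  # False True
```

A threshold set too low lets background noise count as speech, while one set too high causes quiet speakers to be ignored, which is why calibrating against the actual environment is usually preferable to a fixed number.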
7. Using Deep Learning Models for Speech Recognition
For advanced applications, deep learning-based models such as DeepSpeech, Wav2Vec2, and Whisper can be used.
Using DeepSpeech (Mozilla)
pip install deepspeech
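import wave
import numpy as np
import deepspeech
model_file = "deepspeech-0.9.3-models.pbmm"
model = deepspeech.Model(model_file)
audio_file = "sample.wav"
# stt() expects 16-bit mono samples at the model's sample rate (16 kHz), not a file path
with wave.open(audio_file, "rb") as wav:
    samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
# Convert audio to text
transcription = model.stt(samples)
print("Transcription:", transcription)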
Using OpenAI Whisper
pip install openai-whisper
import whisper
model = whisper.load_model("base")
result = model.transcribe("sample.wav")
print(result["text"])
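Besides "text", the dictionary returned by model.transcribe also carries a "segments" list whose entries include "start" and "end" times in seconds and the segment's "text". The sketch below turns that list into timestamped lines; `format_timestamp` and `format_segments` are helper names introduced here, and `example_result` is a hand-written stand-in with made-up values, not real Whisper output.

```python
def format_timestamp(seconds):
    """Render a time in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result):
    """Return one 'start -> end text' line per segment in a transcribe() result."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result.get("segments", [])
    ]


# Hand-written stand-in for a transcribe() result; the values are invented.
example_result = {
    "text": " Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.0, "text": " How are you?"},
    ],
}

for line in format_segments(example_result):
    print(line)
```

In practice you would pass the `result` from the Whisper example directly to `format_segments`, which is handy for subtitles or for locating a phrase in a long recording.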
8. Building a Simple Voice Assistant
We can integrate speech recognition into a voice assistant.
import speech_recognition as sr
import pyttsx3  # Install with: pip install pyttsx3
recognizer = sr.Recognizer()
engine = pyttsx3.init()
with sr.Microphone() as source:
    print("Speak something...")
    audio = recognizer.listen(source)
try:
    text = recognizer.recognize_google(audio)
    print("You said:", text)
    engine.say(f"You said: {text}")
    engine.runAndWait()
except sr.UnknownValueError:
    print("Sorry, I couldn't understand.")
except sr.RequestError as e:
    print("Speech service unavailable:", e)
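An assistant becomes more useful once the recognized text is mapped to actions instead of being echoed back. The sketch below shows one simple way to do that with plain string matching; `handle_command` and the commands it understands ("time", "repeat ...", "stop") are hypothetical examples, not part of any library.

```python
from datetime import datetime


def handle_command(text):
    """Map recognized text to a reply string, or None to end the session.

    The commands handled here are hypothetical examples.
    """
    command = text.lower().strip()
    if command in ("stop", "exit", "quit"):
        return None  # Signals the caller to leave its listening loop
    if command.startswith("repeat "):
        return text.strip()[len("repeat "):]
    if "time" in command:
        return "The time is " + datetime.now().strftime("%H:%M")
    return "Sorry, I don't know that command."
```

In the assistant above, the reply from `handle_command(text)` could be passed to engine.say() in place of the fixed "You said" message, with a None reply used to stop listening.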
9. Converting Text to Speech
We can use the pyttsx3 library for text-to-speech conversion.
import pyttsx3
engine = pyttsx3.init()
engine.say("Hello! How are you?")
engine.runAndWait()
10. Applications of Speech Recognition
Speech recognition is widely used in:
- Voice Assistants (Google Assistant, Siri, Alexa)
- Transcription Services (Google Docs Voice Typing)
- Automated Customer Support (Chatbots, IVR systems)
- Accessibility for Disabled Users (Speech-to-text for hearing-impaired individuals)
- Command Control Applications (Smart Home, IoT)