Python Simple Speech Recognition and Response

Keywords: Python Google pip JSON


https://www.cnblogs.com/warcraft/p/10112486.html


Download the latest PyAudio wheel here:

https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio


The motivation: a colleague of mine likes to punctuate his work with the phrase "take a look, you see". Yesterday, every time he said "take a look", I answered "don't worry". After many repetitions it occurred to me: why not have Python recognize his voice and respond for me? The only missing piece was speech recognition.

1. Speech-to-Text

Reference: The Ultimate Guide to Speech Recognition with Python. A gripe: the Chinese translation I read appears to be unreviewed machine translation. For example, "import speech_recognition as SR" should be "import speech_recognition as sr" (module names are case-sensitive), and "creating an instance of a different class" should be "creating an instance of the Recognizer class". Still, it is good enough to understand the API.

For the speech-to-text service I use Google's [Google Cloud Speech API](https://cloud.google.com/speech/), because no API key is required. More precisely, the library ships with a default key:

def recognize_google(self, audio_data, key=None, language="en-US", show_all=False):
    ...
    if key is None: key = "AIzaSyBOti4mM-6x9WDnZIjIeyEU21OpBXqWBgw"
    ...

Two other parameters of the function are worth noting: language (the recognition language) and show_all (False returns only the result with the highest confidence; True returns a dictionary of all recognition results).

Install it with pip install SpeechRecognition.

1.1 Local Voice File Recognition Test

# coding:utf-8

"""
Local voice file recognition test
"""
import speech_recognition as sr
import sys

say = 'look at it'
r = sr.Recognizer()

# Local voice testing
harvard = sr.AudioFile(sys.path[0]+'/youseesee.wav')
with harvard as source:
    # Denoising
    r.adjust_for_ambient_noise(source, duration=0.2)
    audio = r.record(source)

# speech recognition
test = r.recognize_google(audio, language="cmn-Hans-CN", show_all=True)
print(test)

# Analysis of Speech
flag = False
for t in test['alternative']:
    print(t)
    if say in t['transcript']:
        flag = True
        break
if flag:
    print('Bingo')

I recorded a two-second clip, youseesee.wav, in which I softly (almost whispering, with little vocal-cord vibration) say "you see, you see". The audio file format can be WAV, AIFF, or FLAC: an AudioFile instance is created from a WAV/AIFF/FLAC audio file.

The denoising function adjust_for_ambient_noise() samples a period of noise from the start of the audio (the duration parameter, default 1 s) to calibrate recognition. Because the recording is very short, only 0.2 s of noise is sampled here.

The language parameter of recognize_google() can be looked up in the Cloud Speech-to-Text API language support list; "Chinese, Mandarin (Simplified, China)" is cmn-Hans-CN.

In the example above, when show_all is False, the recognition result test is just the top transcript, "Look at it". When it is True, all candidate results are returned:

{
    'alternative':[
        {
            'transcript':'Hehe, look at it.',
            'confidence':0.87500638
        },
        {
            'transcript':'Ha ha you see'
        },
        {
            'transcript':'Brother, you see'
        },
        {
            'transcript':'Brother, look at it.'
        },
        {
            'transcript':'Ha ha, look at it.'
        }
    ],
    'final':True
}

To analyze the speech we simply check whether any recognition result contains the expected phrase; as soon as one alternative matches, the recognition is considered correct and Bingo is printed.

The complete output of the above example is as follows:

{'alternative': [{'transcript': 'Hehe, look at it.', 'confidence': 0.87500668}, {'transcript': 'Ha ha you see'}, {'transcript': 'Brother, you see'}, {'transcript': 'Brother, look at it.'}, {'transcript': 'Ha ha, look at it.'}], 'final': True}
{'transcript': 'Hehe, look at it.', 'confidence': 0.87500668}
Bingo
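The matching logic used above can be exercised offline against a result dictionary of the same shape (the sample data below is an abridged, hypothetical stand-in for a live API response):

```python
# Sample result shaped like recognize_google(..., show_all=True) output.
test = {
    'alternative': [
        {'transcript': 'Hehe, look at it.', 'confidence': 0.87500668},
        {'transcript': 'Ha ha you see'},
        {'transcript': 'Brother, you see'},
    ],
    'final': True,
}

say = 'look at it'  # the phrase we expect to hear

# Scan the alternatives, stopping at the first transcript containing the phrase.
match = next(
    (t['transcript'] for t in test['alternative'] if say in t['transcript']),
    None,
)
print('Bingo' if match else 'no match')  # Bingo
```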

Note: if an exception like the following occurs:

speech_recognition.RequestError
recognition connection failed: [WinError 10060] The connection attempt failed because the connected party did not properly respond after a period of time, or the host has failed to respond.

it is because no global proxy is configured (the Google service is unreachable directly).
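One workaround (my own assumption, not from the original post) is to point Python's HTTP stack at a proxy through the standard environment variables, which urllib, used internally by speech_recognition, honors by default; the proxy address below is a placeholder:

```python
import os
import urllib.request

# Hypothetical local proxy address; replace with your own.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:1080"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:1080"

# urllib picks these variables up automatically via getproxies().
proxies = urllib.request.getproxies()
print(proxies["https"])  # http://127.0.0.1:1080
```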

1.2 Real-time Speech Recognition Test

Change the audio source from the file above to a microphone: create a Microphone instance and record from it.

You need to install PyAudio. If pip install pyaudio fails, download a prebuilt wheel from Python Extension Packages (the link at the top) and install it.

# coding:utf-8

"""
Real-time speech recognition test
"""
import speech_recognition as sr
import logging
logging.basicConfig(level=logging.DEBUG)

while True:
    r = sr.Recognizer()
    # Microphone
    mic = sr.Microphone()

    logging.info('In the recording...')
    with mic as source:
        r.adjust_for_ambient_noise(source)
        audio = r.listen(source)
    logging.info('End of recording, in recognition...')
    test = r.recognize_google(audio, language='cmn-Hans-CN', show_all=True)
    print(test)
    logging.info('end')

The listen() function records from the source and stops automatically on silence; per the documentation, it records

until it encounters recognizer_instance.pause_threshold seconds of non-speaking audio or there is no more audio input.

Wait for the "In the recording..." message, speak, and the recording ends after a silence. I spoke twice in this experiment: first "look at it", then "look at it again". The output:

INFO:root:In the recording...
INFO:root:End of recording, in recognition...
{'alternative': [{'transcript': 'Look at it.', 'confidence': 0.97500247}], 'final': True}
INFO:root:end
INFO:root:In the recording...
INFO:root:End of recording, in recognition...
{'alternative': [{'transcript': 'Look at it again.', 'confidence': 0.91089392}, {'transcript': "You're looking at it."}, {'transcript': 'Guess what.'}, {'transcript': 'You dare to see it again'}, {'transcript': 'You are feeling'}], 'final': True}
INFO:root:end
INFO:root:In the recording...

The recognition rate is very high (I also tried Baidu's baidu-aip, but it failed to recognize my audio, so I stuck with Google). Speech-to-text is done.

2. Text-to-speech

Using the pyttsx module is very simple. Under Python 3, it is pyttsx3.

import pyttsx3
engine = pyttsx3.init()
engine.say("The wind is blowing, the rain is cloudy, the green stripes are soft and the flowers are heavy.")
engine.runAndWait()

It really is that simple: you will hear the text read aloud.

3. Identify and respond

Combining the above, we can recognize the voice and respond.

  • Speech-to-text recognition
  • Regex matching of the transcript to find the corresponding response text
  • Response (read the text aloud)
# coding:utf-8

"""
Speech recognition and response. Uses the Google speech service; no key is
needed (a default key is built in). https://github.com/Uberi/speech_recognition
"""
import speech_recognition as sr
import pyttsx3
import re
import logging
logging.basicConfig(level=logging.DEBUG)

resource = {
    r"(You can take a look.?){1}.*\1": "I don't look. Let me see you killed again.",
    r"(You can take a look.?)+": "Look at it. Don't worry.",
    r"(you.+what)+": "What's the matter?",
    r"(666|666)+": "What about Brother Panshi 666?",
    r"(Pan|stone|old|Younger brother)+": "666",
}

engine = pyttsx3.init()

while True:
    r = sr.Recognizer()
    # Microphone
    mic = sr.Microphone()

    logging.info('In the recording...')
    with mic as source:
        r.adjust_for_ambient_noise(source)
        audio = r.listen(source)
    logging.info('End of recording, in recognition...')
    test = r.recognize_google(audio, language='cmn-Hans-CN', show_all=True)

    # Analysis of Speech
    logging.info('Analysis of Speech')
    if test:
        flag = False
        message = ''
        for t in test['alternative']:
            logging.debug(t)
            for pattern, reply in resource.items():
                # Try each resource key (a regex) against this recognition
                # result; on a match, store the reply and stop.
                logging.info(pattern)
                if re.search(pattern, t['transcript']):
                    flag = True
                    message = reply
                    break
            if flag:
                break
        # Text-to-speech
        if message:
            logging.info('bingo....')
            logging.info('say: %s' % message)
            engine.say(message)
            engine.runAndWait()
            logging.info('ok')
    logging.info('end')

The corresponding resource text is

resource = {
    r"(You can take a look.?){1}.*\1": "I don't look. Let me see you killed again.",
    r"(You can take a look.?)+": "Look at it. Don't worry.",
    r"(you.+what)+": "What's the matter?",
    r"(666|666)+": "What about Brother Panshi 666?",
    r"(Pan|stone|old|Younger brother)+": "666",
}

Only regex is used for matching here. In fact I did not plan to use regex at first; it was when I wanted to match the phrase said twice that backreferences came to mind, so regex it was.

It is convenient: for example, "Brother Panshi" is hard to recognize reliably, so (Pan|stone|old|Younger brother)+ matches any fragment of it; and the repeated "take a look" is matched with the backreference \1. During testing the recognizer often returned a shorter transcript, which is why the pattern ends with an optional part ("You can take a look.?"); strictly the optional part could be dropped, but it is kept for illustration.

The pattern (You can take a look.?){1}.*\1 matches a transcript in which the phrase occurs twice, for example:

Look at it. Look at it.

Note that each recognition result is matched against the rules in order, stopping at the first hit, so (You can take a look.?){1}.*\1 must be placed before (You can take a look.?)+; otherwise a doubled "take a look" could only ever trigger the single-phrase rule.
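The ordering and backreference behavior can be checked with plain re; the phrases below are simplified English stand-ins for the original patterns:

```python
import re

# Rules in priority order: the doubled-phrase rule must come first.
rules = [
    (r"(take a look.?){1}.*\1", "doubled"),
    (r"(take a look.?)+", "single"),
]

def match_rule(text):
    # Return the label of the first rule that matches, mimicking the
    # first-hit-wins loop in the main script.
    for pattern, label in rules:
        if re.search(pattern, text):
            return label
    return None

print(match_rule("take a look, take a look"))  # doubled
print(match_rule("take a look"))               # single
print(match_rule("ha ha ha"))                  # None
```

If the two rules were swapped, the (take a look.?)+ rule would also match the doubled input and the backreference rule would never fire.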

Run results:

I spoke six utterances: "you see", "you see you see", "you see what", "Panshi", "666", and "haha" (the text is shown here only for illustration; only audio was actually transmitted).

INFO:root:In the recording...
INFO:root:End of recording, in recognition...
INFO:root:Analysis of Speech
DEBUG:root:{'transcript': 'You can take a look.', 'confidence': 0.97500253}
INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:bingo....
WARNING:root:say: Look at it. Don't worry.
INFO:root:ok
INFO:root:end
--------------------------------------------------------------------
INFO:root:In the recording...
INFO:root:End of recording, in recognition...
INFO:root:Analysis of Speech
DEBUG:root:{'transcript': 'Look at it.', 'confidence': 0.97500247}
INFO:root:(You can take a look.?){1}.*\1
INFO:root:bingo....
WARNING:root:say: I don't look. Let me see you killed again.
INFO:root:ok
INFO:root:end
--------------------------------------------------------------------
INFO:root:In the recording...
INFO:root:End of recording, in recognition...
INFO:root:Analysis of Speech
DEBUG:root:{'transcript': 'What do you see?', 'confidence': 0.958637}
INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:(you.+what)+
INFO:root:bingo....
WARNING:root:say: What's the matter?
INFO:root:ok
INFO:root:end
--------------------------------------------------------------------
INFO:root:In the recording...
INFO:root:End of recording, in recognition...
INFO:root:Analysis of Speech
DEBUG:root:{'transcript': 'Rock', 'confidence': 0.80128425}
INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:(you.+what)+
INFO:root:(666|666)+
INFO:root:(Pan|stone|old|Younger brother)+
INFO:root:bingo....
WARNING:root:say: 666
INFO:root:ok
INFO:root:end
--------------------------------------------------------------------
INFO:root:In the recording...
INFO:root:End of recording, in recognition...
INFO:root:Analysis of Speech
DEBUG:root:{'transcript': '666', 'confidence': 0.91621482}
INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:(you.+what)+
INFO:root:(666|666)+
INFO:root:bingo....
WARNING:root:say: What about Brother Panshi 666?
INFO:root:ok
INFO:root:end
--------------------------------------------------------------------
INFO:root:In the recording...
INFO:root:End of recording, in recognition...
INFO:root:Analysis of Speech
DEBUG:root:{'transcript': 'Ha ha ha', 'confidence': 0.97387952}
INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:(you.+what)+
INFO:root:(666|666)+
INFO:root:(Pan|stone|old|Younger brother)+
DEBUG:root:{'transcript': 'Ha ha ha ha'}
INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:(you.+what)+
INFO:root:(666|666)+
INFO:root:(Pan|stone|old|Younger brother)+
INFO:root:end
INFO:root:In the recording...

Of the six utterances, the first five were recognized and matched; the sixth was, as expected, not matched and got no response. INFO lines are general output, DEBUG lines are the results returned by the Google service (not all of them: on the first match the remaining alternatives are skipped), and WARNING lines show the spoken response (audio cannot be embedded in the article, so it is logged to show what was said).

Analysis of the first and last runs:

First run: I said "you see", and the first recognition result was {'transcript': 'You can take a look.', 'confidence': 0.97500253}. It fails the first rule (You can take a look.?){1}.*\1, then matches the second rule (You can take a look.?)+, which breaks out of the rule loop and then out of the test['alternative'] loop. The response "Look at it. Don't worry." is then spoken.

INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:bingo....
WARNING:root:say: Look at it. Don't worry.

Last run: "haha" produced two recognition results, 'Ha ha ha' and 'Ha ha ha ha':

{'transcript': 'Ha ha ha', 'confidence': 0.97387952}
{'transcript': 'Ha ha ha ha'}

Both were tried against all the rules, and every match failed:

INFO:root:(You can take a look.?){1}.*\1
INFO:root:(You can take a look.?)+
INFO:root:(you.+what)+
INFO:root:(666|666)+
INFO:root:(Pan|stone|old|Younger brother)+

There was no bingo, only end: recognition finished with no response.

Less than 60 lines of code implement speech recognition and response. (I dislike headlines like "XX lines of code implement XXX". The Python articles circulating on public accounts are full of such titles, and I find them off-putting: the code is short because the Python modules are well written. Thank your predecessors instead of being complacent about a headline that attracts the impatient. A note to myself.)

P.S. Writing the code took over two hours, and writing the article most of a day: going from a vague idea to clear wording also takes thought, organization, and integration. Suggestions for improvement are welcome.

Posted by tsfountain on Wed, 31 Jul 2019 20:28:40 -0700