linux apt speech recognition

Keywords: pycurl JSON Linux curl

To be revised




Project introduction: Baidu Voice is used to recognize speech and synthesize Chinese speech, the Turing robot is used for intelligent dialogue, and the pyaudio module is used for recording on Linux. Because pyaudio is incompatible on the Raspberry Pi, the Raspberry Pi version uses arecord to record instead. The final code is about 150 lines and is published on GitHub: https://github.com/luyishisi/python_yuyinduihua

0. Catalogue:

  • 1: Environment construction
  • 2: Baidu speech synthesis and recognition
  • 3: Turing Robot
  • 4: Recording audio on Linux with pyaudio
  • 5: The Raspberry Pi uses arecord for recording
  • 6: Overall debugging on Linux
  • 7: Major bug resolution
  • 8: Source code (Raspberry Pi environment)

1: Environment construction

This part is critical: most of the problems later on come down to environment incompatibilities.

1.1: Linux version

# -*- coding: utf-8 -*-
from pyaudio import PyAudio, paInt16
import numpy as np
from datetime import datetime
import wave
import time
import urllib, urllib2, pycurl
import base64
import json
import os
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )

This part of the environment is the easiest to set up; you just need the installation commands.

Commands like apt-get install python-wave* are straightforward. Installing a module is mostly a matter of finding the right package name; appending * after the module's name lets apt do fuzzy matching, which is half the work here.

If there is a module you don't know how to install, a quick search will turn it up; it is not difficult. You also need mpg123 for playback.
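
To quickly confirm the environment is ready before moving on, a minimal check like the sketch below can help (this is my own sketch, not part of the project code; it only verifies that the Python modules import and that mpg123 is on the PATH).

# -*- coding: utf-8 -*-
# Minimal environment check (my own sketch, not part of the project code)
import os

for name in ['pyaudio', 'numpy', 'pycurl', 'wave', 'json']:
    try:
        __import__(name)
        print name, 'OK'
    except ImportError, e:
        print name, 'MISSING:', e

# mpg123 is used later to play the synthesized speech
if os.system('which mpg123 > /dev/null 2>&1') != 0:
    print 'mpg123 MISSING: install it, e.g. apt-get install mpg123'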

1.2: Raspberry Pi version

If you run into the same errors described in this post, give up on pyaudio decisively and switch to command-line recording instead.

##Update the package first
sudo apt-get update
sudo apt-get upgrade
##Install the necessary programs
sudo apt-get -y install alsa-utils alsa-tools alsa-tools-gui alsamixergui

Major Tools Used

To adjust the speaker volume from the terminal, just run alsamixer. This is also where you manage the recording device: set the recording volume here, and you can see clearly whether your sound card has any problems.

The recording device I used was https://item.taobao.com/item.htm?Spm=a1z10.5-c.w4002-3667091491.40.mktumv&id=41424706506.

The recording command uses arecord

arecord and aplay are command-line recording and playback tools for ALSA sound-card drivers. arecord is the recording program; it supports multiple file formats and sound cards. aplay is the command-line playback tool and also supports multiple file formats.

Command format: this section is worth reading through; the project mainly uses the three parameters -d, -f and -r.

       arecord [flags] [filename]
       aplay [flags] [filename [filename]] ...
Options:
       -h, --help          show help.
       --version           print version information.
       -l, --list-devices  list all sound cards and digital audio devices.
       -L, --list-pcms     list all PCM definitions.
       -D, --device=NAME   specify the PCM device name.
       -q, --quiet         quiet mode.
       -t, --file-type TYPE   file type (voc, wav, raw or au).
       -c, --channels=#    set the channel count.
       -f, --format=FORMAT    sample format. Formats include: S8 U8 S16_LE S16_BE U16_LE
              U16_BE  S24_LE S24_BE U24_LE U24_BE S32_LE S32_BE U32_LE U32_BE
              FLOAT_LE  FLOAT_BE  FLOAT64_LE  FLOAT64_BE   IEC958_SUBFRAME_LE
              IEC958_SUBFRAME_BE MU_LAW A_LAW IMA_ADPCM MPEG GSM
       -r, --rate=<Hz>     set the sampling rate.
       -d, --duration=#    set the duration in seconds.
       -s, --sleep-min=#   set the minimum sleep time.
       -M, --mmap          use an mmapped stream.
       -N, --nonblock      set non-blocking mode.
       -B, --buffer-time=#    buffer duration, in microseconds.
       -v, --verbose       show PCM structure and settings.
       -I, --separate-channels   write each channel to a separate file.

Example:

       aplay -c 1 -t raw -r 22050 -f mu_law foobar
	Play raw file foobar. At 22050 Hz, mono, 8 bits, mu_law format.

       arecord -d 10 -f cd -t wav -D copy foobar.wav
	Record foobar.wav file in CD quality for 10 seconds. Use PCM copy.

2: Baidu speech synthesis and recognition

This part is not very difficult. The test code is as follows.

#speech recognition: post recorded audio to Baidu and print the result
#encoding=utf-8
import wave
import urllib, urllib2, pycurl
import base64
import json
## get access token by api key & secret key
## To get token, you need to fill in your apikey and secretkey
def get_token():
    apiKey = "Ll0c53MSac6GBOtpg22ZSGAU"
    secretKey = "44c8af396038a24e34936227d4a19dc2"

    auth_url = "https://openapi.baidu.com/oauth/2.0/token?grant_type=client_credentials&client_id=" + apiKey + "&client_secret=" + secretKey;

    res = urllib2.urlopen(auth_url)
    json_data = res.read()
    return json.loads(json_data)['access_token']

def dump_res(buf):
    print (buf)

## post audio to server
def use_cloud(token):
    fp = wave.open('2.wav', 'rb')
    ##Sound clips that have been recorded
    nf = fp.getnframes()
    f_len = nf * 2
    audio_data = fp.readframes(nf)

    cuid = "7519663" #Your product id
    srv_url = 'http://vop.baidu.com/server_api' + '?cuid=' + cuid + '&token=' + token
    http_header = [
        'Content-Type: audio/pcm; rate=8000',
        'Content-Length: %d' % f_len
    ]

    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(srv_url)) #curl doesn't support unicode
    #c.setopt(c.RETURNTRANSFER, 1)
    c.setopt(c.HTTPHEADER, http_header)   #must be list, not dict
    c.setopt(c.POST, 1)
    c.setopt(c.CONNECTTIMEOUT, 30)
    c.setopt(c.TIMEOUT, 30)
    c.setopt(c.WRITEFUNCTION, dump_res)
    c.setopt(c.POSTFIELDS, audio_data)
    c.setopt(c.POSTFIELDSIZE, f_len)
    c.perform() #pycurl.perform() has no return val

if __name__ == "__main__":
    token = get_token()
    #Get token
    use_cloud(token)
    #Processing, output inside the function

3: Turing Robot

Official website: http://www.tuling123.com/

Test Code for Turing Robot Part

This part is not difficult either. You register, use the key and API they give you, and the rest is just extracting the text field from the returned JSON.

# -*- coding: utf-8 -*-
import urllib
import json

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

if __name__ == '__main__':

    key = '05ba411481c8cfa61b91124ef7389767'
    api = 'http://www.tuling123.com/openapi/api?key=' + key + '&info='
    while True:
        info = raw_input('I: ')
        request = api + info
        response = getHtml(request)
        dic_json = json.loads(response)
        print 'Robot: '.decode('utf-8') + dic_json['text']
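
One detail worth noting: the loop above appends info to the URL as-is. If the input contains spaces or Chinese characters, it is safer to URL-encode it first. Below is a minimal variation using urllib.quote (my own adjustment, not something the original code requires; the key is a placeholder for your own Turing key).

# -*- coding: utf-8 -*-
# Variation of the request step with URL encoding (my own adjustment)
import urllib
import json

key = 'YOUR_TULING_KEY'  # placeholder: use the key from your own Turing account
api = 'http://www.tuling123.com/openapi/api?key=' + key + '&info='

info = raw_input('I: ')
request = api + urllib.quote(info)        # encode spaces / Chinese characters safely
response = urllib.urlopen(request).read()
dic_json = json.loads(response)
print 'Robot: '.decode('utf-8') + dic_json['text']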

4: Recording audio on Linux with pyaudio

On a normal computer, as long as the environment is in order, this part is easy. The full code is in the overall source below; here is a brief explanation.

The snippet below does not run on its own; it is an excerpt from the overall source, pulled out to aid understanding.

pa is the PyAudio object, from which the current sound level can be read. When the maximum sample value exceeds the threshold (2000 in the code), recording starts, with an additional limit of five seconds.

NUM_SAMPLES = 2000      # Size of the cached blocks in pyaudio
SAMPLING_RATE = 8000    # Sampling frequency
LEVEL = 1500            # Threshold for keeping sound
COUNT_NUM = 20          # Recording is kept once more than COUNT_NUM samples exceed LEVEL within one block of NUM_SAMPLES samples
SAVE_LENGTH = 8         # Minimum length of a recording: SAVE_LENGTH * NUM_SAMPLES samples
# Turn on sound input
pa = PyAudio()
stream = pa.open(format=paInt16, channels=1, rate=SAMPLING_RATE, input=True,
                frames_per_buffer=NUM_SAMPLES)

# Excerpt from inside the main loop of the full source (t and begin are defined there):
string_audio_data = stream.read(NUM_SAMPLES)
# Convert the read data into an array
audio_data = np.fromstring(string_audio_data, dtype=np.short)
# Count the samples larger than LEVEL
large_sample_count = np.sum(audio_data > LEVEL)

temp = np.max(audio_data)
if temp > 2000 and t == 0:
    t = 1  # Start recording
    print "Signal detected, start recording; timing five seconds."
    begin = time.time()
    print temp

5: The Raspberry Pi uses arecord for recording

This section mainly records some general information. On the Raspberry Pi, it is enough if the command below runs successfully; the rest is reference material from my research.

sudo arecord -D "plughw:1,0" -d 5 f1.wav

Parameter interpretation: -D selects the device; the external device is plughw:1,0, the built-in device is plughw:0,0, and since the Raspberry Pi itself has no recording module there is no built-in recording device. -d 5 means the recording lasts 5 seconds; without this parameter, recording continues until Ctrl+C. The generated file is named f1.wav.

Baidu Voice requires 16-bit audio, so you also need to set -f (for example -f S16_LE).
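
To check that a recording really matches these requirements, a small sketch like the one below can be run on the Pi (my own sketch; it records a test clip with the -D, -d, -f and -r parameters and then inspects the WAV header with the wave module).

# -*- coding: utf-8 -*-
# Record a 5-second test clip with arecord and verify its format (my own sketch)
import os
import wave

# -D selects the sound card, -f the sample format, -r the sample rate, -d the duration
os.system('arecord -D "plughw:1,0" -f S16_LE -r 8000 -d 5 test.wav')

fp = wave.open('test.wav', 'rb')
print 'channels   :', fp.getnchannels()       # expect 1 (mono)
print 'sample rate:', fp.getframerate()       # expect 8000, matching the rate in the HTTP header
print 'bits       :', fp.getsampwidth() * 8   # expect 16, as Baidu requires
fp.close()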

The specific PCM formats are as follows:

These are just different ways for PCM to express the sample range: the minimum values are all equivalent, the maximum values are all equivalent, and intermediate values scale proportionally, so every format can be mapped onto the range -1 to 1.

  • S8: signed 8 bits, signed char, range -128 to 127
  • U8: unsigned 8 bits, unsigned char, range 0 to 255
  • S16_LE: little-endian signed 16 bits, signed short, range -32768 to 32767
  • S16_BE: big-endian signed 16 bits, signed short with bytes swapped (PPC), range -32768 to 32767
  • U16_LE: little-endian unsigned 16 bits, unsigned short, range 0 to 65535
  • U16_BE: big-endian unsigned 16 bits, unsigned short with bytes swapped (PPC), range 0 to 65535
  • There are also S24_LE, S32_LE and so on; PCM can use any of these representations.
  • Among the values above, all the minimums (-128, 0, -32768, -32768, 0, 0) mean the same thing for PCM: the minimum, which quantizes to the floating-point value -1. All the maximums likewise mean the same thing and quantize to 1, and other values convert in equal proportion.

PCMU presumably refers to unsigned PCM (U8, U16_LE, U16_BE, ...), and PCMA to signed PCM (S8, S16_LE, S16_BE, ...).
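
As a small illustration of the mapping described above, signed 16-bit samples can be scaled into the -1 to 1 range like this (illustrative sketch only, using numpy):

# Map signed 16-bit PCM samples onto the range -1..1 (illustrative sketch)
import numpy as np

samples = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)
scaled = samples / 32768.0   # the minimum maps to -1.0, other values scale proportionally
print scaled                 # [-1.  -0.5  0.   0.5  0.99996948]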

View sound card

cat /proc/asound/cards

cat /proc/asound/modules

6: Overall debugging on Linux

The source code is as follows; explanations are in the comments.

# -*- coding: utf-8 -*-
from pyaudio import PyAudio, paInt16
import numpy as np
from datetime import datetime
import wave
import time
import urllib, urllib2, pycurl
import base64
import json
import os
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )
#Some global variables
save_count = 0
save_buffer = []
t = 0
sum = 0
time_flag = 0
flag_num = 0
filename = ''
duihua = '1'

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def get_token():
    apiKey = "Ll0c53MSac6GBOtpg22ZSGAU"
    secretKey = "44c8af396038a24e34936227d4a19dc2"
    auth_url = "https://openapi.baidu.com/oauth/2.0/token?grant_type=client_credentials&client_id=" + apiKey + "&client_secret=" + secretKey;
    res = urllib2.urlopen(auth_url)
    json_data = res.read()
    return json.loads(json_data)['access_token']

def dump_res(buf):#Output of Baidu Speech Recognition
    global duihua
    print "String type"
    print (buf)
    a = eval(buf)
    print type(a)
    if a['err_msg']=='success.':
        #print a['result'][0]#At last, we can output and return the statement here.
        duihua = a['result'][0]
        print duihua

def use_cloud(token):#Speech recognition: post the recorded audio to Baidu's server
    fp = wave.open(filename, 'rb')
    nf = fp.getnframes()
    f_len = nf * 2
    audio_data = fp.readframes(nf)
    cuid = "7519663" #product id
    srv_url = 'http://vop.baidu.com/server_api' + '?cuid=' + cuid + '&token=' + token
    http_header = [
        'Content-Type: audio/pcm; rate=8000',
        'Content-Length: %d' % f_len
    ]

    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(srv_url)) #curl doesn't support unicode
    #c.setopt(c.RETURNTRANSFER, 1)
    c.setopt(c.HTTPHEADER, http_header)   #must be list, not dict
    c.setopt(c.POST, 1)
    c.setopt(c.CONNECTTIMEOUT, 30)
    c.setopt(c.TIMEOUT, 30)
    c.setopt(c.WRITEFUNCTION, dump_res)
    c.setopt(c.POSTFIELDS, audio_data)
    c.setopt(c.POSTFIELDSIZE, f_len)
    c.perform() #pycurl.perform() has no return val

# Save the data in data to a WAV file called filename
def save_wave_file(filename, data):
    wf = wave.open(filename, 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(SAMPLING_RATE)
    wf.writeframes("".join(data))
    wf.close()


NUM_SAMPLES = 2000      # The size of cached blocks in pyAudio
SAMPLING_RATE = 8000    # Sampling frequency
LEVEL = 1500            # Threshold of sound preservation
COUNT_NUM = 20          # Recording is kept once more than COUNT_NUM samples exceed LEVEL within one block of NUM_SAMPLES samples
SAVE_LENGTH = 8         # Minimum length of a recording: SAVE_LENGTH * NUM_SAMPLES samples

# Turn on the sound input pyaudio object
pa = PyAudio()
stream = pa.open(format=paInt16, channels=1, rate=SAMPLING_RATE, input=True,
                frames_per_buffer=NUM_SAMPLES)


token = get_token()#Get token
key = '05ba411481c8cfa61b91124ef7389767' #Key and API settings
api = 'http://www.tuling123.com/openapi/api?key=' + key + '&info='

while True:
    # Read in NUM_SAMPLES Samples
    string_audio_data = stream.read(NUM_SAMPLES)
    # Converting read data into arrays
    audio_data = np.fromstring(string_audio_data, dtype=np.short)
    # Calculate the number of samples larger than LEVEL
    large_sample_count = np.sum( audio_data > LEVEL )

    temp = np.max(audio_data)
    if temp > 2000 and t == 0:
        t = 1#Open recording
        print "Detect the signal and start recording,Time five seconds."
        begin = time.time()
        print temp
    if t:
        print np.max(audio_data)
        if np.max(audio_data)<1000:
            sum += 1
            print sum
        end = time.time()
        if end-begin>5:
            time_flag = 1
            print "Five seconds, ready to end"
        # If the number is greater than COUNT_NUM, save at least SAVE_LENGTH blocks
        if large_sample_count > COUNT_NUM:
            save_count = SAVE_LENGTH
        else:
            save_count -= 1

        if save_count < 0:
            save_count = 0

        if save_count > 0:
            # Store the data to be saved in save_buffer
            save_buffer.append(string_audio_data )
        else:
            # Write the data in save_buffer to the WAV file whose name is the time to save
            #if  time_flag:
            if len(save_buffer) > 0  or time_flag:
                #filename = datetime.now().strftime("%Y-%m-%d_%H_%M_%S") + ".wav"#Originally, time was used as a name.
                filename = str(flag_num)+".wav"
                flag_num += 1

                save_wave_file(filename, save_buffer)
                save_buffer = []
                t = 0
                sum =0
                time_flag = 0
                print filename, "Saved successfully; speech recognition in progress"
                use_cloud(token)
                print duihua
                info = duihua
                duihua = ""
                request = api + info
                response = getHtml(request)
                dic_json = json.loads(response)

                #print 'Robot: '.decode('utf-8') + dic_json['text']#The trouble here is character encoding.
                #huida = ' '.decode('utf-8') + dic_json['text']
                a = dic_json['text']
                print type(a)
                unicodestring = a

                # Converting Unicode into a normal Python string: "encode"
                utf8string = unicodestring.encode("utf-8")

                print type(utf8string)
                print str(a)
                url = "http://tsn.baidu.com/text2audio?tex="+dic_json['text']+"&lan=zh&per=0&pit=1&spd=7&cuid=7519663&ctp=1&tok=24.a5f341cf81c523356c2307b35603eee6.2592000.1464423912.282335-7519663"
                os.system('mpg123 "%s"'%(url))#Play with mpg123

7: Major bug resolution

Apart from the environment factors, the two remaining bugs were Chinese encoding and object parsing. For the parsing: Baidu speech recognition returns a dictionary in which some fields are plain strings and some are arrays. First read the err_msg string to check whether it is 'success.', then read the first element of the result array, which contains the recognized Chinese text.
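
A sketch of that parsing step is shown below. The project code uses eval on the response body; json.loads is an equivalent I would suggest, since the body is JSON text:

# Parse the Baidu recognition response (sketch; the project code uses eval instead)
import json

def parse_result(buf):
    a = json.loads(buf)                 # the response body is JSON text
    if a.get('err_msg') == 'success.':  # plain string field: did recognition succeed?
        return a['result'][0]           # 'result' is an array; the first entry is the Chinese text
    return None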

Another bug is Chinese encoding.

import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )

#Also

#print 'Robot: '.decode('utf-8') + dic_json['text']
#huida = ' '.decode('utf-8') + dic_json['text']
a = dic_json['text']
print type(a)
unicodestring = a

# Converting Unicode into a normal Python string: "encode"
utf8string = unicodestring.encode("utf-8")

The main problems when porting to the Raspberry Pi were the arecord command complaining that the device or file could not be found, which means the wrong sound card was selected, and the recording volume being too low. Use alsamixer to pick the right device and set the levels clearly.

There is also the question of recognition accuracy. The main point is that Baidu has its own requirements, so you have to record at 16 bits. Then listen to the recording again to check whether the volume is too high or the sound is badly distorted. It is best to test each part separately.
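
Listening back is easiest with aplay, installed with alsa-utils earlier (a trivial sketch, assuming a clip called test.wav recorded as in section 5):

# Play back a recorded clip to check volume and noise (trivial sketch)
import os
os.system('aplay test.wav')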

8: Source code - Raspberry Pi environment

pyaudio kept throwing errors I did not want to deal with, so in the end I bypassed it: the arecord command does the recording and Python simply calls it. The code is much shorter, but it loses the ability to process sound in real time.

# -*- coding: utf-8 -*-
#from pyaudio import PyAudio, paInt16   # not needed in this version: recording is done with arecord
#import numpy as np                     # not needed either
from datetime import datetime
import wave
import time
import urllib, urllib2, pycurl
import base64
import json
import os
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )

save_count = 0
save_buffer = []
t = 0
sum = 0
time_flag = 0
flag_num = 0
filename = '2.wav'
duihua = '1'

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def get_token():
    apiKey = "Ll0c53MSac6GBOtpg22ZSGAU"
    secretKey = "44c8af396038a24e34936227d4a19dc2"
    auth_url = "https://openapi.baidu.com/oauth/2.0/token?grant_type=client_credentials&client_id=" + apiKey + "&client_secret=" + secretKey;
    res = urllib2.urlopen(auth_url)
    json_data = res.read()
    return json.loads(json_data)['access_token']

def dump_res(buf):
    global duihua
    print "String type"
    print (buf)
    a = eval(buf)
    print type(a)
    if a['err_msg']=='success.':
        #print a['result'][0]#At last, we can output and return the statement here.
        duihua = a['result'][0]
        print duihua

def use_cloud(token):
    fp = wave.open(filename, 'rb')
    nf = fp.getnframes()
    f_len = nf * 2
    audio_data = fp.readframes(nf)
    cuid = "7519663" #product id
    srv_url = 'http://vop.baidu.com/server_api' + '?cuid=' + cuid + '&token=' + token
    http_header = [
        'Content-Type: audio/pcm; rate=8000',
        'Content-Length: %d' % f_len
    ]

    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(srv_url)) #curl doesn't support unicode
    #c.setopt(c.RETURNTRANSFER, 1)
    c.setopt(c.HTTPHEADER, http_header)   #must be list, not dict
    c.setopt(c.POST, 1)
    c.setopt(c.CONNECTTIMEOUT, 30)
    c.setopt(c.TIMEOUT, 30)
    c.setopt(c.WRITEFUNCTION, dump_res)
    c.setopt(c.POSTFIELDS, audio_data)
    c.setopt(c.POSTFIELDSIZE, f_len)
    c.perform() #pycurl.perform() has no return val

# Save the data in data to a WAV file called filename (kept from the Linux version; not actually used here)
def save_wave_file(filename, data):
    wf = wave.open(filename, 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(SAMPLING_RATE)
    wf.writeframes("".join(data))
    wf.close()

token = get_token()
key = '05ba411481c8cfa61b91124ef7389767'
api = 'http://www.tuling123.com/openapi/api?key=' + key + '&info='

while(True):
    os.system('arecord -D "plughw:1,0" -f S16_LE -d 5 -r 8000 /home/luyi/yuyinduihua/2.wav')
    use_cloud(token)
    print duihua
    info = duihua
    duihua = ""
    request = api   + info
    response = getHtml(request)
    dic_json = json.loads(response)

    a = dic_json['text']
    print type(a)
    unicodestring = a

    # Converting Unicode into a normal Python string: "encode"
    utf8string = unicodestring.encode("utf-8")

    print type(utf8string)
    print str(a)
    url = "http://tsn.baidu.com/text2audio?tex="+dic_json['text']+"&lan=zh&per=0&pit=1&spd=7&cuid=7519663&ctp=1&tok=24.a5f341cf81c523356c2307b35603eee6.2592000.1464423912.282335-7519663"
    os.system('mpg123 "%s"'%(url))
