Speech emotion recognition based on CNN+MFCC

In recent years, with the rapid development of information technology, intelligent devices have gradually become part of people's daily lives. Voice, one of the most convenient modes of human-computer interaction, is now widely used. Making machines understand human speech and communicate naturally with people has long been the goal of countless researchers. The core of speech emotion recognition is to build a computing system that can analyze and recognize human emotion from speech, enabling natural interaction between humans and machines.

The main task of speech emotion recognition is to extract the emotional information contained in speech and classify it. There are currently two ways to describe emotion. The first is a discrete division into the basic emotions commonly used in daily life, such as anger, happiness, excitement, sadness, and disgust; the second is a continuous, dimensional description that distinguishes emotions mainly by valence and arousal (degree of activation).
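
To make the two descriptions concrete, here is a minimal sketch in Python (the label set mirrors the CASIA categories used later; the valence/arousal coordinates are illustrative assumptions only, not measured data):

# Discrete description: each utterance gets a single categorical label.
discrete_labels = ['angry', 'fear', 'happy', 'neutral', 'sad', 'surprise']

# Dimensional description: each utterance is a point in valence/arousal space.
# The coordinates below are rough illustrative values only.
dimensional = {
    'angry':   {'valence': -0.7, 'arousal':  0.8},
    'happy':   {'valence':  0.8, 'arousal':  0.6},
    'sad':     {'valence': -0.6, 'arousal': -0.5},
    'neutral': {'valence':  0.0, 'arousal':  0.0},
}

This article uses the discrete description, with the six CASIA categories as the class labels.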

As with any classification task, feature selection is the most critical step. The speech feature used in this article is the Mel-frequency cepstral coefficient (MFCC), a compact description of the spectral envelope computed on the perceptually motivated mel scale. For what MFCCs are and how to extract them, please refer to the article "Python voice signal processing".

This article draws on the MITESHPUTHRANNEU/Speech-Emotion-Analyzer project. Below, we introduce how to analyze speech emotion with a convolutional neural network.

Neural network structure

The architecture used is very simple: a stack of 1D convolutional layers with dropout and max pooling, followed by a softmax classifier. The full definition is given in the "Define model" section below.

Dataset

I used the CASIA speech emotion database. The CASIA Chinese emotional corpus was recorded by the Institute of Automation, Chinese Academy of Sciences. It contains 9,600 utterances from four professional speakers covering six emotions: angry, happy, fear, sad, surprise, and neutral. Among them, 300 sentences share the same text, i.e., the same text is read with each of the different emotions; these recordings can be used to compare acoustic and prosodic behavior across emotional states. The other 100 sentences have different texts whose emotional attribution is clear from their literal meaning, which helps the speakers express the intended emotion more accurately.

However, the complete CASIA dataset is not free, so I only found an incomplete subset of 1,200 utterances. I have put the data I found on my network drive: https://pan.baidu.com/s/1EsRoKaF17Q_3s2t7OMNibQ.

Feature extraction

I use the librosa module to extract the MFCCs. The extraction code is as follows.

%matplotlib inline
import librosa
import matplotlib.pyplot as plt
import numpy as np

path = r'D:\NLP\dataset\Voice emotion\test.wav'

# Load the audio at its native sampling rate
y, sr = librosa.load(path, sr=None)

def normalizeVoiceLen(y, normalizedLen):
    # Pad with zeros or truncate so every clip has exactly normalizedLen samples
    # (2 s = 32000 samples at 16 kHz)
    nframes = len(y)
    y = np.reshape(y, [nframes, 1]).T
    if nframes < normalizedLen:
        res = normalizedLen - nframes
        res_data = np.zeros([1, res], dtype=np.float32)
        y = np.c_[y, res_data]
    else:
        y = y[:, 0:normalizedLen]
    return y[0]

def getNearestLen(framelength, sr):
    # Convert the desired frame length (in seconds) into samples
    framesize = framelength * sr
    # Find the candidate power of two closest to the current framesize
    nfftdict = {}
    lists = [32, 64, 128, 256, 512, 1024]
    for i in lists:
        nfftdict[i] = abs(framesize - i)
    # Sort candidates by their distance to framesize, ascending
    sortlist = sorted(nfftdict.items(), key=lambda x: x[1])
    # Use the closest candidate as the new framesize
    framesize = int(sortlist[0][0])
    return framesize

VOICE_LEN = 32000
# Get the n_fft length from a 0.25 s frame
N_FFT = getNearestLen(0.25, sr)
# Keep only the first two seconds of audio
y = normalizeVoiceLen(y, VOICE_LEN)
print(y.shape)
# Extract the MFCC features
mfcc_data = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=N_FFT, hop_length=int(N_FFT/4))

# Visualize the MFCCs: rows are the 13 coefficients, columns are frames, so time runs horizontally
plt.matshow(mfcc_data)
plt.title('MFCC')

The code above loads the sound and takes its first two seconds for emotion analysis. The getNearestLen() function chooses a suitable frame length for the Fourier transform based on the sampling rate of the audio. The librosa.feature.mfcc() function then extracts the MFCC features, which are visualized.
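
As a quick sanity check of the resulting feature shape (assuming the CASIA recordings are sampled at 16 kHz, which the code does not state explicitly): a 0.25 s frame is 4000 samples, whose nearest candidate in [32, ..., 1024] is 1024, so N_FFT = 1024 and hop_length = 256. With librosa's default centering, the 2 s clip of 32000 samples then produces 1 + 32000 // 256 = 126 frames, which is where the 126 in input_shape=(126, 1) later comes from.

# Sketch: verify the frame count under the 16 kHz assumption
sr_assumed = 16000                       # assumption; in practice use the sr returned by librosa.load
n_fft = getNearestLen(0.25, sr_assumed)  # -> 1024
hop = int(n_fft / 4)                     # -> 256
num_frames = 1 + VOICE_LEN // hop        # librosa pads with center=True, giving 126 frames
print(n_fft, hop, num_frames)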

The following code extracts MFCC features for every file in the dataset, averages the 13 coefficients within each frame (reducing each clip to one value per frame, i.e., a 126-dimensional vector), and saves the result to a file.

# Extract features for the whole dataset
import os
import pickle

counter = 0
fileDirCASIA = r'./Voice emotion/CASIA database'

mfccs = {}
mfccs['angry'] = []
mfccs['fear'] = []
mfccs['happy'] = []
mfccs['neutral'] = []
mfccs['sad'] = []
mfccs['surprise'] = []
mfccs['disgust'] = []  # kept for completeness; not used in the later preprocessing

listdir = os.listdir(fileDirCASIA)
for persondir in listdir:
    if '.' not in persondir:  # skip files, keep speaker directories
        emotionDirName = os.path.join(fileDirCASIA, persondir)
        emotiondir = os.listdir(emotionDirName)
        for ed in emotiondir:
            if '.' not in ed:  # skip files, keep emotion directories
                filesDirName = os.path.join(emotionDirName, ed)
                files = os.listdir(filesDirName)
                for fileName in files:
                    if fileName.endswith('wav'):
                        counter += 1
                        fn = os.path.join(filesDirName, fileName)
                        print(str(counter) + fn)
                        y, sr = librosa.load(fn, sr=None)
                        # Pad/truncate to the normalized 2 s length
                        y = normalizeVoiceLen(y, VOICE_LEN)
                        mfcc_data = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=N_FFT, hop_length=int(N_FFT/4))
                        # Average the 13 coefficients within each frame
                        feature = np.mean(mfcc_data, axis=0)
                        mfccs[ed].append(feature.tolist())

with open('mfcc_feature_dict.pkl', 'wb') as f:
    pickle.dump(mfccs, f)

Data preprocessing

The code is as follows:

%matplotlib inline
import pickle
import os
import librosa 
import matplotlib.pyplot as plt
import numpy as np
from keras import layers
from keras import models
from keras import optimizers
from keras.utils import to_categorical

# Load the saved MFCC features
mfccs={}
with open('mfcc_feature_dict.pkl', 'rb') as f:
    mfccs=pickle.load(f)

# Map emotion names to integer labels
emotionDict={}
emotionDict['angry']=0
emotionDict['fear']=1
emotionDict['happy']=2
emotionDict['neutral']=3
emotionDict['sad']=4
emotionDict['surprise']=5

data = []
labels = []
# Concatenate the features of each emotion and build the corresponding label list
for emotion, label in emotionDict.items():
    print(len(mfccs[emotion]))
    data = data + mfccs[emotion]
    labels = labels + [label] * len(mfccs[emotion])

print(len(data))
print(len(labels))

#Set data dimension
data=np.array(data)
data=data.reshape((data.shape[0],data.shape[1],1))

labels=np.array(labels)
labels=to_categorical(labels)

#Data standardization
DATA_MEAN=np.mean(data,axis=0)
DATA_STD=np.std(data,axis=0)

data-=DATA_MEAN
data/=DATA_STD

Next, save these parameters; they are needed later for model prediction.

paraDict={}
paraDict['mean']=DATA_MEAN
paraDict['std']=DATA_STD
paraDict['emotion']=emotionDict
with open('mfcc_model_para_dict.pkl', 'wb') as f:
    pickle.dump(paraDict, f)

Finally, the dataset is shuffled and split into training and validation sets.

ratioTrain=0.8
numTrain=int(data.shape[0]*ratioTrain)
permutation = np.random.permutation(data.shape[0])
data = data[permutation,:]
labels = labels[permutation,:]

x_train=data[:numTrain]
x_val=data[numTrain:]
y_train=labels[:numTrain]
y_val=labels[numTrain:]

print(x_train.shape)
print(y_train.shape)
print(x_val.shape)
print(y_val.shape)

Define model

Use Keras to define the model. The code is as follows:

from keras.utils import plot_model
from keras import regularizers

model = models.Sequential()
model.add(layers.Conv1D(256,5,activation='relu',input_shape=(126,1)))
model.add(layers.Conv1D(128,5,padding='same',activation='relu',kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.2))
model.add(layers.MaxPooling1D(pool_size=(8)))
model.add(layers.Conv1D(128,5,activation='relu',padding='same',kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.2))
model.add(layers.Conv1D(128,5,activation='relu',padding='same',kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.2))
model.add(layers.Conv1D(128,5,padding='same',activation='relu',kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.2))
model.add(layers.MaxPooling1D(pool_size=(3)))
model.add(layers.Conv1D(256,5,padding='same',activation='relu',kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.2))
model.add(layers.Flatten())
model.add(layers.Dense(6,activation='softmax'))

plot_model(model,to_file='mfcc_model.png',show_shapes=True)
model.summary()

Training model

Compile and train the model:

opt = optimizers.rmsprop(lr=0.0001, decay=1e-6)
model.compile(loss='categorical_crossentropy', optimizer=opt,metrics=['accuracy'])
import keras
callbacks_list=[
    keras.callbacks.EarlyStopping(
        monitor='acc',
        patience=50,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='speechmfcc_model_checkpoint.h5',
        monitor='val_loss',
        save_best_only=True
    ),
    keras.callbacks.TensorBoard(
        log_dir='speechmfcc_train_log'
    )
]
history=model.fit(x_train, y_train, 
                  batch_size=16, 
                  epochs=200, 
                  validation_data=(x_val, y_val),
                 callbacks=callbacks_list)
model.save('speech_mfcc_model.h5')
model.save_weights('speech_mfcc_model_weight.h5')

Visualize the training results:

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model acc')
plt.ylabel('acc')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

As the figures above show, the model begins to overfit after about 60 training epochs; at that point the training accuracy reaches roughly 70% and the validation accuracy roughly 50%. After 200 epochs, the training accuracy reaches 95%.

Test

Finally, the trained model is tested.

from keras.models import load_model
import pickle

model=load_model('speech_mfcc_model.h5')
paradict={}
with open('mfcc_model_para_dict.pkl', 'rb') as f:
    paradict=pickle.load(f)
DATA_MEAN=paradict['mean']
DATA_STD=paradict['std']
emotionDict=paradict['emotion']
edr = dict([(i, t) for t, i in emotionDict.items()])  # invert the mapping: index -> emotion name

import librosa

filePath = r'record1.wav'
y, sr = librosa.load(filePath, sr=None)
y = normalizeVoiceLen(y, VOICE_LEN)  # pad/truncate to the normalized 2 s length
mfcc_data = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=N_FFT, hop_length=int(N_FFT/4))
feature = np.mean(mfcc_data, axis=0)
# Apply the same standardization as in training, then add the batch dimension
feature = feature.reshape((126, 1))
feature -= DATA_MEAN
feature /= DATA_STD
feature = feature.reshape((1, 126, 1))
result = model.predict(feature)
index = np.argmax(result, axis=1)[0]
print(edr[index])  # predicted emotion
