Modeling method for qualitative analysis of near-infrared spectroscopy based on a one-dimensional convolutional neural network (1-D CNN)

Keywords: Python Machine Learning Deep Learning

Catalogue of series articles

Near-infrared spectroscopy is an interdisciplinary field that requires cooperation between chemistry, computer science, bioscience and other disciplines. Under the guidance of Yang Huihua's team at Beijing University of Posts and Telecommunications, we are therefore preparing to open-source classical algorithms such as PLS, SVM, ANN and RF; preprocessing methods such as SG, MSC, first-order derivative and second-order derivative; wavelength selection algorithms such as GA; and recent deep learning algorithms such as CNN and AE. The aim is to help other practitioners more easily build near-infrared spectral models with good predictive ability and robustness.

Preface

Near-infrared (NIR) light is electromagnetic radiation between visible light and mid-infrared light, with a wavelength range of 1100 ∼ 2526 nm. The NIR region coincides with the absorption regions of the combination bands and overtones of the vibrations of hydrogen-containing groups (O-H, N-H, C-H, S-H) in organic molecules. Scanning the NIR spectrum of a sample therefore yields characteristic information about these groups, and the spectrum is often used as an effective carrier of sample information. Detection methods based on near-infrared spectroscopy (NIRS) are convenient, efficient, accurate, low-cost, usable on site and non-destructive, and they are widely used in many detection fields.

However, NIR spectra also have drawbacks: broad bands, severe overlap, weak absorption signals and information that is complex to interpret. Unlike common chemical analysis methods, NIRS is an indirect measurement technique and cannot directly give the content or category of the measured sample. It relies on chemometric methods: a correlation model (or calibration model) is established between the attribute values of the samples and their NIR spectra, and the model is then applied to the NIR spectra of unknown samples to predict their attribute values. Existing NIR modeling methods fall into two groups: classical modeling (preprocessing and wavelength selection to reduce dimensionality and highlight features, followed by modeling with algorithms such as PLS and SVM) and deep learning methods (end-to-end modeling with low dependence on preprocessing and wavelength selection).
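To make the preprocessing step of the classical route more concrete, here is a minimal NumPy sketch of two common scatter corrections, SNV and MSC. The function names `snv` and `msc` and the synthetic spectra are assumptions made for illustration; this is not necessarily how the preprocessed tables used later in this article were produced.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=np.float64)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=np.float64)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Least-squares fit of row = intercept + slope * reference
        slope, intercept = np.polyfit(reference, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected

# Synthetic stand-in for real spectra: one base curve with a random gain and offset per sample
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 3, 404)) + 2.0
raw = rng.uniform(0.8, 1.2, (5, 1)) * base + rng.uniform(-0.1, 0.1, (5, 1))
snv_out = snv(raw)
msc_out = msc(raw)
```

On this synthetic data every sample differs from the others only by a gain and an offset, so MSC maps all rows back onto the mean spectrum. Real spectra also contain chemical differences, which the correction is meant to leave in place.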

This article describes a CNN-based method for qualitative analysis (classification) modeling of near-infrared spectra.

1, Data source

An open-source drug dataset is used, with 310 samples in total and 404 variables (spectral points) per sample. The samples are divided into four categories according to their active ingredients. Download address
The spectra are shown below:

![drug data spectrum](https://img-blog.csdnimg.cn/7d76d4b1ed4d4f71a150298bdae3da74.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBARWNob19Db2Rl,size_20,color_FFFFFF,t_70,g_se,x_16)

2, Code interpretation

1.1 reading data

# Load data
import numpy as np
from sklearn.model_selection import train_test_split

# plotspc and ZspPocess are helper functions from the author's repository (not shown here)

def TableDataLoad(tp, test_ratio, start, end, seed):

    data_path = './/Data//table.csv'
    Rawdata = np.loadtxt(data_path, dtype=np.float64, delimiter=',', skiprows=0)
    table_random_state = seed

    # Preprocessed spectra are stored in separate files; the labels always come from the raw table
    preprocessed_paths = {
        'SG': './/Data//TableSG.csv',
        'SNV': './/Data//TableSNV.csv',
        'MSC': './/Data//TableMSC.csv',
    }
    if tp == 'raw':
        data_x = Rawdata[:, start:end]
    elif tp in preprocessed_paths:
        data = np.loadtxt(preprocessed_paths[tp], dtype=np.float64, delimiter=',', skiprows=0)
        data_x = data[:, start:end]
    else:
        raise ValueError('Unknown preprocessing type: {}'.format(tp))
    data_y = Rawdata[:, -1]  # the last column of the raw table is used as the class label
    x_col = np.linspace(7400, 10507, 404)  # spectral axis used for plotting
    plotspc(x_col, data_x[:, :], tp=0)

    x_data = np.array(data_x)
    y_data = np.array(data_y)

1.2 dividing data sets

    X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_ratio, random_state=table_random_state)

    # The set sizes can only be printed once the split has been made
    print('Training set size:{}'.format(len(X_train[:,0])))
    print('Test set size:{}'.format(len(X_test[:,0])))

1.3 return the training set and test set and draw spectral pictures

    data_train, data_test = ZspPocess(X_train, X_test, y_train, y_test, need=True)  # need=True for the CNN; False is only used in the preprocessing comparison
    return data_train, data_test
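
ZspPocess itself is not listed in this article. As a rough idea of what such a step has to deliver, the sketch below standardizes the spectra with training-set statistics and wraps them so that each item has the (channel, length) layout that nn.Conv1d expects. The names `SpectraDataset` and `make_datasets`, and the reading of `need` as "apply standardization", are assumptions for illustration and not the author's implementation.

```python
import numpy as np

class SpectraDataset:
    """Map-style dataset (hypothetical): item i is (spectrum of shape (1, n_points), label)."""

    def __init__(self, spectra, labels):
        # Add a channel axis so a batch has shape (batch, 1, n_points)
        self.spectra = np.asarray(spectra, dtype=np.float32)[:, np.newaxis, :]
        self.labels = np.asarray(labels, dtype=np.int64)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.spectra[index], self.labels[index]

def make_datasets(X_train, X_test, y_train, y_test, need=True):
    """Hypothetical stand-in for ZspPocess: optional standardization, then wrapping."""
    if need:
        # Use training-set statistics only, so no test information leaks into the scaling
        mean = X_train.mean(axis=0)
        std = X_train.std(axis=0)
        std[std == 0] = 1.0
        X_train = (X_train - mean) / std
        X_test = (X_test - mean) / std
    return SpectraDataset(X_train, y_train), SpectraDataset(X_test, y_test)

# Random numbers standing in for the real table
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(8, 404)), rng.normal(size=(2, 404))
y_train, y_test = rng.integers(0, 4, 8), rng.integers(0, 4, 2)
data_train, data_test = make_datasets(X_train, X_test, y_train, y_test)
x0, y0 = data_train[0]
print(x0.shape, len(data_train), len(data_test))  # (1, 404) 8 2
```

Any object with `__len__` and `__getitem__` can be passed to torch.utils.data.DataLoader, which collates the NumPy items into tensors, so the sketch needs no PyTorch import of its own.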

![displayed spectral curve](https://img-blog.csdnimg.cn/2d09358a92d4476fb073c3705d57561d.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBARWNob19Db2Rl,size_20,color_FFFFFF,t_70,g_se,x_16)

2.1 establishment of 1-D CNN model

import torch
import torch.nn as nn

class CNN3Lyaers(nn.Module):
    def __init__(self, nls):
        super(CNN3Lyaers, self).__init__()
        self.CONV1 = nn.Sequential(
            nn.Conv1d(1, 64, 21, 1),  # in_channels, out_channels, kernel_size, stride
            nn.BatchNorm1d(64),  # batch normalization over the 64 channels
            nn.ReLU(),
            nn.MaxPool1d(3, 3)  # kernel_size, stride
        )
        self.CONV2 = nn.Sequential(
            nn.Conv1d(64, 64, 19, 1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(3, 3)
        )
        self.CONV3 = nn.Sequential(
            nn.Conv1d(64, 64, 17, 1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(3, 3),
        )
        self.fc = nn.Sequential(
            nn.Linear(384, nls)  # 64 channels x 6 points remain for a 404-point input
        )

    def forward(self, x):
        x = self.CONV1(x)  # x has shape (batch, 1, 404) on input
        x = self.CONV2(x)
        x = self.CONV3(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, 384)
        out = self.fc(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself,
        # so an extra F.softmax here would be applied twice during training
        return out
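
The input size of the final nn.Linear layer (384) follows from the length of the spectrum and the kernel sizes. The short check below retraces that arithmetic in plain Python, using the output-length formulas documented for nn.Conv1d and nn.MaxPool1d with no padding; the helper names `conv_out_len` and `pool_out_len` are made up for this sketch.

```python
def conv_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def pool_out_len(length, kernel_size, stride):
    """Output length of a 1-D max pooling layer without padding."""
    return (length - kernel_size) // stride + 1

length = 404  # number of spectral points per sample
trace = [length]
for conv_kernel in (21, 19, 17):  # kernel sizes of CONV1, CONV2, CONV3
    length = conv_out_len(length, conv_kernel)
    trace.append(length)
    length = pool_out_len(length, 3, 3)  # MaxPool1d(3, 3)
    trace.append(length)

flattened = 64 * length  # 64 output channels after CONV3
print(trace, flattened)  # [404, 384, 128, 110, 36, 20, 6] 384
```

If the spectra are cropped (a different `start:end` range in TableDataLoad), the same calculation gives the value that nn.Linear needs in place of 384.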

2.2 training 1-D CNN model

import numpy as np
import torch.optim as optim
from sklearn.metrics import accuracy_score

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# EarlyStopping is a helper class from the author's repository (not shown here)

def CNNTrain(tp, test_ratio, BATCH_SIZE, n_epochs, nls):

    data_train, data_test = TableDataLoad(tp, test_ratio, 0, 404, seed=80)
    train_loader = torch.utils.data.DataLoader(data_train, batch_size=BATCH_SIZE, shuffle=True)
    test_loader = torch.utils.data.DataLoader(data_test, batch_size=BATCH_SIZE, shuffle=False)

    store_path = ".//model//all//CNN18"

    model = CNN3Lyaers(nls=nls).to(device)
    optimizer = optim.Adam(model.parameters(),
                           lr=0.0001, weight_decay=0.0001)
    # Halve the learning rate when the monitored loss has not improved for 10 epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.5, eps=1e-06,
                                                           patience=10)
    criterion = nn.CrossEntropyLoss().to(device)  # standard cross-entropy loss for multi-class classification
    early_stopping = EarlyStopping(patience=30, delta=1e-4, path=store_path, verbose=False)

    for epoch in range(n_epochs):
        train_acc = []
        model.train()  # training mode: batch normalization uses batch statistics
        for i, data in enumerate(train_loader):
            inputs, labels = data  # unpack the batch into spectra and class labels
            inputs = inputs.float().to(device)  # batch x
            labels = labels.long().to(device)  # batch y
            output = model(inputs)  # CNN output (class scores)
            train_loss = criterion(output, labels)  # cross entropy loss
            optimizer.zero_grad()  # clear gradients for this training step
            train_loss.backward()  # backpropagation, compute gradients
            optimizer.step()  # apply gradients
            _, predicted = torch.max(output.data, 1)
            y_predicted = predicted.detach().cpu().numpy()
            y_label = labels.detach().cpu().numpy()
            acc = accuracy_score(y_label, y_predicted)
            train_acc.append(acc)

        # Note: the test set doubles as the validation set for the scheduler and early stopping
        model.eval()  # evaluation mode: batch normalization uses running statistics
        with torch.no_grad():  # no gradients are needed for evaluation
            test_acc = []
            testloss = []
            for i, data in enumerate(test_loader):
                inputs, labels = data
                inputs = inputs.float().to(device)  # batch x
                labels = labels.long().to(device)  # batch y
                outputs = model(inputs)  # class scores for the batch
                test_loss = criterion(outputs, labels)  # cross entropy loss
                _, predicted = torch.max(outputs.data, 1)
                predicted = predicted.cpu().numpy()
                labels = labels.cpu().numpy()
                acc = accuracy_score(labels, predicted)
                test_acc.append(acc)
                testloss.append(test_loss.item())
            avg_loss = np.mean(testloss)  # mean loss over all test batches

        scheduler.step(avg_loss)
        early_stopping(avg_loss, model)
        if early_stopping.early_stop:
            print(f'Early stopping! Best validation loss: {early_stopping.get_best_score()}')
            break
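
The EarlyStopping class used in the training loop is not listed in this article. The sketch below is one plausible minimal version that matches the way it is called above (`patience`, `delta`, `path`, `verbose`, the `early_stop` flag and `get_best_score()`). Its internals, including saving the model's `state_dict` to `path` on every improvement, are assumptions and may differ from the author's helper.

```python
class EarlyStopping:
    """Stop training when the monitored loss has not improved for `patience` checks."""

    def __init__(self, patience=30, delta=1e-4, path=None, verbose=False):
        self.patience = patience
        self.delta = delta      # minimum decrease that counts as an improvement
        self.path = path        # where to store the best weights (None: do not save)
        self.verbose = verbose
        self.best_loss = None
        self.counter = 0
        self.early_stop = False

    def __call__(self, val_loss, model):
        if self.best_loss is None or val_loss < self.best_loss - self.delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.path is not None:
                self._save(model)
        else:
            self.counter += 1
            if self.verbose:
                print(f'No improvement for {self.counter} of {self.patience} checks')
            if self.counter >= self.patience:
                self.early_stop = True

    def _save(self, model):
        import torch  # imported here so the class can be exercised without PyTorch
        torch.save(model.state_dict(), self.path)

    def get_best_score(self):
        return self.best_loss

# Toy run with a short patience and no saving (path=None)
stopper = EarlyStopping(patience=2, delta=0.0)
for loss in [1.0, 0.8, 0.9, 0.85, 0.7]:
    stopper(loss, model=None)
    if stopper.early_stop:
        break
print(stopper.get_best_score(), stopper.counter)  # 0.8 2
```

In the toy run the loss fails to beat 0.8 twice in a row, so training stops before the later value 0.7 is ever seen. That is the trade-off a small `patience` makes.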

2.3 testing 1-D CNN model

def CNNtest(tp, test_ratio, BATCH_SIZE, nls):
    data_train, data_test = TableDataLoad(tp, test_ratio, 0, 404, seed=80)
    test_loader = torch.utils.data.DataLoader(data_test, batch_size=BATCH_SIZE, shuffle=False)

    store_path = ".//model//all//CNN18"

    model = CNN3Lyaers(nls=nls).to(device)

    model.load_state_dict(torch.load(store_path, map_location=device))  # weights stored during training
    model.eval()  # evaluation mode: batch normalization uses running statistics
    all_labels, all_predicted = [], []
    with torch.no_grad():  # no gradients are needed for testing
        for i, data in enumerate(test_loader):
            inputs, labels = data  # unpack the batch into spectra and class labels
            inputs = inputs.float().to(device)  # batch x
            labels = labels.long().to(device)  # batch y
            outputs = model(inputs)  # class scores, shape (batch, nls)
            _, predicted = torch.max(outputs.data, 1)
            all_predicted.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    # Accuracy over all test samples (a mean of per-batch accuracies would overweight a smaller last batch)
    return accuracy_score(all_labels, all_predicted)

3 run the CNN model and view the results

Get model accuracy

To take a closer look at where the model makes errors, we examine the confusion matrix. The results are as follows:
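
For readers who want to build such a matrix from their own predictions, here is a minimal NumPy sketch. The labels are made up for illustration and are not results from this article, and the function name `confusion_matrix_counts` is hypothetical. `sklearn.metrics.confusion_matrix(y_true, y_pred)` follows the same convention, with rows for true classes and columns for predicted classes.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    # Add 1 at every (true, predicted) pair; np.add.at handles repeated index pairs
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix

# Made-up labels for four classes, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 1]
cm = confusion_matrix_counts(y_true, y_pred, n_classes=4)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_class_recall)  # [1.  0.5 1.  0.5]
```

Off-diagonal entries show which classes are confused with each other. In the toy labels, one class-1 sample is predicted as class 2 and one class-3 sample as class 1.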

Summary

The complete code is available from the GitHub repository.
The code is for academic use only. If you have any questions, please contact QQ: 1427950662, WeChat: Fu_siry

Posted by mgallforever on Sat, 20 Nov 2021 15:07:42 -0800