Machine learning notes (day03)

Keywords: Python Machine Learning sklearn

1. Linear relationship model

A function that makes predictions through a linear combination of the attributes:

f(x) = w1*x1 + w2*x2 + ... + wd*xd + b = w^T x + b

w is the weight vector and b is the offset (bias) term, which can be understood as: each weight determines how much the corresponding attribute contributes to the prediction, and b shifts the prediction up or down as a whole.

  2. Linear regression (iteration, self-learning)

Definition: linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable, where the dependent variable is expressed as a linear combination of the independent variables (one kind of linear model).

Univariate linear regression: only one variable (a single feature) is involved

Multiple linear regression: two or more variables (multiple features) are involved

Formula:

h(w) = w0 + w1*x1 + w2*x2 + ... = W^T X

where W and X are matrices (column vectors):

W = (w0, w1, w2, ...)^T,  X = (1, x1, x2, ...)^T

  3. Matrix operation (linear regression operation)
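The matrix operation behind linear regression is a single matrix multiplication; a minimal numpy sketch (shapes and values are assumed purely for illustration):

import numpy as np

# assumed toy shapes: 5 samples, 3 features
X = np.random.rand(5, 3)           # feature matrix, one row per sample
w = np.array([0.4, -1.2, 2.0])     # weight vector, one weight per feature
b = 0.5                            # offset term

# prediction in matrix form: y_hat = X w + b
y_hat = X @ w + b
print(y_hat.shape)                 # (5,), one prediction per sample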

  4. Loss function strategy (error size)

● y_i is the true value of the i-th training sample

● h_w(x_i) is the prediction for the i-th training sample, i.e. the linear combination of its feature values

  Definition of the total loss (least squares method):

J(w) = (h_w(x_1) - y_1)^2 + (h_w(x_2) - y_2)^2 + ... + (h_w(x_m) - y_m)^2

The purpose is to minimize this loss.

  How do we find the w in the model that minimizes the loss? (the goal is to find the W value corresponding to the minimum loss)

  (Optimization method) Solution 1: the normal equation of least squares

Solution: w = (X^T X)^(-1) X^T y

X is the feature-value matrix and y is the target-value matrix

  Disadvantage: when there are too many features, solving is too slow (inverting X^T X is expensive)
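A minimal numpy sketch of the normal equation (random toy data, used purely for illustration):

import numpy as np

# toy data: 100 samples, 3 features, generated from known weights plus noise
X = np.random.rand(100, 3)
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.01 * np.random.randn(100)

# normal equation: w = (X^T X)^(-1) X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)   # should be close to true_w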

  (Optimization method) Solution 2: gradient descent on the least-squares loss (important)

  Take w0 and w1 in the single-variable case as an example; each iteration updates

w0 := w0 - α * ∂cost(w0 + w1*x1) / ∂w0
w1 := w1 - α * ∂cost(w0 + w1*x1) / ∂w1

α is the learning rate and must be specified manually; the partial derivative (gradient) indicates the direction of the update.

Understanding: follow the descending direction of the function to find the lowest point of the valley, then update the W values.

  Usage: suited to tasks with very large amounts of training data
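A minimal gradient-descent sketch for univariate linear regression (synthetic data, learning rate and iteration count chosen only for illustration):

import numpy as np

# synthetic data y = 3x + 2 with a little noise
x = np.linspace(0, 1, 50)
y = 3 * x + 2 + 0.05 * np.random.randn(50)

w0, w1 = 0.0, 0.0       # initial parameters
alpha = 0.1             # learning rate, specified manually
for _ in range(1000):
    y_hat = w0 + w1 * x
    # gradients of the mean squared loss with respect to w0 and w1
    grad_w0 = 2 * np.mean(y_hat - y)
    grad_w1 = 2 * np.mean((y_hat - y) * x)
    # step against the gradient (the descending direction)
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)           # should approach roughly 2 and 3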

  5. sklearn linear regression: normal equation and gradient descent APIs

● sklearn.linear_model.LinearRegression -- normal equation

           ● ordinary least squares linear regression

           ● coef_: regression coefficient

● sklearn.linear_model.SGDRegressor -- gradient descent

          ● fits the linear model by minimizing the loss with SGD (stochastic gradient descent)

          ● coef_: regression coefficient

  6. Linear regression example - Boston house price forecast

House price features: the Boston dataset contains 13 feature attributes (per-capita crime rate, average number of rooms, etc.), and the target value is the median house price.

Boston house price data set analysis process:

         1. Boston area house price data acquisition

         2. Boston area house price data segmentation

         3. Standardized processing of training and test data

         4. House prices are predicted with the simplest linear regression model (LinearRegression) and with gradient descent estimation (SGDRegressor)

 demo:

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

def mylinear():
    """
    Linear regression prediction of house price
    :return: None
    """
    # get data
    lb = load_boston()

    # The data is divided into training set and test set
    x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
    # print(y_train,y_test)

    # Standardize the feature values and the target values separately
    # Feature value standardization
    std_x = StandardScaler()
    x_train = std_x.fit_transform(x_train)
    x_test = std_x.transform(x_test)
    # Target value standardization: y_train and y_test are 1-D, but StandardScaler expects 2-D input, so reshape with reshape(-1, 1)
    std_y = StandardScaler()
    y_train = std_y.fit_transform(y_train.reshape(-1, 1))
    y_test = std_y.transform(y_test.reshape(-1, 1))

    #Algorithm model
    #Solution of normal equation and prediction results
    lr = LinearRegression()
    lr.fit(x_train,y_train)
    print(lr.coef_)
    #Forecast test set house price
    y_lr_predict = std_y.inverse_transform(lr.predict(x_test))
    print("The predicted price of each house in the normal equation test set:",y_lr_predict)
    print("Mean square error of normal equation:",mean_squared_error(std_y.inverse_transform(y_test),y_lr_predict))

    # Gradient descent
    sgd = SGDRegressor()
    # SGDRegressor expects a 1-D target, so ravel the (n, 1) array
    sgd.fit(x_train, y_train.ravel())
    print(sgd.coef_)
    # Forecast test set house prices (predict returns 1-D, reshape for inverse_transform)
    y_sgd_predict = std_y.inverse_transform(sgd.predict(x_test).reshape(-1, 1))
    print("Predicted price of each house in the gradient descent test set:", y_sgd_predict)
    print("Mean squared error of gradient descent:", mean_squared_error(std_y.inverse_transform(y_test), y_sgd_predict))
    return None


if __name__ == "__main__":
    mylinear()

Operation results:

 

Regression performance evaluation:

Mean squared error (MSE) evaluation mechanism:

● mean_squared_error(y_true,y_pred)

            ● mean square error regression loss

            ● y_true: true value

            ● y_pred: predicted value

            ● return: floating point number result

         Note: the true values and the predicted values passed in should be the values before standardization (i.e. after inverse_transform)
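A minimal usage sketch of mean_squared_error (toy numbers chosen only for illustration):

from sklearn.metrics import mean_squared_error

y_true = [24.0, 21.6, 34.7]    # true house prices (before standardization)
y_pred = [23.1, 22.4, 33.0]    # predicted house prices (before standardization)
print(mean_squared_error(y_true, y_pred))   # average of the squared differences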

  7. Over fitting and under fitting

Overfitting: a hypothesis fits the training data better than other hypotheses do, but cannot fit data outside the training set well; in that case the hypothesis is said to overfit. (The model is too complex.)

Underfitting: a hypothesis cannot fit the training data well, and also fails to fit data outside the training set; in that case the hypothesis is said to underfit. (The model is too simple.)

Causes and solutions of under fitting:

            ● cause: too few characteristics of data are learned

            ●   Solution: increase the number of features of the data

Over fitting causes and Solutions

            ●   Causes: there are too many original features, some of them noisy; the model becomes too complex because it tries to account for every training data point

            ●   Solutions: feature selection, eliminating highly correlated features (difficult to do); cross validation (so that all data takes part in training); regularization

  L2 regularization:

         Function: it can make each element of W very small and close to 0

         Advantages: the smaller the parameter, the simpler the model, and the simpler the model, the less likely it is to produce over fitting.
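As a sketch of the idea (not the sklearn implementation), L2 regularization adds a penalty on the squared weights to the ordinary least-squares loss; alpha below stands for the regularization strength:

import numpy as np

def ridge_loss(X, y, w, alpha):
    """Least-squares loss plus an L2 penalty on the weights (illustrative sketch only)."""
    residual = X @ w - y
    return np.sum(residual ** 2) + alpha * np.sum(w ** 2)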

8. Ridge regression - linear regression with L2 regularization

● sklearn.linear_model.Ridge(alpha=1.0)

            ● linear least squares with L2 regularization

            ● alpha: regularization strength (λ); values are commonly tried in the ranges 0~1 and 1~10

            ● coef_: regression coefficient

Note: the coefficients will not become exactly 0; they only get infinitely close to 0

demo:

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, SGDRegressor,Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

def ridge():
    """
    Linear regression prediction of house price
    :return: None
    """
    # get data
    lb = load_boston()

    # The data is divided into training set and test set
    x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
    # print(y_train,y_test)

    # Standardize the feature values and the target values separately
    # Feature value standardization
    std_x = StandardScaler()
    x_train = std_x.fit_transform(x_train)
    x_test = std_x.transform(x_test)
    # Target value standardization: y_train and y_test are 1-D, but StandardScaler expects 2-D input, so reshape with reshape(-1, 1)
    std_y = StandardScaler()
    y_train = std_y.fit_transform(y_train.reshape(-1, 1))
    y_test = std_y.transform(y_test.reshape(-1, 1))

    #Algorithm model
    #Ridge regression
    rd = Ridge(alpha=1.0)
    rd.fit(x_train, y_train)
    print(rd.coef_)
    # Forecast test set house price
    y_rd_predict = std_y.inverse_transform(rd.predict(x_test))
    print("Forecast price of each house in the test set:", y_rd_predict)
    print("Mean square error of ridge regression:", mean_squared_error(std_y.inverse_transform(y_test), y_rd_predict))
    return None


if __name__ == "__main__":
    ridge()

Operation results:

  Comparison between linear regression and ridge regression:

         ● Ridge regression: the regression coefficients obtained are more realistic and reliable; it also makes the estimated parameters fluctuate less and be more stable. It has great practical value when studying ill-conditioned data with many features.

9. Classification algorithm - logistic regression (solves binary classification problems)

Logistic regression formula (the sigmoid applied to the linear combination):

h(x) = g(w^T x) = 1 / (1 + e^(-w^T x))

Activation function sigmoid: maps the regression output into (0, 1), turning regression into classification
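A minimal sketch of the sigmoid mapping (toy inputs only):

import numpy as np

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])    # linear combination w^T x for three samples
print(sigmoid(z))                  # roughly [0.12, 0.5, 0.95]; above 0.5 is classified as the positive class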

10. Logistic regression loss function and optimization

The principle is the same as for linear regression, but because this is a classification problem the loss function is different, and it can only be solved by gradient descent.

Log-likelihood loss function (for a single sample):

cost(h_w(x), y) = -log(h_w(x))        if y = 1
cost(h_w(x), y) = -log(1 - h_w(x))    if y = 0

Complete loss function (summed over all m samples):

cost(h_w(x), y) = Σ_i [ -y_i * log(h_w(x_i)) - (1 - y_i) * log(1 - h_w(x_i)) ]

When y = 1, the classification result is 1 (positive): the loss drops toward 0 as the predicted value approaches 1 (horizontal axis: predicted value; vertical axis: loss)

When y = 0, the classification result is 0 (negative): the loss drops toward 0 as the predicted value approaches 0 (horizontal axis: predicted value; vertical axis: loss)

The log-likelihood loss may fall into a local optimum (the mean squared error of linear regression has a single minimum, so it has no local-optimum problem).
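A minimal sketch of how the log-likelihood loss behaves for a single sample (toy probabilities chosen only for illustration):

import numpy as np

def log_loss_single(p, y):
    """Log-likelihood loss of one sample: p is the predicted probability, y is the true label (0 or 1)."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(log_loss_single(0.9, 1))   # confident and correct -> small loss
print(log_loss_single(0.1, 1))   # confident and wrong   -> large loss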

11. Logistic regression API

● sklearn.linear_model.LogisticRegression(penalty='l2', C=1.0)

            ● Logistic regression classifier

            ● coef_: regression coefficient

  12. Logistic regression cases

Data address: https://archive.ics.uci.edu/ml/machine-learning-databases/

demo:

#Model API
from sklearn.linear_model import LogisticRegression
#Classification report API (precision / recall)
from sklearn.metrics import classification_report
#Split dataset API
from sklearn.model_selection import train_test_split
#Standardized API
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np


def logistic():
    """
    Logistic regression classification for tumor prediction
    :return:    None
    """
    # Construct column label name
    column = ['Sample code number', 'Clump Thickness',
              'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion',
              'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
              'Normal Nucleoli', 'Mitoses', 'Class']
    # Read data
    data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/"
                       "breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=column)
    # print(data)

    #Handling of missing values
    data = data.replace(to_replace="?",value=np.nan)
    #Delete missing values
    data = data.dropna()
    #Dataset segmentation
    x_train, x_test, y_train, y_test = train_test_split(data[column[1:10]], data[column[10]], test_size=0.25)
    #Standardized treatment
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    #Logistic regression prediction
    lg = LogisticRegression(C=1.0)
    lg.fit(x_train,y_train)
    print(lg.coef_)
    print("Prediction accuracy:",lg.score(x_test,y_test))
    # Precision and recall report
    y_predict = lg.predict(x_test)
    print("Precision and recall:", classification_report(y_test, y_predict, labels=[2, 4], target_names=["Benign", "Malignant"]))
    return None


if __name__ == "__main__":
    logistic()

Operation results:

  Summary (binary classification problems):

         Applications: advertising click-through rate prediction, disease diagnosis, financial fraud detection, fake-account detection

         Advantages: suitable for scenarios where a classification probability is needed; simple and fast

         Disadvantages: difficult to handle multi-class classification problems directly

  13. Generation model and discrimination model

The criterion for distinguishing the two is whether a prior probability is required (generative models need one; discriminative models do not).

14. Unsupervised learning K-means clustering algorithm

Algorithm flow:

  Steps:

         1. K points in the feature space are randomly chosen as the initial cluster centers

         2. For every other point, compute its distance to each of the K centers and assign it to the category of the nearest cluster center

         3. After all points have been assigned, recompute the new center point of each cluster (the mean of its points)

         4. If the new center points are the same as the previous ones, stop; otherwise repeat from step 2 (a minimal sketch of these steps follows)
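A minimal from-scratch sketch of these steps (toy 2-D data; k, the seed and the iteration count are chosen purely for illustration, and empty-cluster handling is omitted):

import numpy as np

np.random.seed(0)
X = np.random.rand(100, 2)                                  # toy 2-D points
k = 3
centers = X[np.random.choice(len(X), k, replace=False)]     # step 1: random initial centers

for _ in range(10):                                         # step 4 simplified: fixed number of iterations
    # step 2: assign every point to its nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute each center as the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centers)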

K-means API:

● sklearn.cluster.KMeans(n_clusters=8, init='k-means++')

            ● k-means clustering

            ● n_clusters: number of cluster centers to start with

            ● init: initialization method, defaults to 'k-means++'

            ● labels_: the cluster label assigned to each sample; it can be compared with the true categories, but the label numbers themselves carry no meaning

  Case demo:

# coding:utf-8
 
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
 
# Read data from four tables
prior = pd.read_csv('./data/order_products__prior.csv')
products = pd.read_csv('./data/products.csv')
orders = pd.read_csv('./data/orders.csv')
aisles = pd.read_csv('./data/aisles.csv')
# Merge the four tables into one (user - item category)
_mg = pd.merge(prior, products, on='product_id')
_mg = pd.merge(_mg, orders, on='order_id')
mt = pd.merge(_mg, aisles, on='aisle_id')
print(mt.head(10))
# Crosstab (special grouping tool)
cross = pd.crosstab(mt['user_id'], mt['aisle'])
# Principal component analysis
pca = PCA(n_components=0.9)
data = pca.fit_transform(cross)
print(data.shape)
# Reduce the number of samples
x = data[:500]
# Suppose that users are divided into four categories
km = KMeans(n_clusters=4)
km.fit(x)
predict = km.predict(x)
print(predict)
# Displays the results of the clustering
plt.figure(figsize=(10, 10))
# Create a list of four colors
colored = ['orange', 'green', 'blue', 'purple']
color = [colored[i] for i in predict]
plt.scatter(x[:, 1], x[:, 20], color=color)
plt.xlabel("1")
plt.ylabel("20")
plt.show()
print(silhouette_score(x, predict))

Output results:

   order_id  product_id  ...  days_since_prior_order  aisle
0         2       33120  ...                     8.0   eggs
1        26       33120  ...                     7.0   eggs
2       120       33120  ...                    10.0   eggs
3       327       33120  ...                     8.0   eggs
4       390       33120  ...                     9.0   eggs
5       537       33120  ...                     3.0   eggs
6       582       33120  ...                    10.0   eggs
7       608       33120  ...                    12.0   eggs
8       623       33120  ...                     3.0   eggs
9       689       33120  ...                     3.0   eggs
 
[10 rows x 14 columns]
(206209, 27)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 0 1 1 1 0 1 1
 1 1 1 3 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 2 1 1 1 0 1 1 1 1 1 1 1 1
 3 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 3 0 1 1 1 1 1 1 1 2 0 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 2 1 1 1 1 1 1 1 0 1 3 1 1 1 3 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1
 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 2
 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 3 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 0 0 1 1 1 1 1 1 1 0 3 1 1 1 1]

Cluster scatter diagram:

15. K-means clustering effect evaluation criteria

K-means--sklearn clustering effect evaluation API:

● sklearn.metrics.silhouette_score(X,labels)

             ● calculate the average silhouette coefficient of all samples

             ● X: feature values

             ● labels: the cluster labels (predicted values) produced by clustering
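For reference (a standard definition, not spelled out in the original notes), the silhouette coefficient of a single sample i is

s_i = (b_i - a_i) / max(a_i, b_i)

where a_i is the mean distance from sample i to the other samples in its own cluster and b_i is the mean distance to the samples of the nearest other cluster. Values close to 1 mean the sample sits well inside its cluster, values close to -1 mean it is probably assigned to the wrong cluster, and silhouette_score returns the average over all samples.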

  K-means summary:

Characteristics: an iterative algorithm that is intuitive, easy to understand and very practical.

Disadvantages: it easily converges to a local optimum (mitigated by running the clustering several times with different initializations)

Note: clustering is usually done before classification
