1, Using Excel for linear regression
1. Startup steps
Find "Data Analysis" on the Data tab of Excel, then start the regression tool. If there is no "Data Analysis" option, open the "File" tab, select "Add-ins", and click "Go", as follows:
Check the required tool (the Analysis ToolPak).
2. Data analysis
After importing the height-weight data set, take the first 20 rows for data analysis; the following results are obtained:
- Correlation coefficient (Multiple R): 0.570
- P-value: 0.005 < p < 0.01, so the regression is statistically significant.
The fitted linear regression equation is:
Next, look at the first 200 rows:
- Correlation coefficient (Multiple R): 0.556
- P-value: p < 0.01, so the regression is again statistically significant.
The fitted linear regression equation is:
The first 2000 rows of data:
- Correlation coefficient (Multiple R): 0.498
- P-value: p < 0.01, so the regression is again statistically significant.
The fitted linear regression equation is:
It can be seen that as more data is used, the correlation coefficient (and hence R^2) changes noticeably, dropping from 0.570 with 20 rows to 0.498 with 2000 rows.
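For reference, the quantities reported by Excel correspond to the standard simple-linear-regression formulas (standard textbook definitions, not taken from the Excel report itself); the hand-written least-squares code in the next section implements exactly these. With x the height, y the weight, and n observations:

$$
\hat{a}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},\qquad
\hat{b}=\bar{y}-\hat{a}\,\bar{x},\qquad
R=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\,\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
$$

Here "Multiple R" is |R| (for a single predictor it is simply the absolute correlation between height and weight), and R^2 = R·R is the coefficient of determination that Excel reports as "R Square".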
2, Jupyter programming
1. Data import
After downloading and installing the software, open Jupyter Notebook; the following interface will appear.
This window connects the web page to your local machine and must not be closed while Jupyter is in use.
The notebook page then opens automatically in the browser; import our data set there (note: click Upload after selecting the file).
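As a quick sanity check (a minimal sketch, assuming the file was uploaded under the name used in the code below), you can confirm that pandas can read the data set:

```python
import pandas as pd

# Read the uploaded Excel file (reading .xls files may additionally require the xlrd package)
p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')

# Inspect the first few rows and the column names ("Height" and "Weight" are used later)
print(p.head())
print(p.columns.tolist())
```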
2. Least-squares code without a third-party fitting library
Enter the following code in a new Python file:
```python
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # needed for the plot at the end

# Prepare the data
p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')
# Adjust as needed; here we take the first 20 rows of data
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]

# Mean values
x_mean = np.mean(x)
y_mean = np.mean(y)

# Total number of entries in the x (or y) column (i.e. n)
xsize = x.size

# Sums used by the least-squares formulas
zi = ((x - x_mean) * (y - y_mean)).sum()
mu = ((x - x_mean) * (x - x_mean)).sum()
n = ((y - y_mean) * (y - y_mean)).sum()

# Parameters a (slope) and b (intercept)
a = zi / mu
b = y_mean - a * x_mean

# Square of the correlation coefficient R
m = (zi / math.sqrt(mu * n)) ** 2

# Round the parameters to 4 decimal places
a = np.around(a, decimals=4)
b = np.around(b, decimals=4)
m = np.around(m, decimals=4)

print(f'Regression line equation: y = {a}x + ({b})')
print(f'The squared correlation coefficient R^2 is {m}')

# Draw the fitted line with matplotlib
y1 = a * x + b
plt.scatter(x, y)
plt.plot(x, y1, c='r')
plt.show()
```
The following are the test results.
Results for the first 20 rows:
Results for the first 200 rows:
Results for the first 2000 rows:
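The 200-row and 2000-row runs above differ only in the argument passed to p.head(). A small helper (the function name fit_ols_manual is ours, not from the original code) makes re-running the same least-squares computation for each row count explicit:

```python
import math

import numpy as np
import pandas as pd

def fit_ols_manual(p, n_rows):
    """Least-squares fit of Weight on Height using the first n_rows rows."""
    p1 = p.head(n_rows)
    x, y = p1["Height"], p1["Weight"]
    x_mean, y_mean = np.mean(x), np.mean(y)
    zi = ((x - x_mean) * (y - y_mean)).sum()
    mu = ((x - x_mean) ** 2).sum()
    n = ((y - y_mean) ** 2).sum()
    a = zi / mu                           # slope
    b = y_mean - a * x_mean               # intercept
    r2 = (zi / math.sqrt(mu * n)) ** 2    # squared correlation coefficient
    return round(a, 4), round(b, 4), round(r2, 4)

p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')
for n_rows in (20, 200, 2000):
    print(n_rows, fit_ols_manual(p, n_rows))
```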
3. Using sklearn
The code is as follows:
```python
# Import required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')
# Number of data rows to read
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]

# Data processing:
# sklearn expects two-dimensional arrays for fitting, so reshape the 1-D columns to 2-D
y = np.array(y).reshape(-1, 1)
x = np.array(x).reshape(-1, 1)

# Fitting
reg = LinearRegression()
reg.fit(x, y)
a = reg.coef_[0][0]     # coefficient (slope)
b = reg.intercept_[0]   # intercept
print('The fitted equation is: Y = %.4fX + (%.4f)' % (a, b))
c = reg.score(x, y)     # coefficient of determination R^2
print('The squared correlation coefficient R^2 is %.4f' % c)

# Visualization: predict on x to draw the fitted line
y1 = reg.predict(x)
plt.xlabel('height')
plt.ylabel('weight')
plt.scatter(x, y)
plt.plot(x, y1, c='r')
plt.show()
```
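A note on the reshape step: LinearRegression.fit expects the feature array to be two-dimensional (one row per sample, one column per feature). A tiny illustration with made-up height values:

```python
import numpy as np

heights = np.array([65.8, 71.5, 69.4])   # made-up sample values, shape (3,)
print(heights.shape)                      # -> (3,)

X = heights.reshape(-1, 1)                # -1 lets numpy infer the row count
print(X.shape)                            # -> (3, 1): one column, one row per sample
```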
Results for the first 20 rows:
Results for the first 200 rows:
Results for the first 2000 rows:
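As an optional cross-check (not part of the original write-up), scipy.stats.linregress fits the same simple regression and also reports the p-value, so all three row counts can be compared in one loop:

```python
import pandas as pd
from scipy import stats

p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')

for n_rows in (20, 200, 2000):
    p1 = p.head(n_rows)
    res = stats.linregress(p1["Height"], p1["Weight"])
    print(f'n={n_rows}: y = {res.slope:.4f}x + ({res.intercept:.4f}), '
          f'R^2 = {res.rvalue**2:.4f}, p = {res.pvalue:.4g}')
```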
3, Summary
The results of the three approaches are similar. Although Excel is convenient and easy to use, coding in Jupyter is more helpful for understanding the internal algorithm of linear regression and the related machine-learning concepts.