1, Using Excel for linear regression
1. Startup steps
Find "Data Analysis" on the Data tab of Excel, then start the regression tool. If there is no "Data Analysis" option, open the "File" tab, select "Add-ins", and click "Go", as follows:
Check the required tool (the Analysis ToolPak).
2. Data analysis
After importing the height-weight data set, take the first 20 rows for data analysis; the following results are obtained:
- Correlation coefficient (Multiple R): 0.570
- P-value: 0.005 < p < 0.01, so the regression is statistically significant.
The fitted linear regression equation is:
Next, look at the first 200 rows:
- Correlation coefficient (Multiple R): 0.556
- P-value: p < 0.01, so the regression is again statistically significant.
The fitted linear regression equation is:
The first 2000 rows of data:
- Correlation coefficient (Multiple R): 0.498
- P-value: p < 0.01, so the regression is again statistically significant.
The fitted linear regression equation is:
It can be seen that as more data is used, the correlation coefficient (and hence R^2) changes noticeably, dropping from 0.570 with 20 rows to 0.498 with 2000 rows.
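For reference, the quantities reported by Excel correspond to the standard simple-linear-regression formulas (standard textbook definitions, not taken from the Excel report itself); the hand-written least-squares code in the next section implements exactly these. With x the height, y the weight, and n observations:

$$
\hat{a}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},\qquad
\hat{b}=\bar{y}-\hat{a}\,\bar{x},\qquad
R=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\,\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
$$

Here "Multiple R" is |R| (for a single predictor it is simply the absolute correlation between height and weight), and R^2 = R·R is the coefficient of determination that Excel reports as "R Square".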
2, Jupyter programming
1. Data import
After downloading and installing the software, open Jupyter Notebook; the following interface will appear.
This window connects the web page to your local machine and must not be closed while Jupyter is in use.
The notebook page then opens automatically in the browser; import our data set there (note: click Upload after selecting the file).
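As a quick sanity check (a minimal sketch, assuming the file was uploaded under the name used in the code below), you can confirm that pandas can read the data set:

```python
import pandas as pd

# Read the uploaded Excel file (reading .xls files may additionally require the xlrd package)
p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')

# Inspect the first few rows and the column names ("Height" and "Weight" are used later)
print(p.head())
print(p.columns.tolist())
```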
2. Least-squares code without a third-party fitting library
Enter the following code in a new Python file:
```python
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # needed for the plot at the end

# Prepare the data
p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')
# Adjust as needed; here we take the first 20 rows of data
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]

# Mean values
x_mean = np.mean(x)
y_mean = np.mean(y)

# Total number of entries in the x (or y) column (i.e. n)
xsize = x.size

# Sums used by the least-squares formulas
zi = ((x - x_mean) * (y - y_mean)).sum()
mu = ((x - x_mean) * (x - x_mean)).sum()
n = ((y - y_mean) * (y - y_mean)).sum()

# Parameters a (slope) and b (intercept)
a = zi / mu
b = y_mean - a * x_mean

# Square of the correlation coefficient R
m = (zi / math.sqrt(mu * n)) ** 2

# Round the parameters to 4 decimal places
a = np.around(a, decimals=4)
b = np.around(b, decimals=4)
m = np.around(m, decimals=4)

print(f'Regression line equation: y = {a}x + ({b})')
print(f'The squared correlation coefficient R^2 is {m}')

# Draw the fitted line with matplotlib
y1 = a * x + b
plt.scatter(x, y)
plt.plot(x, y1, c='r')
plt.show()
```
The following are the test results.
Results for the first 20 rows:
Results for the first 200 rows:
Results for the first 2000 rows:
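The 200-row and 2000-row runs above differ only in the argument passed to p.head(). A small helper (the function name fit_ols_manual is ours, not from the original code) makes re-running the same least-squares computation for each row count explicit:

```python
import math

import numpy as np
import pandas as pd

def fit_ols_manual(p, n_rows):
    """Least-squares fit of Weight on Height using the first n_rows rows."""
    p1 = p.head(n_rows)
    x, y = p1["Height"], p1["Weight"]
    x_mean, y_mean = np.mean(x), np.mean(y)
    zi = ((x - x_mean) * (y - y_mean)).sum()
    mu = ((x - x_mean) ** 2).sum()
    n = ((y - y_mean) ** 2).sum()
    a = zi / mu                           # slope
    b = y_mean - a * x_mean               # intercept
    r2 = (zi / math.sqrt(mu * n)) ** 2    # squared correlation coefficient
    return round(a, 4), round(b, 4), round(r2, 4)

p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')
for n_rows in (20, 200, 2000):
    print(n_rows, fit_ols_manual(p, n_rows))
```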
3. Using sklearn
The code is as follows:
```python
# Import required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')
# Number of data rows to read
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]

# Data processing:
# sklearn expects two-dimensional arrays for fitting, so reshape the 1-D columns to 2-D
y = np.array(y).reshape(-1, 1)
x = np.array(x).reshape(-1, 1)

# Fitting
reg = LinearRegression()
reg.fit(x, y)
a = reg.coef_[0][0]     # coefficient (slope)
b = reg.intercept_[0]   # intercept
print('The fitted equation is: Y = %.4fX + (%.4f)' % (a, b))
c = reg.score(x, y)     # coefficient of determination R^2
print('The squared correlation coefficient R^2 is %.4f' % c)

# Visualization: predict on x to draw the fitted line
y1 = reg.predict(x)
plt.xlabel('height')
plt.ylabel('weight')
plt.scatter(x, y)
plt.plot(x, y1, c='r')
plt.show()
```
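A note on the reshape step: LinearRegression.fit expects the feature array to be two-dimensional (one row per sample, one column per feature). A tiny illustration with made-up height values:

```python
import numpy as np

heights = np.array([65.8, 71.5, 69.4])   # made-up sample values, shape (3,)
print(heights.shape)                      # -> (3,)

X = heights.reshape(-1, 1)                # -1 lets numpy infer the row count
print(X.shape)                            # -> (3, 1): one column, one row per sample
```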
Results for the first 20 rows:
Results for the first 200 rows:
Results for the first 2000 rows:
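As an optional cross-check (not part of the original write-up), scipy.stats.linregress fits the same simple regression and also reports the p-value, so all three row counts can be compared in one loop:

```python
import pandas as pd
from scipy import stats

p = pd.read_excel('weights_heights(height-Weight data set).xls', 'weights_heights')

for n_rows in (20, 200, 2000):
    p1 = p.head(n_rows)
    res = stats.linregress(p1["Height"], p1["Weight"])
    print(f'n={n_rows}: y = {res.slope:.4f}x + ({res.intercept:.4f}), '
          f'R^2 = {res.rvalue**2:.4f}, p = {res.pvalue:.4g}')
```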
3, Summary
The results of the three approaches are similar. Although Excel is convenient and easy to use, coding in Jupyter is more helpful for understanding the internal algorithm of linear regression and the related machine-learning concepts.