[Reading notes] Head First Data Analysis (10)

Keywords: calculator Session REST

This chapter continues the previous one and is mainly about regression. It designs a "salary increase calculator": an algorithm that predicts the raise you will receive from the raise you ask for.

We use the same data as before. First, let's draw a histogram of the raises requested and a histogram of the raises actually received:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('./hfda_data/hfda_ch10_employees.csv')

# Keep only the employees who negotiated
negotiated = df[df.negotiated == True]
bins = np.arange(0, 25, 0.5)

# Histogram of column 1 (raise received)
plt.figure(1)
plt.hist(negotiated.iloc[:, 1], bins=bins, histtype='bar', facecolor='blue', edgecolor='black')

# Histogram of column 2 (raise requested)
plt.figure(2)
plt.hist(negotiated.iloc[:, 2], bins=bins, histtype='bar', facecolor='blue', edgecolor='black')

plt.show()

The two histograms look very similar, but they do not pair each request with the raise that was actually received, so the relationship between the two is still unknown.

So let's use a scatter plot to see how the two are related:

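import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./hfda_data/hfda_ch10_employees.csv')
negotiated = df[df.negotiated == True]

# x: raise requested (column 2), y: raise received (column 1)
plt.figure(1)
plt.scatter(negotiated.iloc[:, 2], negotiated.iloc[:, 1],
            c='b', s=20, linewidths=0.5, marker='o', edgecolors='black')

plt.show()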

Here we need linear regression. The book uses R; I use gradient descent to fit a single-variable regression instead. (You can also try other libraries.)

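# Written against the TensorFlow 1.x graph API (placeholders and sessions).
# On TensorFlow 2.x, use `import tensorflow.compat.v1 as tf` and call
# `tf.disable_v2_behavior()` before building the graph.
import pandas as pd
import tensorflow as tf
import numpy as np

df = pd.read_csv('./hfda_data/hfda_ch10_employees.csv')
negotiated = df[df.negotiated == True]

# Column vectors of shape (n, 1): X = raise requested, Y = raise received
X = np.array(negotiated.iloc[:, 2])[:, np.newaxis]
Y = np.array(negotiated.iloc[:, 1])[:, np.newaxis]

xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])

# Model parameters: slope (random start with mean -1, stddev 1) and intercept
Weights = tf.Variable(tf.random_normal([1], -1, 1))
biases = tf.Variable(tf.zeros([1]) + 0.1)

Wx_b = xs * Weights + biases

# Mean squared error between the observed and the predicted raises
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - Wx_b), axis=1))

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)

for i in range(5000):
    sess.run(train_step, feed_dict={xs: X, ys: Y})
    if i % 50 == 0:
        # Print the current slope and intercept every 50 steps
        print(i, sess.run(Weights), sess.run(biases))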

The result in the book is 0.7x+2.3. My result is 0.72507244x+2.3120737, which is essentially the same.
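To see what the optimizer is doing without TensorFlow, the same kind of fit can be sketched in plain NumPy and compared with a closed-form least-squares fit. This is only an illustration: the data is synthetic (generated around the book's line 0.7x+2.3 with made-up noise, not read from the CSV), and the function name `fit_line_gd` and its parameters are my own. Swap in the X and Y arrays from the code above to compare against the real numbers.

```python
import numpy as np


def fit_line_gd(x, y, lr=0.01, steps=5000):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y               # residuals of the current line
        grad_w = 2.0 * np.mean(err * x)   # d(MSE)/dw
        grad_b = 2.0 * np.mean(err)       # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic stand-in data (an assumption, NOT the book's dataset).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 0.7 * x + 2.3 + rng.normal(0.0, 1.0, size=200)

w_gd, b_gd = fit_line_gd(x, y)

# Closed-form least squares for comparison; polyfit returns [slope, intercept].
w_ls, b_ls = np.polyfit(x, y, 1)

print("gradient descent:", w_gd, b_gd)
print("least squares:   ", w_ls, b_ls)
```

On this synthetic data the two fits should agree closely, which suggests that a library routine such as `np.polyfit` is a reasonable shortcut when you do not need to watch the training steps.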

Finally, let's verify the result by hand:

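import pandas as pd

df = pd.read_csv('./hfda_data/hfda_ch10_employees.csv')
df1 = df[df.negotiated == True]
# Restrict to numeric columns: recent pandas versions raise an error
# when corr() meets non-numeric columns
print(df1.select_dtypes(include='number').corr())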

We can see that the correlation coefficient is 0.665648. Next, we use:

print(df1.describe())

to find the standard deviation of each of the two columns (the std row of the output).

The slope is then r * std(y) / std(x), where r is the correlation coefficient, std(y) is the standard deviation of the raise actually received, and std(x) is the standard deviation of the raise requested.
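This by-hand check can be sketched in a few lines of NumPy. The arrays are synthetic stand-ins rather than the book's data, and the variable names are mine. The intercept formula mean(y) - slope * mean(x) is not stated in the text above; it is the standard least-squares companion to the slope formula, added here so the whole line can be compared.

```python
import numpy as np

# Synthetic stand-in data (an assumption, NOT the book's dataset):
# x plays the role of the raise requested, y the raise received.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=200)
y = 0.7 * x + 2.3 + rng.normal(0.0, 1.0, size=200)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Sample standard deviations (ddof=1), the convention pandas describe() uses
slope = r * np.std(y, ddof=1) / np.std(x, ddof=1)
intercept = np.mean(y) - slope * np.mean(x)

# Ordinary least squares for comparison; polyfit returns [slope, intercept].
ls_slope, ls_intercept = np.polyfit(x, y, 1)

print("by hand:      ", slope, intercept)
print("least squares:", ls_slope, ls_intercept)
```

The two lines should match up to floating-point rounding, because r * std(y) / std(x) is algebraically the same as the least-squares slope cov(x, y) / var(x).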

 

Our painstakingly designed "salary increase calculator" did not work out as planned, but this chapter ends here, and the remaining questions are left for the next chapter.

Posted by kparish on Mon, 30 Dec 2019 23:10:32 -0800