Visual analysis of Lagou job data with Python 3: a detailed walkthrough

Keywords: Python encoding Attribute Windows

This article walks through a visual analysis of job-posting data scraped from Lagou, using Python 3. The example code is explained in detail and should be a useful reference for anyone learning or using Python 3.

Preface

Last time we covered how to scrape the data from Lagou. Now that we have the data, let's not leave it sitting there: we will analyze it and see what information it contains.

Article directory

1, Preliminary preparation
2, Pretreatment
3, Visual analysis
4, Achievements and summary

1, Preliminary preparation

Because the data we scraped last time contains fields such as ID that are irrelevant to the analysis, we first remove them and then check the descriptive statistics to confirm whether there are any abnormal or missing values.

import pandas as pd

read_file = "analyst.csv"
# Read the file to get the data (the CSV is GBK-encoded)
data = pd.read_csv(read_file, encoding="gbk")
# Remove columns that are irrelevant to the analysis
data = data.drop(columns=['ID'])
# Descriptive statistics
print(data.describe())

In the result, unique is the number of distinct values in a column. For example, the education requirement column contains four distinct values [undergraduate, junior college, master, unlimited]. top is the most frequent value [undergraduate], and freq is the number of times it occurs, here 387. Since salary has many unique values, let's print the distinct values of each column.

print(data['Academic requirements'].unique())
print(data['Hands-on background'].unique())
print(data['salary'].unique())
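
To make the meaning of unique, top and freq concrete, here is a minimal sketch on a tiny hand-made table. The rows and the column name education are invented for illustration and are not taken from analyst.csv.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only
sample = pd.DataFrame({
    'education': ['undergraduate', 'undergraduate', 'master', 'unlimited'],
})

# For text columns, describe() reports count, unique, top and freq
summary = sample.describe()
print(summary)

# unique: number of distinct values
n_unique = summary.loc['unique', 'education']
# top: the most frequent value
top_value = summary.loc['top', 'education']
# freq: how many times the top value occurs
top_freq = summary.loc['freq', 'education']
```

Here the column has three distinct values, and undergraduate is the most frequent one with two occurrences.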

2, Pretreatment

The output above shows that education requirements and work experience each take only a few distinct values, with no missing or abnormal values, so they can be analyzed directly. Salary, however, has more than 75 distinct values, so it needs preprocessing before it can be analyzed properly. Based on its distribution, salary can be divided into the ranges [below 5k, 5k-10k, 10k-20k, 20k-30k, 30k-40k, above 40k]. To simplify the analysis, we take the midpoint of each posted salary range and assign it to one of these ranges.

# Preprocess salary: count how many postings fall into each salary range
def pre_salary(data):
    salarys = data['salary'].values
    salary_dic = {}
    for salary in salarys:
        # Split on '-', strip the trailing 'k', and convert both ends to integers
        min_sa = int(salary.split('-')[0][:-1])
        max_sa = int(salary.split('-')[1][:-1])
        # Midpoint of the posted range
        median_sa = (min_sa + max_sa) / 2
        # Assign the midpoint to a range. A value exactly on a boundary
        # (5, 10, 20, 30) goes to the higher range rather than falling
        # through to the last branch.
        if median_sa < 5:
            salary_dic[u'Below 5k'] = salary_dic.get(u'Below 5k', 0) + 1
        elif median_sa < 10:
            salary_dic[u'5k-10k'] = salary_dic.get(u'5k-10k', 0) + 1
        elif median_sa < 20:
            salary_dic[u'10k-20k'] = salary_dic.get(u'10k-20k', 0) + 1
        elif median_sa < 30:
            salary_dic[u'20k-30k'] = salary_dic.get(u'20k-30k', 0) + 1
        elif median_sa < 40:
            salary_dic[u'30k-40k'] = salary_dic.get(u'30k-40k', 0) + 1
        else:
            salary_dic[u'Above 40k'] = salary_dic.get(u'Above 40k', 0) + 1
    print(salary_dic)
    return salary_dic
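
One thing worth checking in a binning rule like this is what happens when a midpoint lands exactly on a boundary such as 10 or 20. The sketch below expresses the same idea with the standard library's bisect module. The names EDGES, LABELS and salary_bucket, and the sample strings, are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges in thousands; a midpoint equal to an edge goes to the higher bin
EDGES = [5, 10, 20, 30, 40]
LABELS = ['Below 5k', '5k-10k', '10k-20k', '20k-30k', '30k-40k', 'Above 40k']

def salary_bucket(salary):
    """Map a string such as '10k-20k' to the label of the range holding its midpoint."""
    low, high = (int(part[:-1]) for part in salary.split('-'))
    midpoint = (low + high) / 2
    return LABELS[bisect_right(EDGES, midpoint)]

# Hypothetical salary strings, invented for illustration only
sample = ['3k-5k', '8k-12k', '15k-25k', '40k-60k']
counts = Counter(salary_bucket(s) for s in sample)
print(counts)
```

With these inputs, '8k-12k' has a midpoint of exactly 10 and is counted under 10k-20k, and '15k-25k' has a midpoint of exactly 20 and is counted under 20k-30k.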

After preprocessing salary, we preprocess the text of the job requirements. To build a word cloud, we need to segment the text into words and remove words that appear frequently but carry no meaning, known as stop words. We use the jieba library for this. jieba is a word segmentation library implemented in Python that handles Chinese text well.

import jieba

def cut_text(text):
    stopwords = ['be familiar with','technology','position','Relevant','work','Development','Use','ability',
                 'first','describe','Serving','experience','Experienced person','Have','Have','Above','be good at',
                 'one kind','as well as','Certain','Conduct','Can','We']
    # Segment the text, then drop stop words and whitespace-only tokens.
    # (jieba.del_word only removes an entry from jieba's dictionary;
    # it does not remove the word from the segmented output.)
    words = [word for word in jieba.lcut(text)
             if word.strip() and word not in stopwords]
    content = " ".join(words)
    return content
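
The stop-word step itself does not depend on the segmentation library: given a list of tokens (such as the list jieba.lcut returns), it is a plain filter. A minimal sketch follows, in which remove_stopwords, the token list and the two stop words are hypothetical examples rather than the article's data.

```python
def remove_stopwords(words, stopwords):
    """Return the tokens that are neither stop words nor whitespace-only."""
    stopset = set(stopwords)  # set membership tests are fast
    return [w for w in words if w.strip() and w not in stopset]

# Hypothetical tokens, standing in for the output of a segmenter
tokens = ['熟悉', 'Python', '和', '数据', '分析', ' ', '熟悉', 'SQL']
content = " ".join(remove_stopwords(tokens, ['熟悉', '和']))
print(content)
```

The joined string is what a word cloud generator would then receive as its input text.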

After the preprocessing, the visual analysis can be carried out.

3, Visual analysis

Let's first define the functions that draw the ring chart and the bar chart, and then pass the data in. The code for the ring chart is as follows:

import numpy as np
import matplotlib.pyplot as plt

def draw_pie(dic):
    labels = []
    count = []

    for key, value in dic.items():
        labels.append(key)
        count.append(value)

    fig, ax = plt.subplots(figsize=(8, 6), subplot_kw=dict(aspect="equal"))

    # Draw the pie chart; the width in wedgeprops turns the pie into a ring
    wedges, texts = ax.pie(count, wedgeprops=dict(width=0.5), startangle=0)
    # Text box settings
    bbox_props = dict(boxstyle="square,pad=0.9", fc="w", ec="k", lw=0)
    # Line and arrow settings
    kw = dict(xycoords='data', textcoords='data', arrowprops=dict(arrowstyle="-"),
              bbox=bbox_props, zorder=0, va="center")

    for i, p in enumerate(wedges):
        # Angle at the middle of the wedge and the matching point on the unit circle
        ang = (p.theta2 - p.theta1)/2. + p.theta1
        y = np.sin(np.deg2rad(ang))
        x = np.cos(np.deg2rad(ang))
        # Set which side of the wedge the text box is on
        horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
        # Set the bend of the connecting line
        connectionstyle = "angle,angleA=0,angleB={}".format(ang)
        kw["arrowprops"].update({"connectionstyle": connectionstyle})
        # Label text: category name and its percentage share
        text = "{}: {:.2f}%".format(labels[i], count[i] / sum(count) * 100)
        # annotate() labels the figure: xy is the annotated point, xytext is the text position
        ax.annotate(text, size=13, xy=(x, y), xytext=(1.35*np.sign(x), 1.4*y),
                    horizontalalignment=horizontalalignment, **kw)
    plt.show()
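
The geometry behind the labels can be checked without drawing anything. The sketch below recomputes, for each wedge, the mid-angle point on the unit circle, the alignment of the label and the percentage text, using only the standard library. The function name ring_labels and the sample counts are hypothetical, and the sketch assumes wedges are laid out counterclockwise from the start angle, which is matplotlib's default.

```python
import math

def ring_labels(dic, start_angle=0.0):
    """For each wedge, return (label text, x, y, alignment) at its mid-angle."""
    total = sum(dic.values())
    result = []
    theta1 = start_angle
    for key, value in dic.items():
        # Each wedge spans an angle proportional to its share of the total
        theta2 = theta1 + 360.0 * value / total
        mid = math.radians((theta1 + theta2) / 2)
        x, y = math.cos(mid), math.sin(mid)
        # Labels on the right half are left-aligned, labels on the left half right-aligned
        align = "left" if x >= 0 else "right"
        text = "{}: {:.2f}%".format(key, 100.0 * value / total)
        result.append((text, round(x, 3), round(y, 3), align))
        theta1 = theta2
    return result

# Hypothetical counts, invented for illustration only
for item in ring_labels({'undergraduate': 3, 'master': 1}):
    print(item)
```

A wedge covering three quarters of the ring ends up with its mid-angle at 135 degrees, so its label sits on the left side and is right-aligned.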

The code for the bar chart is as follows:

import numpy as np
import matplotlib.pyplot as plt

def draw_workYear(data):
    workyears = list(data[u'Hands-on background'].values)
    wy_dic = {}
    labels = []
    count = []
    # Count how many postings ask for each level of work experience
    for workyear in workyears:
        wy_dic[workyear] = wy_dic.get(workyear, 0) + 1
    print(wy_dic)
    # Collect the keys and values into labels and count respectively
    for key, value in wy_dic.items():
        labels.append(key)
        count.append(value)
    # Bar positions 1..n, one per label
    x = np.arange(len(labels)) + 1
    # Convert the counts to an array
    y = np.array(count)

    fig, axes = plt.subplots(figsize=(10, 8))
    axes.bar(x, y, color="#1195d0")
    plt.xticks(x, labels, size=13, rotation=0)
    plt.xlabel(u'Work experience', fontsize=15)
    plt.ylabel(u'Number', fontsize=15)

    # Write each count above its bar; ha and va set the text alignment
    for a, b in zip(x, y):
        plt.text(a, b+1, '%.0f' % b, ha='center', va='bottom', fontsize=12)
    plt.show()
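
The counting part of this function can also be written with collections.Counter from the standard library, which replaces the manual dictionary bookkeeping. This is only an alternative sketch, and the sample values are invented for illustration.

```python
from collections import Counter

# Hypothetical work-experience values, invented for illustration only
experience = ['1-3 years', '3-5 years', '1-3 years', 'unlimited', '3-5 years', '1-3 years']

# Counter tallies each distinct value in one pass
counts = Counter(experience)

# most_common() returns (value, count) pairs sorted from most to least frequent
labels, values = zip(*counts.most_common())
print(labels)
print(values)
```

Sorting by frequency is a side benefit here: the bars come out in descending order, which is often easier to read than dictionary insertion order.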

Next, we convert the education requirement and salary data into dictionaries and pass them to the ring chart function. We also need to visualize the text of the job requirements as a word cloud.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Draw word cloud
def draw_wordcloud(content):
    wc = WordCloud(
        font_path=r'c:\Windows\Fonts\msyh.ttf',  # Font with Chinese glyphs (Microsoft YaHei)
        background_color='white',
        max_font_size=150,  # Maximum font size
        min_font_size=24,  # Minimum font size
        random_state=800,  # Random seed, so the layout is reproducible
        collocations=False,  # Do not count two-word phrases, which avoids repeated words
        width=1600, height=1200, margin=35,  # Image width and height, word spacing
    )
    wc.generate(content)

    plt.figure(dpi=160)  # Figure resolution
    plt.imshow(wc, interpolation='catrom')
    plt.axis("off")  # Hide the axes
    plt.show()
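
The step of turning a column into a dictionary for the ring chart is not shown in the code above. One possible way to do it is with pandas value_counts(), sketched below on a few invented rows. The column name follows the one used earlier in this article; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only
data = pd.DataFrame({
    'Academic requirements': ['undergraduate', 'master', 'undergraduate', 'junior college'],
})

# value_counts() counts each distinct value; to_dict() gives {value: count}
edu_dic = data['Academic requirements'].value_counts().to_dict()
print(edu_dic)

# The dictionaries can then be handed to the drawing functions defined earlier, e.g.
# draw_pie(edu_dic)
# draw_pie(pre_salary(data))
```

The salary dictionary needs no extra conversion, since pre_salary already returns one.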


4, Achievements and summary

Most of the education requirements for Python data analysts are undergraduate, accounting for 86%.

As can be seen from the bar chart, most Python data analyst positions require 1-5 years of work experience.

The salary ring chart shows that salaries of 10k-20k are the most common for Python data analysis positions, and that a considerable number of positions fall in the above-40k range. The higher-paying positions presumably come with higher requirements, so let's take a look at the job requirements.

The word cloud shows that data analysts are expected to be sensitive to data and to have some command of statistics, Excel, Python, data mining, Hadoop and so on. Beyond that, postings also ask for resilience under pressure, problem-solving ability, good communication skills, analytical thinking and so on.