Analysis report of recruitment information data

Keywords: Data Analysis

1. Question Settings

1.   Clean column G (expected salary) and bin it into: None, 2000-3999, 4000-5999, 6000-7999, 8000-9999, and 10,000 or more.

2.   Clean column AA (skill/language) into three categories: contains all of Photoshop, CorelDRAW, and AI; contains one or two of them; contains none of them.

3.   Using the cleaned data in columns G, AA, E, and X, classify the applicants into 4-5 categories by degree of matching, and try to interpret the classification results verbally.

2. Getting data

2.1 Importing data

# Import data
import pandas as pd
df = pd.read_excel('data.xlsx')

2.2 Viewing data

# View data types and total data for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1430 entries, 0 to 1429
Data columns (total 27 columns):
Resume number 1430 non-null int64
 Applicant position 1430 non-null object
 Gender 1430 non-null object
 Birth date 1430 non-null datetime64[ns]
Age 1430 non-null int64
 Residence 1430 non-null object
 Expected salary 1407 non-null object
 Location 1409 non-null object
 Function/position 1430 non-null object
 Industry 1149 non-null object
 Working type 1430 non-null object
 School 1430 non-null object
 Type 1430 non-null object
 Professional 1420 non-null object
 Working experience 1348 non-null object
 Working time 1430 non-null object
 Company name 1430 non-null object
 Company size 1430 non-null object
 Company Industry 1430 non-null object
 Property 1430 non-null object
 Department 1430 non-null object
 Position 1430 non-null object
 Number of working companies 1430 non-null int64
 Working life 1430 non-null object
 Working stability 1430 non-null int64
 Is stable 1430 non-null object
 Skill/Language 578 non-null object
dtypes: datetime64[ns](1), int64(4), object(22)
memory usage: 301.7+ KB

The query shows 1430 rows and 27 columns; six columns (Expected salary, Location, Industry, Professional, Working experience, and Skill/Language) contain missing values.
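The columns with missing values can be listed directly from the null counts; a minimal self-contained sketch (a toy frame stands in for the report's `df`):

```python
import pandas as pd

# Toy frame standing in for df; in the report, apply this to the real `df`
df = pd.DataFrame({
    'Expected salary': ['3000-4000yuan/month', None, '5000-6000yuan/month'],
    'Industry': [None, None, 'IT'],
    'Age': [25, 30, 28],
})

# Count missing values per column and keep only columns that have any
missing = df.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

This gives the per-column missing counts without reading them off `df.info()` by hand.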

3. Data Processing

3.1 Question 1:

Clean column G (expected salary) and bin it into: None, 2000-3999, 4000-5999, 6000-7999, 8000-9999, and 10,000 or more.

3.1.1 Duplicate Value Processing

# Remove duplicate values
print('Size before deleting duplicate values',df.shape)

# Delete duplicate records
df = df.drop_duplicates()

print('Size after deleting duplicate values',df.shape)
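A direct way to confirm whether any rows repeat is to count them explicitly; a minimal sketch on a toy frame (not the report's data):

```python
import pandas as pd

# Toy frame in which the third row duplicates the second
df = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', 'y', 'y']})

n_dups = df.duplicated().sum()   # rows that repeat an earlier row
print(n_dups)                    # 0 would mean drop_duplicates() is a no-op

deduped = df.drop_duplicates()
print(deduped.shape)
```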

 

The shape is unchanged, indicating that the dataset contains no duplicate rows.

3.1.2 Missing Value Handling

According to the df.info() output, Expected salary (column G) is missing 1430 - 1407 = 23 values, a missing rate of 23/1430 ≈ 1.6%.

A missing value in column G simply means no salary expectation was given, which corresponds to the 'None' category of Question 1, so the NaN values are filled with a placeholder and no missing values remain:

new_data = df.fillna({'Salary Expectation':'nothing'})
new_data['Salary Expectation']

3.1.3 Unit Unification

First look at each unit type:

# Inspect the trailing characters to see which unit strings occur
print(new_data['Salary Expectation'].str[-6:].value_counts())

The query shows five unit types: yuan/month; ten thousand yuan/year; none; ten thousand yuan or less/year; yuan or less/month.

print(new_data['Salary Expectation'].str[-10:].value_counts())
# Normalise the two irregular formats into the regular 'low-high unit' form
new_data.loc[new_data['Salary Expectation'].str.endswith('ten thousand yuan or less/year'),
             'Salary Expectation'] = '0-2ten thousand yuan/year'
new_data.loc[new_data['Salary Expectation'].str.endswith('yuan or less/month'),
             'Salary Expectation'] = '0-1500yuan/month'

This leaves four unit types: yuan/month, ten thousand yuan/year, yuan/year, and none.

# Unify the four unit types into an average monthly salary (yuan/month)
import re
def salary_p(salary):
    if salary == 'nothing':
        return 0
    elif re.search('yuan/month', salary):
        # e.g. s = [('1000', '10000')] for '1000-10000yuan/month'
        s = re.findall(r'(\d+)-(\d+)yuan/month', salary)
        mi, ma = int(s[0][0]), int(s[0][1])
        return (mi + ma) * 0.5
    elif re.search('ten thousand yuan/year', salary):
        s = re.findall(r'(\d+)-(\d+)ten thousand yuan/year', salary)
        mi, ma = int(s[0][0]), int(s[0][1])
        return (mi + ma) * 0.5 * 10000 / 12
    elif re.search('yuan/year', salary):
        s = re.findall(r'(\d+)-(\d+)yuan/year', salary)
        mi, ma = int(s[0][0]), int(s[0][1])
        return (mi + ma) * 0.5 / 12
    else:
        return -1

# New column holding the average monthly salary
new_data['mean_salary'] = new_data['Salary Expectation'].apply(salary_p)

# Bin the average monthly salary into the required classes
bins = [0, 2000, 4000, 6000, 8000, 10000, 1000000]
labels = ['0-2000', '2000-3999', '4000-5999', '6000-7999', '8000-9999', '>10000']
new_data['salary_class'] = pd.cut(new_data['mean_salary'], bins, labels=labels, right=False)
new_data['salary_class'].value_counts()

The results are as follows:

 

 
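The binning step can be checked on a few hand-made monthly salaries; a minimal sketch using the same bins and labels as above:

```python
import pandas as pd

bins = [0, 2000, 4000, 6000, 8000, 10000, 1000000]
labels = ['0-2000', '2000-3999', '4000-5999', '6000-7999', '8000-9999', '>10000']

# 0 stands for "no expectation given"; right=False makes each bin
# left-closed, so 4000 falls in '4000-5999', not '2000-3999'
sample = pd.Series([0, 1500, 4000, 9999, 25000])
cls = pd.cut(sample, bins, labels=labels, right=False)
print(cls.tolist())
```

Note that with these bins the '0-2000' class also absorbs the 0 placeholder that stands for 'None'.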

3.2   Question 2:

Clean column AA (skill/language) into three categories: contains all of Photoshop, CorelDRAW, and AI; contains one or two of them; contains none of them.

3.2.1 Missing Value Handling

new_data['Skill/Language'].fillna('nothing', inplace=True)
new_data.info()

3.2.2   Skill keyword processing

new_data['skill'] = new_data['Skill/Language'].str.findall(r'[a-zA-Z0-9\s#.\+/]+')

# Strip whitespace around each skill token and convert to lowercase
new_data['skill2'] = new_data['skill'].apply(
    lambda tokens: ",".join(t.strip().lower() for t in tokens))
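The extraction and normalisation above can be illustrated on a single raw cell; a minimal self-contained sketch (the sample string is invented):

```python
import re

# Hypothetical raw Skill/Language cell mixing Latin tokens with Chinese text
raw = 'Photoshop精通 CorelDRAW良好'

# Pull out the Latin-character runs, then strip and lowercase each token
tokens = re.findall(r'[a-zA-Z0-9\s#.\+/]+', raw)
cleaned = ",".join(t.strip().lower() for t in tokens)
print(cleaned)
```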

3.2.3   Skill Statistics

# Count how many of the three target skills each applicant lists
new_data['skill_class'] = 0
s = ['photoshop', 'corel draw', 'ai']
for i in new_data['skill2'].index:
    skill = new_data.loc[i, 'skill2']
    b = 0
    for j in s:
        if j in skill:
            b += 1
    new_data.loc[i, 'skill_class'] = b

new_data['skill_class'].value_counts()

new_data['skill_class'].map({0: 'Contains none', 1: 'Contains one or two',
                             2: 'Contains one or two', 3: 'Contains all three'})

The results are as follows:

 

3.2.4 Visual Display

import matplotlib.pyplot as plt

# Labels for the pie chart: number of target skills listed
labels = ['0', '1 or 2', '3']

x = [1120, 307, 3]
explode = (0.3, 0, 0.1)

# Draw pie chart
plt.pie(x,labels=labels,autopct='%.0f%%',explode=explode)
plt.axis('equal')
plt.show()

 

The results show that 78% of applicants list none of the three skills, 21% list one or two of them, and roughly 0% (3 applicants) list all three.
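The substring test above has a subtle pitfall: 'ai' also matches inside words such as 'email'. A set intersection over the comma-separated tokens avoids this; a sketch of the alternative, not the report's method:

```python
targets = {'photoshop', 'coreldraw', 'ai'}

def count_targets(skill2: str) -> int:
    """Count target skills among comma-separated, lowercased tokens."""
    tokens = {t.strip() for t in skill2.split(',')}
    return len(targets & tokens)

print(count_targets('photoshop,coreldraw,office'))  # 2
print(count_targets('email,java'))                  # 0: 'ai' in 'email' does not count
```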

3.3   Question 3:

Using the cleaned data in columns G, AA, E, and X, classify the applicants into 4-5 categories by degree of matching, and try to interpret the classification results verbally.

3.3.1   Data cleaning

After the cleaning above, columns G, AA, and E are already numeric; column X (working life) still needs cleaning.

def work_proc(work_y):
    if work_y == 'No work experience yet':
        return 0
    elif re.search('year', work_y):
        # e.g. s = ['8'] for an '8year' value; convert to int
        s = re.findall(r'(\d+)year', work_y)
        return int(s[0])

new_data['work_years'] = new_data['Working life'].apply(work_proc)
new_data['work_years']

# Salary is represented by the cleaned average monthly pay; rename Age for consistency
new_data = new_data.rename(columns={'Age': 'age'})
new_data2 = new_data[['work_years', 'age', 'mean_salary', 'skill_class']]
new_data2
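The work-experience parser can be exercised on a couple of sample strings; a minimal self-contained sketch (the exact phrasing of the strings follows the translated column values and is an assumption):

```python
import re

def parse_work_years(text: str) -> int:
    # "No work experience yet" -> 0; strings like "8year" -> 8
    if text == 'No work experience yet':
        return 0
    m = re.search(r'(\d+)year', text)
    return int(m.group(1)) if m else 0

print(parse_work_years('No work experience yet'))  # 0
print(parse_work_years('8year'))                   # 8
```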

3.3.2   Modeling Analysis

# Introducing kmeans algorithm model
from sklearn.cluster import KMeans

# Call KMeans algorithm for clustering, set to 4 categories
clf = KMeans(n_clusters=4,random_state=0)

# Fitting the data, training the model
clf.fit(new_data2)
# View classification results
print(len(clf.labels_))

se = pd.Series(clf.labels_)
print(se.value_counts())

The results are as follows:

 
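One caveat: K-means distances here are dominated by mean_salary, which is in the thousands, while work_years, age, and skill_class are small numbers. A hedged sketch of the same clustering with standardised features (toy data, not the report's code) shows the usual remedy:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for new_data2: columns are work_years, age, mean_salary, skill_class
X = np.array([
    [0, 22, 3000, 0],
    [1, 24, 3500, 1],
    [8, 35, 12000, 2],
    [10, 40, 15000, 3],
], dtype=float)

# Standardise each feature so salary does not dominate the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)
clf = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = clf.fit_predict(X_scaled)
print(labels)
```

With scaling, the junior pair and the senior pair land in separate clusters on all four features, not just on salary.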

3.3.3   Visual Display

new_data2 = new_data2.copy()   # avoid SettingWithCopyWarning on the slice
new_data2['label'] = clf.labels_
new_data2.head()
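To interpret the clusters verbally, the per-cluster feature means are the natural summary; a minimal sketch on a toy labelled frame (in the report, apply the same groupby to `new_data2`):

```python
import pandas as pd

# Toy labelled frame: two clusters with obviously different profiles
toy = pd.DataFrame({
    'work_years': [0, 1, 9, 10],
    'mean_salary': [3000, 3500, 12000, 15000],
    'label': [0, 0, 1, 1],
})

# Average profile of each cluster: the basis for a verbal description
profile = toy.groupby('label')[['work_years', 'mean_salary']].mean()
print(profile)
```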

  Column chart:

xl_cnt2 = new_data2.groupby('label')['label'].count()
xl_cnt2 = xl_cnt2.sort_values()
fig = plt.figure(figsize=(12,5))
plt.bar(xl_cnt2.index,xl_cnt2.values)
plt.xlabel('clusters')
plt.xticks(rotation=90)
plt.title('K-means')

Pie chart:

# Pie chart of cluster sizes (counts taken from value_counts above)
labels = ['0', '1', '2', '3']

x = [818, 504, 3, 105]
explode = (0.3, 0, 0, 0)

plt.pie(x,labels=labels,autopct='%.0f%%',explode=explode)
plt.axis('equal')
plt.show()

The results show that:

Matching degree 0 covers 818 applicants, about 57%;

Matching degree 1 covers 504, about 35%;

Matching degree 2 covers 3, roughly 0%;

Matching degree 3 covers 105, about 7%.

 

Posted by webren on Wed, 17 Nov 2021 10:53:06 -0800