Python user churn data mining: logistic regression, XGBoost, random forest, decision tree, support vector machine, naive Bayes models and K-means user profiling

Keywords: Machine Learning AI Data Mining

1.1 Project background:

In today's highly homogeneous brand marketing environment, competition between enterprises is mainly competition for customers. The idea that "the customer is God" drives many enterprises to win as many customers as possible at any cost. However, while acquiring new users at all costs, enterprises often overlook, or have no time to address, the loss of existing customers. This leads to a dilemma: new customers keep arriving while hard-won customers are quietly slipping away. It is therefore important and urgent to analyze the churn of existing users and mine the information that helps decision-makers take measures to reduce it.

1.2 Purpose:

Gain a deep understanding of user profiles and behavior preferences, mine the key factors affecting user churn, and predict the conversion outcome of customer visits with algorithms, so as to improve product design and user experience.

1.3 Data description:

The data is one week of access data from Ctrip users. To protect customer privacy, the data has been desensitized, so it differs somewhat from the actual order volume, page views and conversion rate, but this does not affect the solvability of the problem.

2 Read the data


# Show all features

3 Split the data

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

3.1 Understand the data

There are many variables, so start by classifying them. Excluding the target variable label, the fields of this dataset fall into three categories: order-related indicators, customer-behavior indicators and hotel-related indicators.

4 Feature engineering


# Explore the data using the training set
train = pd.concat([X_train, y_train], axis=1)

4.1 Data preprocessing

4.1.1 Delete unnecessary columns


4.1.2 Data type conversion

String features need to be converted to numeric types before modeling. Subtracting d from arrival gives the number of days booked in advance, which is added as a new feature.

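# Add column
# Convert the two date variables from strings to datetime
X_train["arrival"] = pd.to_datetime(X_train["arrival"])
X_test["arrival"] = pd.to_datetime(X_test["arrival"])
X_train["d"] = pd.to_datetime(X_train["d"])
X_test["d"] = pd.to_datetime(X_test["d"])
# Generate the days-booked-in-advance column (derived variable)
X_train["day_advanced"] = (X_train["arrival"] - X_train["d"]).dt.days
X_test["day_advanced"] = (X_test["arrival"] - X_test["d"]).dt.days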

##  Delete column

4.1.3 Generate indicator dummy variables for variables with missing values


# Columns that contain at least one missing value
zsl = train.isnull().sum()[train.isnull().sum() != 0].index
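
The indicator step itself is not shown above, so here is a minimal sketch of one way to do it, assuming a small toy frame in place of the real training set. The `_na` suffix and the toy values are hypothetical and not taken from the original code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training set (values are made up).
toy = pd.DataFrame({
    "ordernum_oneyear": [3.0, np.nan, 1.0, np.nan],
    "cr": [0.2, 0.5, np.nan, 0.1],
    "h": [8, 12, 20, 23],
})

# Columns that contain at least one missing value (same idea as `zsl`).
na_counts = toy.isnull().sum()
cols_with_na = na_counts[na_counts != 0].index

# Add one 0/1 indicator column per affected feature.
# The "_na" suffix is a hypothetical naming choice.
for col in cols_with_na:
    toy[col + "_na"] = toy[col].isnull().astype(int)

print(toy)
```

Creating the indicators before any filling keeps the "was missing" signal available to the models even after the gaps are replaced.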

4.1.4 Fill missing values based on business experience

ordernum_oneyear (the user's annual order count) is filled with 0. lasthtlordergap (the time since the last order) is filled with 600000 for both the 11% group and the 88% group. ordercancelednum is filled with 0, and ordercanceledpriority is filled with 0.
The order rate is empty in 242114 rows. There are two cases: 1. new users who have never placed an order (88.42%, 214097 rows); 2. old users who have not ordered for more than 1 year. An indicator column is added for new users who have never ordered and for old users who have not ordered for 1 year.
price_sensitive (price sensitivity index) and consuming_capacity (consumption capacity index) are filled with 0 or the median. 226108 rows are empty: 1. new users who have never placed an order (214097 rows); 2. 12011 rows that are empty for unclear reasons.
uv_pre (historical UV of the hotel viewed most in the past 24 hours) and cr_pre (historical CR of the hotel viewed most in the past 24 hours) are filled with 0 or the median. 29397 rows are empty: 1. the user did not log in to the app that day (28633 rows); 2. the hotel had just launched and has no UV or CR record (764 rows, listed as 178 and 586). An indicator column is added for hotels that had just launched on the app.
customereval_pre2 (average customer rating of the hotels browsed in the past 24 hours) is filled with 0. landhalfhours (login duration within 24 hours) is filled with 0; its 28633 empty rows are users who did not log in to the app that day.
hotelcr and hoteluv are filled with the median (797 rows).

Hotel had just launched: 60 rows; user did not log in to the app: 118 rows.

avgprice is partly filled with 0 (users who have not placed an order in recent years), and cr is filled with 0.

tkq = ["historyvisit_7ordernum", "historyvisit_visit_detailpagenum", "firstorder_bu", "historyvi
# tbkq = ["historyvisit_7ordernum", "historyvisit_visit_detailpagenum"]

## Some are filled with 0 and some with the median
# Attributes affected by new users: price_sensitive, consuming_capacity
n_l = ["price_sensitive", "consuming_capacity"]
for i in n_l:

4.1.5 Outlier handling

Negative values in customer_value_profit and ctrip_profits are set to 0.
Negative values in delta_price1, delta_price2 and lowestprice are replaced with the median.

for f in filter_two:
    a = X_train[f].median()
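
The loop above is cut off after the median is computed. A minimal sketch of both rules on toy data follows; `filter_one` is a hypothetical name for the "set to 0" group, and the values are made up.

```python
import pandas as pd

# Toy data with a few negative (abnormal) values.
toy = pd.DataFrame({
    "customer_value_profit": [1.5, -0.3, 2.0, 0.7],
    "delta_price1": [10.0, -5.0, 30.0, 20.0],
})

# Rule 1: negative values are set to 0.
filter_one = ["customer_value_profit"]  # hypothetical list name
for f in filter_one:
    toy.loc[toy[f] < 0, f] = 0

# Rule 2: negative values are replaced with the column median.
filter_two = ["delta_price1"]
for f in filter_two:
    a = toy[f].median()  # median taken before replacement, as in the loop above
    toy.loc[toy[f] < 0, f] = a

print(toy)
```

On real data the median would be taken from the training set and reused for the test set, so that no test information leaks into preprocessing.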

4.1.6 Missing value filling

Fields that are approximately normally distributed are filled with the mean: businessrate_pre2, cancelrate_pre, businessrate_pre. Fields with a skewed distribution are filled with the median.

def na_fill(df):
    for col in df.columns:
        mean = X_train[col].mean()
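
Since the body of `na_fill` is truncated, the following is only a sketch of how the mean/median rule could be written. The extra parameters `stats_from` and `mean_cols`, and the toy values, are assumptions for illustration and not the original signature.

```python
import numpy as np
import pandas as pd

def na_fill(df, stats_from, mean_cols):
    """Fill NaNs with statistics computed on `stats_from` (the training set).

    Columns listed in `mean_cols` get the mean; all others get the median.
    """
    out = df.copy()
    for col in out.columns:
        if col in mean_cols:
            fill_value = stats_from[col].mean()
        else:
            fill_value = stats_from[col].median()
        out[col] = out[col].fillna(fill_value)
    return out

train_toy = pd.DataFrame({"businessrate_pre": [0.2, 0.4, np.nan, 0.6],
                          "commentnums": [1.0, 2.0, 100.0, np.nan]})
test_toy = pd.DataFrame({"businessrate_pre": [np.nan, 0.5],
                         "commentnums": [np.nan, 3.0]})

mean_cols = ["businessrate_pre"]  # roughly normal; the rest are treated as skewed
train_filled = na_fill(train_toy, train_toy, mean_cols)
test_filled = na_fill(test_toy, train_toy, mean_cols)
print(test_filled)
```

The median is the safer default for skewed fields because a single extreme value (100.0 in the toy column) barely moves it.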

## Derived variable: annual transaction rate
X_all = pd.concat([X_train, X_test])
# Decision tree test

dt = DecisionTreeClassifier(random_state=666)
dt.fit(X_train, y_train)

pre = dt.predict(X_test)
pre_prob = dt.predict_proba(X_test)[:, 1]

4.2 Data standardization


scaler = MinMaxScaler()

# Decision tree test
dt = DecisionTreeClassifier(random_state=666)
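
The scaling calls are not shown above. As a minimal sketch, assuming the usual pattern of fitting the scaler on the training set only and reusing it on the test set (the toy arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test_toy = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
# Learn each column's min and max from the training data only.
X_train_scaled = scaler.fit_transform(X_train_toy)
# Apply the same min and max to the test data.
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled)
print(X_test_scaled)
```

Test values can fall outside [0, 1] when they exceed the training range, as the second column of the test row does here.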

5 Feature selection

5.1 Feature selection: drop 30% of the columns

X_test = X_test.iloc[:, sp.get_support()]
# Decision tree test
dt = DecisionTreeClassifier(random_state=666).fit(X_train, y_train)
pre = dt.predict(X_test)
pre_prob = dt.predict_proba(X_test)[:, 1]
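
The definition of `sp` is not shown above. One selector that supports `get_support()` and keeps a fixed share of columns is scikit-learn's `SelectPercentile`; the sketch below assumes that choice, the `f_classif` score function and synthetic data, none of which are confirmed by the original.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in for the training data.
X_arr, y_toy = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=666)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Keep the top 70% of features by ANOVA F-score, i.e. drop 30% of the columns.
sp = SelectPercentile(f_classif, percentile=70)
sp.fit(X_toy, y_toy)

# get_support() returns a boolean mask over the columns.
X_selected = X_toy.iloc[:, sp.get_support()]
print(list(X_selected.columns))
```

The selector should be fitted on the training set, and the same mask then applied to both the training and the test columns.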


5.2 Collinearity / data correlation


# Collinearity: correlations above 0.9 indicate severe collinearity; merge or delete
d = X_train.corr()
d[d < 0.9] = 0  # Zero out the rest so that only highly correlated variables stand out
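
To turn the masked matrix into an explicit list of candidate pairs, a small sketch on toy data follows. Using the absolute correlation, so that strong negative correlation is also caught, is an assumption added here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.05, size=500),  # nearly a copy of "a"
    "b": rng.normal(size=500),                        # independent of "a"
})

corr = toy.corr().abs()
cols = list(corr.columns)

# Walk the upper triangle so each pair is reported once.
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > 0.9
]
print(high_pairs)
```

From each reported pair, one variable can be dropped or the two can be merged into a single derived feature.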

6 Modeling and model evaluation

6.1 Logistic regression


y_prob = lr.predict_proba(X_test)[:, 1]
y_pred = lr.predict(X_test)
fpr_lr, tpr_lr, threshold_lr = metrics.roc_curve(y_test, y_prob)
auc_lr = metrics.auc(fpr_lr, tpr_lr)
score_lr = metrics.accuracy_score(y_test, y_pred)
print("The accuracy of the model is: {0}, AUC score: {1}".format(score_lr, auc_lr))
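
The same evaluation steps are repeated for every model in this section, so they can be wrapped in a helper. The sketch below uses synthetic data; the `evaluate` function and the `LogisticRegression(max_iter=1000)` settings are assumptions, since the original model definition is not shown.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def evaluate(model, X_test, y_test):
    """Return (accuracy, AUC) of a fitted binary classifier on the test set."""
    y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1
    y_pred = model.predict(X_test)               # hard predictions
    fpr, tpr, _ = metrics.roc_curve(y_test, y_prob)
    return metrics.accuracy_score(y_test, y_pred), metrics.auc(fpr, tpr)

# Synthetic stand-in for the churn data.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=666)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=666)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
score_lr, auc_lr = evaluate(lr, X_te, y_te)
print("The accuracy of the model is: {0}, AUC score: {1}".format(score_lr, auc_lr))
```

Accuracy depends on the 0.5 decision threshold while AUC does not, which is why both are reported.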

6.2 Naive Bayes


gnb = GaussianNB()  # Instantiate a naive Bayes model
gnb.fit(X_train, y_train)  # Train the model
y_prob = gnb.predict_proba(X_test)[:, 1]  # Predicted probability of class 1
y_pred = gnb.predict(X_test)  # Model predictions on the test set
fpr_gnb, tpr_gnb, threshold_gnb = metrics.roc_curve(y_test, y_prob)  # False positive rate, true positive rate and thresholds
auc_gnb = metrics.auc(fpr_gnb, tpr_gnb)  # AUC score
score_gnb = metrics.accuracy_score(y_test, y_pred)  # Model accuracy

6.3 Support vector machine


svc = SVC(kernel='rbf', C=, max_iter=10, random_state=66).fit(X_train, y_train)
y_prob = svc.decision_function(X_test)  # Distance to the decision boundary
y_pred = svc.predict(X_test)  # Model predictions on the test set
fpr_svc, tpr_svc, threshold_svc = metrics.roc_curve(y_test, y_prob)  # False positive rate, true positive rate and thresholds
auc_svc = metrics.auc(fpr_svc, tpr_svc)  # AUC score
score_svc = metrics.accuracy_score(y_test, y_pred)  # Model accuracy

6.4 Decision tree

dtc.fit(X_train, y_train)
y_prob = dtc.predict_proba(X_test)[:, 1]  # Predicted probability of class 1
y_pred = dtc.predict(X_test)  # Model predictions on the test set
fpr_dtc, tpr_dtc, threshold_dtc = metrics.roc_curve(y_test, y_prob)  # False positive rate, true positive rate and thresholds

6.5 Random forest


rfc = RandomForestClassifier(random_state=666)  # Build the random forest model
rfc.fit(X_train, y_train)  # Train the random forest model
y_prob = rfc.predict_proba(X_test)[:, 1]  # Predicted probability of class 1
y_pred = rfc.predict(X_test)  # Model predictions on the test set
fpr_rfc, tpr_rfc, threshold_rfc = metrics.roc_curve(y_test, y_prob)  # False positive rate, true positive rate and thresholds
auc_rfc = metrics.auc(fpr_rfc, tpr_rfc)  # AUC score
score_rfc = metrics.accuracy_score(y_test, y_pred)  # Model accuracy

6.6 XGBoost


# Read the training and test data sets
# Set the xgboost modeling parameters
params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'eval_metric': 'auc'

# Train the model
watchlist = [(dtrain, 'train')]
# Output the predicted probability of the positive class
# Set the threshold to 0.5 to get the predictions on the test set
y_pred = (y_prob >= 0.5) * 1
# Obtain the false positive rate, true positive rate and thresholds
fpr_xgb, tpr_xgb, threshold_xgb = metrics.roc_curve(y_test, y_prob)
auc_xgb = metrics.auc(fpr_xgb, tpr_xgb)  # AUC score
score_xgb = metrics.accuracy_score(y_test, y_pred)  # Model accuracy
print('The accuracy of the model is: {0}, AUC score: {1}'.format(score_xgb, auc_xgb))

6.7 Model comparison


plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.savefig('Model comparison diagram.jpg', dpi=400, bbox_inches='tight')
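
The code that draws each curve is not shown above. As a sketch of the data behind such a comparison, the loop below collects the ROC points and AUC for several of the models on synthetic data; the dictionary names are hypothetical, and each `(fpr, tpr)` pair could be passed to `plt.plot` before the labelling calls above.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=666)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=666)

models = {
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=666),
    "random forest": RandomForestClassifier(random_state=666),
}

roc_points = {}   # model name -> (fpr, tpr) arrays for plotting
auc_scores = {}   # model name -> AUC
for name, model in models.items():
    y_prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_te, y_prob)
    roc_points[name] = (fpr, tpr)
    auc_scores[name] = metrics.auc(fpr, tpr)

# Print the models from best to worst AUC.
for name, auc in sorted(auc_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: AUC = {auc:.3f}")
```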

6.8 Important features

fea = pd.Series(dict(list(zip(X_train.columns, rfc.feature_importances_))))
fea_s = (fea.sort_values(ascending=False)).index
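
A self-contained sketch of the same ranking on synthetic data is shown below. The column names `f0` to `f5` are made up, and a random forest is assumed as the source of the importances.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=666)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rfc = RandomForestClassifier(random_state=666).fit(X_toy, y_toy)

# Pair each column name with its impurity-based importance and sort descending.
fea = pd.Series(rfc.feature_importances_, index=X_toy.columns)
fea_sorted = fea.sort_values(ascending=False)
print(fea_sorted)
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in the forest.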


6.9 Churn cause analysis

  • When cityuvs and cityorders are small, churn is significantly above average. This suggests that the Ctrip platform lacks hotel information for small cities, so users turn to competing products with more complete small-city coverage.
  • When the access time is between 7:00 and 19:00, the churn rate is above average, so pushes on working days should avoid these hours.
  • When the hotel business attribute index is in the range 0.3-0.9, churn is above average and rising. This suggests a gap between users' expectations and the hotels with a high business index on the platform (prices too high, or other reasons?). Users with a low business attribute churn less.
  • Within one year, the shorter the time since the last order, the more severe the churn. This is attributed to the negative news about Ctrip that broke between May 2015 and January 2016; the company should strengthen its own management and build a good public image.
  • Churn is severe among users with a low consumption capacity index (10-40). These users account for 50% of the total and deserve attention.
  • Churn among users with a price sensitivity index of 5-25 is above average; pay attention to hotel quality.
  • The higher the user conversion rate, the user's annual order count and the user's historical order count in the past year, the larger the share of users who do not visit the order-filling page within 24 hours, and the more severe the churn. Follow up with users after they place an order, invite them to write about their stay, collect their feedback and make improvements.
  • The fewer days in advance a booking is made, the more severe the churn. The more orders a user cancels in a year, the more severe the churn.

6.10 Suggestions:

  • Consider capturing market share in third- and fourth-tier cities and among low-end hotels.
  • Users are easily influenced by negative news about the company. The company should take on social responsibility, strengthen its own management, respond to public relations issues more quickly, and build a good brand image.
  • Start pushing hotels near popular scenic spots 2-3 weeks before a holiday.
  • Follow up with users after a hotel order, invite them to write about their stay, and collect their comments for improvement.

7 Customer profile

7.1 Modeling user classification

# User profile features
user_feature = ["decisionhabit_user", 'starprefer', 'lastpvgap', 'sid',
# Features that affect churn
fea_lis = ["cityuvs",

# Data standardization: K-means works better on normally distributed data
scaler = StandardScaler()
lo_attributes = pd.DataFrame(scaler.fit_transform(all_cte), columns=all_cte.columns)

# Modeling classification
Kmeans = KMeans(n_clusters=2, random_state=0)
Kmeans.fit(lo_attributes)  # Train the model
k_char = Kmeans.cluster_centers_  # Get the center of each category
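
An end-to-end sketch of the standardize, cluster and summarize steps on synthetic data follows. The two toy features `visit_freq` and `avg_spend`, the 80/20 group sizes and `n_init=10` are assumptions for illustration, not values from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two well-separated toy groups: 80 low-activity users and 20 high-activity users.
toy = pd.DataFrame({
    "visit_freq": np.concatenate([rng.normal(2, 0.5, 80), rng.normal(10, 0.5, 20)]),
    "avg_spend": np.concatenate([rng.normal(200, 30, 80), rng.normal(900, 30, 20)]),
})

# Standardize so that both features contribute equally to the distance.
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Cluster centers in standardized units, one row per user type.
centers = pd.DataFrame(kmeans.cluster_centers_, columns=toy.columns)
# Share of users in each cluster, the numbers behind a proportion chart.
shares = pd.Series(kmeans.labels_).value_counts(normalize=True).sort_index()

print(centers)
print(shares)
```

Comparing the rows of `centers` feature by feature is what allows each cluster to be given a name such as "high value" or "potential".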

7.2 Proportion of user types


types = ['High value user', 'Potential users']
ax.pie([1], radius=0., colors='w')
plt.savefig('User portrait.jpg', dpi=400, bbox_inches='tight')

7.3 High-value user analysis:

High-value users account for 19.02% of users. They visit and book frequently, spend at a high level, have high customer value, pursue high quality, prefer high star ratings, and are mostly existing customers.
Recommend more business hotel chains with a good reputation and good value for money to attract these users;
Push messages at 11:00, 17:00 and other minor daytime traffic peaks on non-working days;
Provide these customers with more travel hotel information;
Raise the cost of churning: a membership points system and member discount cards.

7.4 Potential user analysis

Potential users account for 80.98% of users. Their visit and booking frequency is low, their consumption level is low, their hotel star rating requirements are modest, they are mostly new customers, and their customer value is yet to be developed. Suggestions:
Because most new users are potential customers, make the most of the user's first experience (for example first-purchase discounts and check-in activities) and regularly push affordable hotels to these users to build a spending habit;
Push content should mostly feature big price cuts, big rewards, rock-bottom prices and the like;
Because these users make up such a large share, the factors behind customer loss can be analyzed together with this group's churn, exploring the market for this group, further analyzing lower-tier markets and looking for new opportunities.

Posted by jds580s on Wed, 17 Nov 2021 00:26:02 -0800