04 data analysis and mining xgboost&git

Keywords: git Data Analysis

Data analysis and mining xgboost&git

git address


Upload files to GitHub for the first time

  • Enter the managed folder

  • Open git bash

  • Initialization command

    git init
  • View file status in directory

    git status
    # New or modified files are red
  • Manage specified files (red to green)

    git add file name
    git add .      # All documents
  • Personal information configuration: user name, mailbox (only once)

    git config --global user.email "you@example.com"
    git config --global user.name "your name"
  • Build version

    git commit -m 'Description information'
  • View version record

    git log
  • Connect GitHub

    # Now you need to verify with token and ssh
    # Alias remote warehouse
    git remote add origin Remote warehouse address
    # Push code to remote warehouse (main branch by default)
    git push -u origin branch

I have uploaded the first version. You just clone my version to get a folder, create a new folder in this folder, complete your tasks and avoid conflicts

Download the code on your computer

# Clone remote warehouse code
git clone Remote warehouse address (no initialization at this time)
# Use mirror
git clone https://github.com.cnpmjs.org/name/project.git
# Put your completed tasks in a folder
# After completing the task
# My code already has two branches, which are cloned. By default, it is in the main branch. We need to switch branches and enter the second branch. My second branch is the dev branch
git checkout dev
git status
git add .
git commit -m "Description information"
# After completing the task, push the code to the remote end
# Note that you only need to push the dev branch to see which branch you are in
git push origin dev  # Only after collaboration can it be uploaded to other people's GitHub

Design of predicting user churn

Application scenario

Application scenario: the business department wants the data Department to analyze the lost users
 For example, what are the most significant characteristics of lost customers? Under what characteristics and conditions, users are more likely to lose behavior and give it to the business department
 Business optimization and recovery actions for these users.
Based on the above requirements: characteristics of data analysis:
This is an analysis of feature extraction. The target deliverables are the importance of features and feature rules
 Decision tree is the best algorithm to interpret rules
 Business departments need to understand the relationship between rules and provide rule diagrams
 There will be sample distribution imbalance in the data probability, because there must be a small number of lost users

Pandas                python      Do data reading and basic processing
sklearn of trian_test_split        The segmentation data set is divided into training set and test set
XGBoost                          Classification algorithm is used for model training and rule processing
sklearn.metrics in            Multiple evaluation indicators XGBoost Effect of model
imblearn.over_sampling              Medium SMOTE Sample equalization in the library
matplotlib                          Graphics output, used with
prettytable                         Table format output display
GraphViz                            The third-party program for vector graph output is python To provide an interface, you need to download and configure environment variables
pydot                               XGBoost Used when displaying tree graphics

Install the necessary libraries


Ensemble learning algorithm
	bagging(Random forest)
    	gbdt Gradient lifting algorithm, xgboost belong to gdbt algorithm

xgboost:Addition model and forward optimization algorithm (it is an integrated learning algorithm of regression tree and classification tree)  	(Multiple weak learners integrated together)
# Download the installation package of the corresponding python version from the website (my version is 3.7, and the installation package has been put into the folder)
# Put the downloaded installation package into the corresponding Python interpreter scripts, such as G:\python_learn\venv\Scripts
# cmd enter Scripts: pip install xgboost-1.2.1-cp37-cp37m-win_amd64.whl (installation package name)
# Installation succeeded


  • GraphViz download website
  • The installation package has been put into the file case
  • You need to configure environment variables to use
  • User environment variables are as follows

  • System environment variables are as follows

  • Add the bin directory of Graphviz under path in the system environment variable


pip install prettytable


pip install pydot


pip install Library name

Problems needing attention


  • xgboost algorithm is tolerant, does not deal with null values, and will deal with them effectively
  • xgboost itself can effectively select features and process them without dimensionality reduction
  • Although xgboost can not deal with null values, we study the loss of users. There must be a small number of lost customers, and there will be uneven data, so we need to balance the data. The balance processing enforces that there can be no null value, so we need to process the null value
  • After the mean processing, a numpy matrix is returned, and the features have been lost. We need to label the data again

Training model

  • random_ The state parameter is very important. For the random forest model, it is random in nature. Setting different random states (or not setting the random_state parameter) can completely change the built model

Confusion matrix

  • The confusion matrix output is a matrix without any labels. We need to format it and output it
  • Confusion matrix drawing


  • The model trained by xgboost comes with some visualization methods. xgb.plot_importance outputs the importance of the feature, xgb.to_graphviz outputs a tree rule graph. These two methods can draw directly by giving the trained model and some parameters, without giving the analyzed data.
  • pyechart drawing needs data. By observing the source code, xgboost model can obtain the data obtained after analysis in the following ways
# importance_type = weight is the number of times the feature appears in the tree (the number of splits), gain is the average gain of feature splitting, and cover is the proportion of samples covered as split nodes
importance =model_xgb.get_booster().get_score(
    importance_type='weight', fmap='')
  • After getting the analyzed data and processing the data into a certain format, you can draw the Pye charts

Explanation of tree rule diagram

  • By analyzing the tree rule chart, we can see that from the upper level to the lower level, with the increase of conditions, the greater the probability of losing users (accurately determining users), the more rules, the less the total sample size covered and the number of lost users
  • Leaf (similar to a weak classifier, or a predicted value, a predicted value can be obtained by transforming leaf)

github display picture problem

Show pictures in github folder

  • Open the hosts file under hosts: (C:\Windows\System32\drivers\etc)
  • Edit with notepad and add the following code
# GitHub Start 
#    github.com 
#    gist.github.com    assets-cdn.github.com    raw.githubusercontent.com    gist.githubusercontent.com    cloud.githubusercontent.com    camo.githubusercontent.com    avatars0.githubusercontent.com    avatars1.githubusercontent.com    avatars2.githubusercontent.com    avatars3.githubusercontent.com    avatars4.githubusercontent.com    avatars5.githubusercontent.com    avatars6.githubusercontent.com    avatars7.githubusercontent.com    avatars8.githubusercontent.com
 # GitHub End
  • Just refresh github

Display pictures in github readme file (written by markdown)

  • To create a folder dedicated to storing pictures in the project folder.
  • Open the picture in GitHub and copy the web address
  • Use this URL when writing and inserting pictures in markdown. GitHub readme file can display pictures
  • However, the local readme file cannot display the image because the path cannot be found
  • You need to change the blob in the web address to raw, and both can display pictures

Posted by aaaaCHoooo on Tue, 09 Nov 2021 22:16:36 -0800