NO.89 - Applying Xgboost to Predict Insurance Compensation




1 Data analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score as AUC
from sklearn.metrics import mean_absolute_error
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.model_selection import cross_val_score

from scipy import stats
import seaborn as sns
from copy import deepcopy

%matplotlib inline

# This may raise an exception in earlier versions of Jupyter
%config InlineBackend.figure_format = 'retina'

1.1 First look at what the data looks like

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.shape

(188318, 132)

print ('First 20 columns:', list(train.columns[:20]))

print ('Last 20 columns:', list(train.columns[-20:]))

First 20 columns: ['id', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8', 'cat9', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19']
Last 20 columns: ['cat112', 'cat113', 'cat114', 'cat115', 'cat116', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14', 'loss']

We can see that there are 116 categorical attributes (as their names show) and 14 continuous (numeric) attributes. In addition, there are the ID and the loss (compensation) columns, for a total of 132 columns.



The summary statistics show that all continuous features have been scaled to the [0, 1] interval, with means close to 0.5. The data has in fact already been preprocessed; what we get is feature data.
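Summary statistics of this kind can be obtained with pandas' describe(). The following is a minimal sketch, assuming a small synthetic frame in place of train.csv; the column names cont1 and cont2 mimic the dataset, but the values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the continuous columns of train.csv (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'cont1': rng.uniform(0, 1, size=1000),
    'cont2': rng.uniform(0, 1, size=1000),
})

# describe() reports count, mean, std, min, quartiles and max per column
summary = demo.describe()
print(summary.loc[['mean', 'min', 'max']])
```

On the real data, the same call would be train.describe().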

View missing values
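The code for this check is not shown here. A minimal sketch of how missing values could be checked with pandas, assuming a toy frame with made-up values rather than the competition file:

```python
import pandas as pd

# Toy frame with one missing entry (made-up data, not the competition file)
demo = pd.DataFrame({'cat1': ['A', 'B', None], 'cont1': [0.2, 0.5, 0.9]})

# True if any cell in the frame is missing
has_missing = demo.isnull().values.any()

# Number of missing cells per column
missing_per_column = demo.isnull().sum()

print(has_missing)
print(missing_per_column)
```

Applied to the real data, the same expressions would use train in place of demo.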



1.2 Continuous and categorical variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188318 entries, 0 to 188317
Columns: 132 entries, id to loss
dtypes: float64(15), int64(1), object(116)
memory usage: 189.7+ MB

Here, float64(15) covers the 14 continuous variables plus the loss value, int64(1) is the ID, and object(116) covers the categorical variables.

cat_features = list(train.select_dtypes(include=['object']).columns)
print ("Categorical: {} features".format(len(cat_features)))

Categorical: 116 features

cont_features = [cont for cont in list(train.select_dtypes(
                 include=['float64', 'int64']).columns) if cont not in ['loss', 'id']]
print ("Continuous: {} features".format(len(cont_features)))

Continuous: 14 features

id_col = list(train.select_dtypes(include=['int64']).columns)
print ("A column of int64: {}".format(id_col))

A column of int64: ['id']

1.3 Number of attributes in categorical variables

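cat_uniques = []
for cat in cat_features:
    # Count the distinct values of each categorical feature
    cat_uniques.append(len(train[cat].unique()))

# DataFrame.from_items was removed in pandas 1.0, so build the frame from a dict
uniq_values_in_categories = pd.DataFrame({'cat_name': cat_features, 'unique_values': cat_uniques})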

1.4 Compensation value

plt.plot(train['id'], train['loss'])
plt.title('Loss values per id')

Several pronounced peaks in the loss values correspond to severe accidents. Such a heavily skewed target distribution tends to hurt regression performance.

Skewness measures the asymmetry of the probability distribution of a real-valued random variable about its mean. Let's calculate the skewness of the loss:



A skewness greater than 1 indicates that the data is indeed heavily skewed.

A logarithmic transformation of the data, using np.log, usually reduces the skew.
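The skewness computation itself is not shown here. A minimal sketch of the before-and-after comparison, assuming scipy.stats.skew and synthetic log-normal values in place of the real loss column:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "loss" values (log-normal); not the real data
rng = np.random.default_rng(0)
loss = rng.lognormal(mean=7.7, sigma=0.8, size=10000)

skew_raw = stats.skew(loss)          # strongly positive for a long right tail
skew_log = stats.skew(np.log(loss))  # close to 0 once the tail is compressed

print(skew_raw, skew_log)
```

On the real data, the two calls would be stats.skew(train['loss']) and stats.skew(np.log(train['loss'])).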



fig, (ax1, ax2) = plt.subplots(1,2)
ax1.hist(train['loss'], bins=50)
ax1.set_title('Train Loss target histogram')
ax2.hist(np.log(train['loss']), bins=50, color='g')
ax2.set_title('Train Log Loss target histogram')

1.5 Continuous Variable Characteristics

train[cont_features].hist(bins=50, figsize=(16,12))

1.6 Correlation between features

correlation_mat = train[cont_features].corr()
sns.heatmap(correlation_mat, annot=True)

2 Xgboost

import xgboost as xgb
import pandas as pd
import numpy as np
import pickle
import sys
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from xgboost import XGBRegressor

import warnings

%matplotlib inline

# This may raise an exception in earlier versions of Jupyter
%config InlineBackend.figure_format = 'retina'

2.1 Data Preprocessing

train = pd.read_csv('train.csv')

Make a logarithmic transformation

train['log_loss'] = np.log(train['loss'])

Data is divided into continuous and discrete features

features = [x for x in train.columns if x not in ['id','loss', 'log_loss']]
# Categorical features (cat columns)
cat_features = [x for x in train.select_dtypes(
        include=['object']).columns if x not in ['id','loss', 'log_loss']]
# Numeric features (cont columns)
num_features = [x for x in train.select_dtypes(
        exclude=['object']).columns if x not in ['id','loss', 'log_loss']]

print ("Categorical features:", len(cat_features))
print ("Numerical features:", len(num_features))

Categorical features: 116
Numerical features: 14

Encoding categorical variables

ntrain = train.shape[0]

# Copy so that the assignments below do not modify a view of train
train_x = train[features].copy()
train_y = train['log_loss']

# Encode each categorical variable as integer codes
for c in cat_features:
    train_x[c] = train_x[c].astype('category').cat.codes

print ("Xtrain:", train_x.shape)
print ("ytrain:", train_y.shape)

Xtrain: (188318, 130)
ytrain: (188318,)
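To make the encoding step concrete, here is a small sketch of what astype('category').cat.codes does, using a toy column with made-up values:

```python
import pandas as pd

# Toy categorical column (made-up values)
demo = pd.DataFrame({'cat1': ['B', 'A', 'C', 'A']})

# Categories are sorted ('A', 'B', 'C') and mapped to 0, 1, 2
codes = demo['cat1'].astype('category').cat.codes

print(codes.tolist())  # [1, 0, 2, 0]
```

Note that integer codes impose an ordering that the categories do not necessarily have. Tree-based models are usually less sensitive to this than linear models.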

2.2 Simple Xgboost model

First, we train a basic XGBoost model, then tune its parameters and observe the effect through cross-validation. The metric is the mean absolute error on the original loss scale:

mean_absolute_error(np.exp(y), np.exp(yhat)).
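Because the model is trained on log(loss), predictions must be exponentiated before the error is computed, otherwise the MAE would be in log units. A small sketch with made-up numbers illustrates the difference:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true losses and predictions, both in log space
y_true_log = np.log(np.array([1000.0, 2000.0, 4000.0]))
y_pred_log = np.log(np.array([1100.0, 1800.0, 4000.0]))

# Error measured on the log scale (hard to interpret)
mae_log = mean_absolute_error(y_true_log, y_pred_log)

# Error measured in the original units after undoing the log transform
mae_original = mean_absolute_error(np.exp(y_true_log), np.exp(y_pred_log))

print(mae_log, mae_original)
```

Here the absolute errors in the original units are 100, 200 and 0, so the second value is their mean.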

XGBoost defines its own data matrix class, DMatrix, which preprocesses the data once at the start of training to make each subsequent iteration more efficient.

def xg_eval_mae(yhat, dtrain):
    y = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))


dtrain = xgb.DMatrix(train_x, train_y)

XGBoost parameters (the values shown in the list are examples)

  • 'booster': 'gbtree'. Use gradient-boosted decision trees as the base learner.
  • 'objective': 'multi:softmax'. The loss function. This value is for multi-class classification; other objectives cover binary classification and regression.
  • 'num_class': 10. The number of classes, used together with multi:softmax.
  • 'gamma': the minimum loss reduction required to split a node further.
  • 'max_depth': 12. The maximum depth of a tree. The deeper the tree, the more complex the model and the more easily it overfits.
  • 'lambda': 2. The L2 regularization weight that controls model complexity. The larger the value, the less likely the model is to overfit.
  • 'subsample': 0.7. The fraction of training samples drawn at random for each tree (row sampling).
  • 'colsample_bytree': 0.7. The fraction of columns sampled when building each tree (feature sampling).
  • 'min_child_weight': 3. The minimum sum of instance weights in a child node. A node is not split further if a resulting child would have a weight sum below min_child_weight.
  • 'silent': 0. Set to 1 to suppress run-time messages; 0 is preferable.
  • 'eta': 0.007. The learning rate, which shrinks the contribution of each newly added tree. Trees are added sequentially, so as a rule of thumb, lower the learning rate as you increase the number of trees.
  • 'seed': 1000. The random seed.
  • 'nthread': 7. The number of CPU threads.

xgb_params = {
    'seed': 0,
    'eta': 0.1,
    'colsample_bytree': 0.5,
    'silent': 1,                # newer XGBoost releases use 'verbosity' instead
    'subsample': 0.5,
    'objective': 'reg:linear',  # newer XGBoost releases call this 'reg:squarederror'
    'max_depth': 5,
    'min_child_weight': 3
}

Using cross-validation


%%time
bst_cv1 = xgb.cv(xgb_params, dtrain, num_boost_round=50, nfold=3, seed=0,
                 feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)

print ('CV score:', bst_cv1.iloc[-1,:]['test-mae-mean'])   # iloc[-1,:] takes the last row

CV score: 1220.054769
CPU times: user 2min 24s, sys: 1.54 s, total: 2min 25s
Wall time: 2min 26s

We have the first baseline result: MAE = 1220.05.

bst_cv1[['train-mae-mean', 'test-mae-mean']].plot()

2.3 First Base Model

In the tree model above:

  • No overfitting occurred
  • Only 50 trees were built

%%time
# Build 100 trees
bst_cv2 = xgb.cv(xgb_params, dtrain, num_boost_round=100,
                 nfold=3, seed=0, feval=xg_eval_mae, maximize=False,
                 early_stopping_rounds=10)

print ('CV score:', bst_cv2.iloc[-1,:]['test-mae-mean'])

CV score: 1171.2875569999999
CPU times: user 4min 47s, sys: 2.05 s, total: 4min 49s
Wall time: 4min 51s

fig, (ax1, ax2) = plt.subplots(1,2)
#100 trees
ax1.set_title('100 rounds of training')
ax1.plot(bst_cv2[['train-mae-mean', 'test-mae-mean']])
ax1.legend(['Training Loss', 'Test Loss'])
#Last 60 Trees
ax2.set_title('60 last rounds of training')
ax2.plot(bst_cv2.iloc[40:][['train-mae-mean', 'test-mae-mean']])
ax2.legend(['Training Loss', 'Test Loss'])

There is a little overfitting, but it is not significant yet.

We have a new best MAE of 1171.29, which is better than the first one (1220.05). Next, we'll tune the other parameters.

2.4 Xgboost Parameter Adjustment

  • Step 1: Select a set of initial parameters
  • Step 2: Change max_depth and min_child_weight.
  • Step 3: Adjusting gamma reduces the risk of model overfitting.
  • Step 4: Adjust subsample and colsample_bytree to change data sampling strategy.
  • Step 5: Adjust the learning rate eta.

class XGBoostRegressor(object):
    def __init__(self, **kwargs):
        self.params = kwargs
        if 'num_boost_round' in self.params:
            self.num_boost_round = self.params['num_boost_round']
        self.params.update({'silent': 1, 'objective': 'reg:linear', 'seed': 0})

    # Train the model
    def fit(self, x_train, y_train):
        dtrain = xgb.DMatrix(x_train, y_train)
        self.bst = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                             feval=xg_eval_mae, maximize=False)

    # Predict with the trained booster
    def predict(self, x_pred):
        dpred = xgb.DMatrix(x_pred)
        return self.bst.predict(dpred)

    # Run k-fold cross-validation and return the metrics of the last round
    def kfold(self, x_train, y_train, nfold=5):
        dtrain = xgb.DMatrix(x_train, y_train)
        cv_rounds = xgb.cv(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                           nfold=nfold, feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)
        return cv_rounds.iloc[-1,:]

    def plot_feature_importances(self):
        feat_imp = pd.Series(self.bst.get_fscore()).sort_values(ascending=False)
        feat_imp.plot(title='Feature Importances')
        plt.ylabel('Feature Importance Score')

    # get_params and set_params are required by GridSearchCV
    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        # Without this update, the grid search could not change any parameter
        self.params.update(params)
        return self
# Evaluation metric: MAE on the original (non-log) loss scale
def mae_score(y_true, y_pred):
    return mean_absolute_error(np.exp(y_true), np.exp(y_pred))

# Wrap the metric as a scorer; greater_is_better=False makes scikit-learn negate it
mae_scorer = make_scorer(mae_score, greater_is_better=False)

bst = XGBoostRegressor(eta=0.1, colsample_bytree=0.5, subsample=0.5, 
                       max_depth=5, min_child_weight=3, num_boost_round=50)
bst.kfold(train_x, train_y, nfold=5)

test-mae-mean 1218.528027
test-mae-std 10.423910
test-rmse-mean 0.562570
test-rmse-std 0.002914
train-mae-mean 1209.757422
train-mae-std 2.306814
train-rmse-mean 0.558842
train-rmse-std 0.000475
Name: 49, dtype: float64

Step 1: Learning Rate and Number of Trees

Step 2: Tree Depth and Node Weight

These parameters have the greatest impact on XGBoost performance, so they should be tuned first. A brief outline:

  • max_depth: The maximum depth of a tree. Increasing this value makes the model more complex and more prone to overfitting; a depth of 3-10 is reasonable.
  • min_child_weight: A regularization parameter. A node is not split further if the split would produce a child whose sum of instance weights is below this value.

xgb_param_grid = {'max_depth': list(range(4,9)), 'min_child_weight': list((1,3,6))}
grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, colsample_bytree=0.5, subsample=0.5),
                param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)
grid.cv_results_, grid.best_params_, grid.best_score_

Best results from grid search:

{'max_depth': 8, 'min_child_weight': 6},

The score is negative because GridSearchCV always maximizes the score, so the MAE scorer (created with greater_is_better=False) returns the negated MAE.
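The sign convention can be checked in isolation. The sketch below uses made-up data and scikit-learn's DummyRegressor (chosen only for illustration, it is not part of the workflow here) to show that a scorer built with greater_is_better=False returns the negated MAE:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error

# Made-up data; DummyRegressor always predicts the training mean (3.0 here)
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 6.0])
model = DummyRegressor(strategy='mean').fit(X, y)

# greater_is_better=False marks the metric as a loss,
# so the scorer returns its negated value
neg_mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

score = neg_mae_scorer(model, X, y)
plain_mae = mean_absolute_error(y, model.predict(X))

print(score, plain_mae)
```

The two printed numbers have the same magnitude and opposite signs, which is why best_score_ is reported as a negative MAE.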

Step 3: Adjusting gamma to reduce the risk of over-fitting


xgb_param_grid = {'gamma':[ 0.1 * i for i in range(0,5)]}

grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.5, subsample=0.5),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)
# grid_scores_ exists only in scikit-learn < 0.20; newer versions expose cv_results_
grid.grid_scores_, grid.best_params_, grid.best_score_

([mean: -1187.95975, std: 6.71340, params: {'gamma': 0.0},
mean: -1187.67788, std: 6.44332, params: {'gamma': 0.1},
mean: -1187.66616, std: 6.75004, params: {'gamma': 0.2},
mean: -1187.21835, std: 7.06771, params: {'gamma': 0.30000000000000004},
mean: -1188.35004, std: 6.50057, params: {'gamma': 0.4}],
{'gamma': 0.30000000000000004},

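The scores for gamma between 0.1 and 0.3 differ by far less than their standard deviation, so we chose a slightly smaller gamma (0.2) for the following steps.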

Step 4: Adjust the sampling parameters subsample and colsample_bytree


xgb_param_grid = {'subsample':[ 0.1 * i for i in range(6,9)],
                      'colsample_bytree':[ 0.1 * i for i in range(6,9)]}

grid = GridSearchCV(XGBoostRegressor(eta=0.1, gamma=0.2, num_boost_round=50, max_depth=8, min_child_weight=6),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)
grid.grid_scores_, grid.best_params_, grid.best_score_

([mean: -1185.67108, std: 5.40097, params: {'colsample_bytree': 0.6000000000000001, 'subsample': 0.6000000000000001},
mean: -1184.90641, std: 5.61239, params: {'colsample_bytree': 0.6000000000000001, 'subsample': 0.7000000000000001},
mean: -1183.73767, std: 6.15639, params: {'colsample_bytree': 0.6000000000000001, 'subsample': 0.8},
mean: -1185.09329, std: 7.04215, params: {'colsample_bytree': 0.7000000000000001, 'subsample': 0.6000000000000001},
mean: -1184.36149, std: 5.71298, params: {'colsample_bytree': 0.7000000000000001, 'subsample': 0.7000000000000001},
mean: -1183.83446, std: 6.24654, params: {'colsample_bytree': 0.7000000000000001, 'subsample': 0.8},
mean: -1184.43055, std: 6.68009, params: {'colsample_bytree': 0.8, 'subsample': 0.6000000000000001},
mean: -1183.33878, std: 5.74989, params: {'colsample_bytree': 0.8, 'subsample': 0.7000000000000001},
mean: -1182.93099, std: 5.75849, params: {'colsample_bytree': 0.8, 'subsample': 0.8}],
{'colsample_bytree': 0.8, 'subsample': 0.8},

_, scores =  convert_grid_scores(grid.grid_scores_)
scores = scores.reshape(3,3)

cp = plt.contourf(xgb_param_grid['subsample'], xgb_param_grid['colsample_bytree'], scores, cmap='BrBG')
plt.title('Subsampling params tuning')
plt.annotate('Optimum', xy=(0.895, 0.6), xytext=(0.8, 0.695), arrowprops=dict(facecolor='black'))
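The helper convert_grid_scores used above is not defined in this article, and grid_scores_ is not available in scikit-learn 0.20 and later. As a rough sketch, a comparable helper could be built on cv_results_. The name convert_cv_results and the fake results dictionary below are hypothetical; only the keys 'params' and 'mean_test_score' are taken from scikit-learn.

```python
import numpy as np

def convert_cv_results(cv_results):
    """Hypothetical helper: flatten GridSearchCV.cv_results_ into arrays.

    Returns (params, scores): the parameter values of every grid point,
    flattened in grid order, and the matching mean test scores.
    """
    params = np.array([list(p.values()) for p in cv_results['params']]).ravel()
    scores = np.array(cv_results['mean_test_score'])
    return params, scores

# Made-up stand-in with the same keys as grid.cv_results_
fake_cv_results = {
    'params': [{'eta': 0.3}, {'eta': 0.2}, {'eta': 0.1}],
    'mean_test_score': [-1170.0, -1161.0, -1184.0],
}

eta, y = convert_cv_results(fake_cv_results)
print(eta, -y)  # -y turns the negated scores back into MAE values
```

For a two-parameter grid such as subsample and colsample_bytree, the returned scores can be reshaped to the grid dimensions before plotting, as done with scores.reshape(3,3) above.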

With the current set of pre-tuned parameters, I get the following result:

({'colsample_bytree': 0.8, 'subsample': 0.8}, -1182.9309918891634)

Step 5: Reduce learning rate and increase number of trees

The final step in parameter optimization is to lower the learning rate while adding more estimators (trees).

1. First, we use 50 trees

xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=50, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)
grid.grid_scores_, grid.best_params_, grid.best_score_

([mean: -1205.85372, std: 3.46146, params: {'eta': 0.5},
mean: -1185.32847, std: 4.87321, params: {'eta': 0.4},
mean: -1170.00284, std: 4.76399, params: {'eta': 0.3},
mean: -1160.97363, std: 6.05830, params: {'eta': 0.2},
mean: -1183.66720, std: 6.69439, params: {'eta': 0.1},
mean: -1266.12628, std: 7.26130, params: {'eta': 0.075},
mean: -1709.15130, std: 8.19994, params: {'eta': 0.05},
mean: -2104.42708, std: 8.02827, params: {'eta': 0.04},
mean: -2545.97334, std: 7.76440, params: {'eta': 0.03}],
{'eta': 0.2},

eta, y = convert_grid_scores(grid.grid_scores_)
plt.title('MAE and ETA, 50 trees')
plt.plot(eta, -y)

{'eta': 0.2}, -1160.9736284869114 is currently the best result
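One way to see why very small eta values score so badly with only 50 trees is a deliberately idealized model. Assume (this is a simplification for illustration, not how XGBoost actually behaves) that every tree fits the current residual perfectly and is then scaled by eta. The fraction of the initial residual left after n rounds is then (1 - eta)^n:

```python
def remaining_residual_fraction(eta, n_rounds):
    """Idealized shrinkage model: each round removes a fraction eta of the residual."""
    return (1.0 - eta) ** n_rounds

for eta in (0.2, 0.1, 0.03):
    for n_rounds in (50, 100, 200):
        frac = remaining_residual_fraction(eta, n_rounds)
        print(f"eta={eta:<5} rounds={n_rounds:<4} residual left: {frac:.4f}")
```

In this idealized picture, eta = 0.03 still leaves roughly a fifth of the initial residual after 50 rounds, which is consistent with the large MAE for small eta in the results above. Real trees fit the residual only approximately, so the numbers are indicative at best.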

2. Now let's increase the number of trees to 100

xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=100, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)
grid.grid_scores_, grid.best_params_, grid.best_score_

([mean: -1231.04517, std: 5.41136, params: {'eta': 0.5},
mean: -1201.31398, std: 4.75456, params: {'eta': 0.4},
mean: -1177.86344, std: 3.67324, params: {'eta': 0.3},
mean: -1160.48853, std: 5.65336, params: {'eta': 0.2},
mean: -1152.24715, std: 5.85286, params: {'eta': 0.1},
mean: -1156.75829, std: 5.30250, params: {'eta': 0.075},
mean: -1184.88913, std: 6.08852, params: {'eta': 0.05},
mean: -1243.60808, std: 7.40326, params: {'eta': 0.04},
mean: -1467.04736, std: 8.70704, params: {'eta': 0.03}],
{'eta': 0.1},

eta, y = convert_grid_scores(grid.grid_scores_)
plt.title('MAE and ETA, 100 trees')
plt.plot(eta, -y)

With more trees, a lower learning rate (eta = 0.1) gives better results.

3. Continue to increase the number of trees to 200

xgb_param_grid = {'eta':[0.09,0.08,0.07,0.06,0.05,0.04]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=200, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)
grid.grid_scores_, grid.best_params_, grid.best_score_

([mean: -1148.37246, std: 6.51203, params: {'eta': 0.09},
mean: -1146.67343, std: 6.13261, params: {'eta': 0.08},
mean: -1145.92359, std: 5.68531, params: {'eta': 0.07},
mean: -1147.44050, std: 6.33336, params: {'eta': 0.06},
mean: -1147.98062, std: 6.39481, params: {'eta': 0.05},
mean: -1153.17886, std: 5.74059, params: {'eta': 0.04}],
{'eta': 0.07},

eta, y = convert_grid_scores(grid.grid_scores_)
plt.title('MAE and ETA, 200 trees')
plt.plot(eta, -y)

3 Summary


# Final XGBoost model

bst = XGBoostRegressor(num_boost_round=200, eta=0.07, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9)
cv = bst.kfold(train_x, train_y, nfold=5)

test-mae-mean 1146.997852
test-mae-std 9.541592
train-mae-mean 1036.557251
train-mae-std 0.974437
Name: 199, dtype: float64

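We see that the best eta for 200 trees is 0.07. As expected, the relationship between eta and num_boost_round is not linear, but the two are clearly related: more trees favor a lower learning rate.

We spent a considerable amount of time tuning XGBoost. Starting from a baseline MAE of about 1220 (50 trees, initial parameters), tuning brought the cross-validated MAE down to about 1147 (1145.92 in the grid search, 1147.00 in the final 5-fold run).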

We also found a relationship between ETA and num_boost_round:

  • 100 trees, eta=0.1: MAE=1152.247
  • 200 trees, eta=0.07: MAE=1145.92

XGBoostRegressor(num_boost_round=200, gamma=0.2, max_depth=8, min_child_weight=6, colsample_bytree=0.6, subsample=0.9, eta=0.07)


Posted by rimelta on Wed, 11 Mar 2020 21:00:47 -0700