After learning the basics of data analysis, I want to consolidate that knowledge through hands-on practice with an introductory data competition. Here I share some beginner-friendly kernels on Kaggle, together with the resources I used and my learning process.
Home Credit Default Risk
Address: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
Author's homepage: https://www.kaggle.com/willkoehrsen
Background: Home Credit uses a variety of alternative data (including telecommunications and transaction information) to predict the repayment ability of its customers.
Data resources: https://pan.baidu.com/s/1uaMESw1ca_Y9O3YVfrSrEA
Extraction code: d13i
Content: Based on the original author's kernel, combined with my own understanding of the relevant knowledge points and some reflections from my own hands-on work. Suitable for beginners.
Previous links:
Kaggle_Kernel Learning_Home Credit Default Risk_EDA
In the previous post we explored the variables with the strongest positive and negative correlations with the target, and discussed some of the feature engineering and machine learning work that follows from those findings.
Feature engineering is a fundamental step. It covers feature construction (adding new features derived from the existing data) and feature selection (keeping only the most important features, or applying other dimensionality-reduction methods). Many techniques can be used to create and select features; this post mainly uses polynomial features and domain-knowledge features to construct and adjust the feature set.
1. Generating features with PolynomialFeatures
With this method, the features we create are powers of the existing strongly correlated features and the interaction terms between them. For example, we can create the variables EXT_SOURCE_1^2 and EXT_SOURCE_2^2, as well as variables such as EXT_SOURCE_1 x EXT_SOURCE_2, EXT_SOURCE_1 x EXT_SOURCE_2^2, EXT_SOURCE_1^2 x EXT_SOURCE_2^2, and so on. Features composed of multiple individual variables are called interaction terms because they capture the interaction between variables: even if two variables individually have little effect on the target, combining them into an interaction term may reveal a relationship with the target.
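To make the idea concrete, here is a minimal hand-built sketch of a single interaction term (illustration only; the new column name is my own, and in the cells below this is automated with PolynomialFeatures):

# Build one interaction term by hand (illustrative sketch only)
interactions = data_train[['EXT_SOURCE_1', 'EXT_SOURCE_2']].copy()
interactions['EXT_SOURCE_1 x EXT_SOURCE_2'] = (
    interactions['EXT_SOURCE_1'] * interactions['EXT_SOURCE_2']
)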
# Extract several strongly correlated features plus the TARGET column
poly_features_train = data_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = data_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
# Store TARGET separately
poly_target = poly_features_train['TARGET']
# Remove TARGET so that train and test have the same columns
poly_features_train = poly_features_train.drop(columns=['TARGET'])
Missing values are filled with an Imputer before the transformation.
from sklearn.preprocessing import Imputer
# Fill missing values with the median
impt = Imputer(strategy='median')
# fit_transform learns the medians on the training set and applies them;
# note that the result is a plain NumPy array, not a DataFrame
poly_features_train = impt.fit_transform(poly_features_train)
# Apply the medians learned on the training set to the test set
poly_features_test = impt.transform(poly_features_test)
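Note that sklearn.preprocessing.Imputer has been removed from recent scikit-learn releases; an equivalent sketch with the current API (assuming scikit-learn >= 0.22) would be:

from sklearn.impute import SimpleImputer
impt = SimpleImputer(strategy='median')
poly_features_train = impt.fit_transform(poly_features_train)
poly_features_test = impt.transform(poly_features_test)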
poly_features_train
array([[8.30369674e-02, 2.62948593e-01, 1.39375780e-01, 9.46100000e+03],
       [3.11267311e-01, 6.22245775e-01, 5.35276250e-01, 1.67650000e+04],
       [5.05997931e-01, 5.55912083e-01, 7.29566691e-01, 1.90460000e+04],
       ...,
       [7.44026400e-01, 5.35721752e-01, 2.18859082e-01, 1.49660000e+04],
       [5.05997931e-01, 5.14162820e-01, 6.61023539e-01, 1.19610000e+04],
       [7.34459669e-01, 7.08568896e-01, 1.13922396e-01, 1.68560000e+04]])
Call PolynomialFeatures to transform the features.
from sklearn.preprocessing import PolynomialFeatures
# Generate polynomial and interaction terms up to degree 3
poly_transformer = PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features_train)
poly_features_train = poly_transformer.transform(poly_features_train)
poly_features_test = poly_transformer.transform(poly_features_test)
Because the transformation returns a plain array, the original column names are lost and need to be regenerated.
# Call get_feature_names() to see the polynomial terms generated from the input variables when degree = 3
# Note that the order matches the columns used to construct the DataFrame below
poly_transformer.get_feature_names(input_features=['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])
['1', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'EXT_SOURCE_1^2', 'EXT_SOURCE_1 EXT_SOURCE_2', 'EXT_SOURCE_1 EXT_SOURCE_3', 'EXT_SOURCE_1 DAYS_BIRTH', 'EXT_SOURCE_2^2', 'EXT_SOURCE_2 EXT_SOURCE_3', 'EXT_SOURCE_2 DAYS_BIRTH', 'EXT_SOURCE_3^2', 'EXT_SOURCE_3 DAYS_BIRTH', 'DAYS_BIRTH^2', 'EXT_SOURCE_1^3', 'EXT_SOURCE_1^2 EXT_SOURCE_2', 'EXT_SOURCE_1^2 EXT_SOURCE_3', 'EXT_SOURCE_1^2 DAYS_BIRTH', 'EXT_SOURCE_1 EXT_SOURCE_2^2', 'EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3', 'EXT_SOURCE_1 EXT_SOURCE_2 DAYS_BIRTH', 'EXT_SOURCE_1 EXT_SOURCE_3^2', 'EXT_SOURCE_1 EXT_SOURCE_3 DAYS_BIRTH', 'EXT_SOURCE_1 DAYS_BIRTH^2', 'EXT_SOURCE_2^3', 'EXT_SOURCE_2^2 EXT_SOURCE_3', 'EXT_SOURCE_2^2 DAYS_BIRTH', 'EXT_SOURCE_2 EXT_SOURCE_3^2', 'EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH', 'EXT_SOURCE_2 DAYS_BIRTH^2', 'EXT_SOURCE_3^3', 'EXT_SOURCE_3^2 DAYS_BIRTH', 'EXT_SOURCE_3 DAYS_BIRTH^2', 'DAYS_BIRTH^3']
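In newer scikit-learn versions get_feature_names() has been replaced by get_feature_names_out(); a hedged equivalent (assuming scikit-learn >= 1.0) would be:

# Same list of generated feature names with the newer API
feature_names = poly_transformer.get_feature_names_out(
    ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])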
# Convert the arrays back to DataFrames with the generated column names
poly_features_train = pd.DataFrame(
    poly_features_train,
    columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']))
poly_features_test = pd.DataFrame(
    poly_features_test,
    columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']))
poly_features_train.head()
Check how the new features correlate with the target.
# Add the target back to the new feature DataFrame
poly_features_train['TARGET'] = poly_target
# Compute the correlation coefficients with the target
poly_corrs = poly_features_train.corr()['TARGET'].sort_values()
# View the extremes
print('head:\n', poly_corrs.head(5))
print('\ntail:\n', poly_corrs.tail(5))
head:
EXT_SOURCE_2 EXT_SOURCE_3                 -0.193939
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3    -0.189605
EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH      -0.181283
EXT_SOURCE_2^2 EXT_SOURCE_3               -0.176428
EXT_SOURCE_2 EXT_SOURCE_3^2               -0.172282
Name: TARGET, dtype: float64

tail:
DAYS_BIRTH      -0.078239
DAYS_BIRTH^2    -0.076672
DAYS_BIRTH^3    -0.074273
TARGET           1.000000
1                     NaN
Name: TARGET, dtype: float64
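As a quick visual check (not one of the original cells; it assumes matplotlib and seaborn are already imported as in the earlier EDA post, and that TARGET has just been added to poly_features_train), the strongest new interaction term can be plotted by target value:

# Distribution of the strongest interaction term, split by repayment outcome
plt.figure(figsize=(8, 4))
sns.kdeplot(poly_features_train.loc[poly_features_train['TARGET'] == 0, 'EXT_SOURCE_2 EXT_SOURCE_3'], label='target == 0')
sns.kdeplot(poly_features_train.loc[poly_features_train['TARGET'] == 1, 'EXT_SOURCE_2 EXT_SOURCE_3'], label='target == 1')
plt.title('Distribution of EXT_SOURCE_2 EXT_SOURCE_3 by Target Value')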
Combining the new features with the original DataFrame
# Use SK_ID_CURR as the join key
poly_features_train['SK_ID_CURR'] = data_train['SK_ID_CURR']
poly_features_test['SK_ID_CURR'] = data_test['SK_ID_CURR']
# The on parameter specifies the key to merge the data sets on
data_train_poly = data_train.merge(poly_features_train, on='SK_ID_CURR', how='left')
data_test_poly = data_test.merge(poly_features_test, on='SK_ID_CURR', how='left')
# Align the columns of the two data sets
data_train_poly, data_test_poly = data_train_poly.align(data_test_poly, join='inner', axis=1)
# View the dimensions
print('Dimensions of training set after polynomial generation: ', data_train_poly.shape)
print('Dimensions of test set after polynomial generation: ', data_test_poly.shape)
Dimensions of training set after polynomial generation:  (307511, 275)
Dimensions of test set after polynomial generation:  (48744, 275)
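For reference, align(join='inner', axis=1) keeps only the columns the two DataFrames have in common; a toy sketch (hypothetical frames, not the competition data) of what it does:

import pandas as pd
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'TARGET': [0, 1]})
b = pd.DataFrame({'x': [5], 'y': [6]})
a, b = a.align(b, join='inner', axis=1)
print(a.columns.tolist())  # ['x', 'y'] -- TARGET is dropped because b lacks it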
2. Constructing new features from domain knowledge
CREDIT_INCOME_PERCENT: Percentage of Credit to Customer Income
ANNUITY_INCOME_PERCENT: Percentage of loan annuity relative to customer income
CREDIT_TERM: Length of the payment term in months (since the annuity is the amount due each month)
DAYS_EMPLOYED_PERCENT: Number of days of employment relative to customer age
data_train_domain = data_train.copy()
data_test_domain = data_test.copy()
# Construct the features separately for each set
# Training set
data_train_domain['CREDIT_INCOME_PERCENT'] = data_train_domain['AMT_CREDIT'] / data_train_domain['AMT_INCOME_TOTAL']
data_train_domain['ANNUITY_INCOME_PERCENT'] = data_train_domain['AMT_ANNUITY'] / data_train_domain['AMT_INCOME_TOTAL']
data_train_domain['CREDIT_TERM'] = data_train_domain['AMT_ANNUITY'] / data_train_domain['AMT_CREDIT']
data_train_domain['DAYS_EMPLOYED_PERCENT'] = data_train_domain['DAYS_EMPLOYED'] / data_train_domain['DAYS_BIRTH']
# Test set
data_test_domain['CREDIT_INCOME_PERCENT'] = data_test_domain['AMT_CREDIT'] / data_test_domain['AMT_INCOME_TOTAL']
data_test_domain['ANNUITY_INCOME_PERCENT'] = data_test_domain['AMT_ANNUITY'] / data_test_domain['AMT_INCOME_TOTAL']
data_test_domain['CREDIT_TERM'] = data_test_domain['AMT_ANNUITY'] / data_test_domain['AMT_CREDIT']
data_test_domain['DAYS_EMPLOYED_PERCENT'] = data_test_domain['DAYS_EMPLOYED'] / data_test_domain['DAYS_BIRTH']
plt.figure(figsize=(8, 16))
# Iterate over the new features
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    # Draw each feature in its own subplot
    plt.subplot(4, 1, i + 1)
    # Repaid on time
    sns.kdeplot(data_train_domain.loc[data_train_domain['TARGET'] == 0, feature], label='target == 0')
    # Overdue
    sns.kdeplot(data_train_domain.loc[data_train_domain['TARGET'] == 1, feature], label='target == 1')
    # Labels
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature)
    plt.ylabel('Density')
plt.tight_layout(h_pad=2.5)
data_corr = data_train_domain[['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT', 'TARGET']]
data_corr.corr()['TARGET']
CREDIT_INCOME_PERCENT    -0.007727
ANNUITY_INCOME_PERCENT    0.014265
CREDIT_TERM               0.012704
DAYS_EMPLOYED_PERCENT     0.067955
TARGET                    1.000000
Name: TARGET, dtype: float64
Judging from the correlation coefficients alone, it is hard to say whether these new features will help.
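One way to get a more direct read than raw correlations (not part of the original kernel, just a sketch assuming scikit-learn >= 0.22) is to cross-validate a simple model on the four new features alone and look at the AUC:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

new_feats = ['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']
# Replace any inf produced by the ratios, then fill missing values with the median
X = data_train_domain[new_feats].replace([np.inf, -np.inf], np.nan)
X = SimpleImputer(strategy='median').fit_transform(X)
y = data_train_domain['TARGET']
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='roc_auc', cv=3).mean())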
BASELINE
The processing above was exploratory rather than a complete preprocessing pipeline, so before building a baseline model we still need to run through the full preprocessing steps.
Preprocessing
# Import the relevant modules, including scaling and imputation
from sklearn.preprocessing import MinMaxScaler, Imputer

# Keep the labels for training, then separate TARGET from the features
train_labels = data_train['TARGET']
if 'TARGET' in data_train:
    train = data_train.drop(columns=['TARGET'])
else:
    train = data_train.copy()

# Test set
test = data_test.copy()
# Fill missing values with the median
imputer = Imputer(strategy='median')
# Scale each feature to the 0-1 range
scaler = MinMaxScaler(feature_range=(0, 1))
# Fit both on the training data
imputer.fit(train)
scaler.fit(train)
# Apply the imputer
train = imputer.transform(train)
test = imputer.transform(test)
# Apply the MinMaxScaler
train = scaler.transform(train)
test = scaler.transform(test)
print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
Training data shape:  (307511, 240)
Testing data shape:  (48744, 240)
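The same preprocessing can be written with the current scikit-learn API as a Pipeline. This is an equivalent sketch of the three cells above (assuming scikit-learn >= 0.22), not an additional step:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='median')),   # median imputation
    ('scale', MinMaxScaler(feature_range=(0, 1))),  # scale to 0-1
])
train = preprocess.fit_transform(train)
test = preprocess.transform(test)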
Make a baseline prediction
from sklearn.linear_model import LogisticRegression

# No hyperparameter search, just a small regularization strength
log_reg = LogisticRegression(C=0.0001)
# Train on the preprocessed training data
log_reg.fit(train, train_labels)
LogisticRegression(C=0.0001, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='warn',
                   n_jobs=None, penalty='l2', random_state=None, solver='warn',
                   tol=0.0001, verbose=0, warm_start=False)
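As a quick sanity check (not in the original kernel), we can look at the AUC on the training data itself before submitting:

from sklearn.metrics import roc_auc_score

# AUC of the baseline model on the data it was trained on
train_auc = roc_auc_score(train_labels, log_reg.predict_proba(train)[:, 1])
print('Training AUC: %.4f' % train_auc)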
# predict_proba() returns the probability of each class; take column 1,
# i.e. the predicted probability of eventually being overdue
log_reg_pred = log_reg.predict_proba(test)[:, 1]
# Build the DataFrame to submit
submit = data_test[['SK_ID_CURR']].copy()  # copy to avoid a SettingWithCopyWarning
submit['TARGET'] = log_reg_pred
submit.head()
   SK_ID_CURR    TARGET
0      100001  0.087750
1      100005  0.163957
2      100013  0.110238
3      100028  0.076575
4      100038  0.154924
# Save as csv
submit.to_csv('log_reg_baseline.csv', index=False)