Python imblearn toolbox for solving data imbalance problems: combined sampling, ensemble sampling, and other details


1. Combination of over- and under-sampling

The problem these methods address is that SMOTE can generate noisy samples; the solution is to clean the space resulting from over-sampling.
The main idea is to over-sample with SMOTE first, and then clean the result with Tomek's links or the edited nearest-neighbours method.
The corresponding classes are SMOTETomek and SMOTEENN.

from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)        # SMOTE over-sampling followed by ENN cleaning
X_resampled, y_resampled = smote_enn.fit_resample(X, y)

from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=0)    # SMOTE over-sampling followed by Tomek-links removal
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
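
For context, here is a minimal sketch of how an imbalanced X and y could be produced and how the class distribution changes after combined resampling; the make_classification call and its class weights are illustrative assumptions, not part of the original snippet.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# build a deliberately imbalanced 3-class toy dataset (illustrative parameters)
X, y = make_classification(n_samples=5000, n_classes=3, n_informative=4,
                           weights=[0.01, 0.05, 0.94], random_state=10)
print(Counter(y))              # the majority class dominates
X_resampled, y_resampled = SMOTEENN(random_state=0).fit_resample(X, y)
print(Counter(y_resampled))    # classes are far closer to balanced after resampling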

2. Ensemble of samplers

2.1 Bagging classifier

**Bagging:** generate different subsets of the samples by sampling with replacement, then build a classifier on each subset (the classifier type can be specified).
scikit-learn provides the BaggingClassifier class, but for imbalanced data it does not guarantee that each subset is balanced, so the classification results tend to favor the majority class.
In imblearn, the BalancedBaggingClassifier class resamples each subset before its classifier is trained. Its parameters are the same as sklearn's BaggingClassifier, with two additions, sampling_strategy and replacement, which control how the random under-sampling is performed.

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
balanced_accuracy_score(y_test, y_pred)  # compute the balanced accuracy

2.2 Forest of randomized trees

Use a balanced subset of bootstrap data when building each tree.

from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
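
As in the bagging example, the fitted forest can be evaluated with balanced accuracy; the lines below are a small usage sketch that assumes the same X_test/y_test split as above.

from sklearn.metrics import balanced_accuracy_score
y_pred = brf.predict(X_test)
balanced_accuracy_score(y_test, y_pred)  # balanced accuracy of the balanced random forest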

2.3 Boosting

Boosting trains n weak classifiers on subsets of the dataset, then weights and fuses the n weak classifiers to produce the final classifier.

2.3.1 RUSBoostClassifier

RUSBoostClassifier performs random under-sampling before each boosting iteration.

from imblearn.ensemble import RUSBoostClassifier
rusboost = RUSBoostClassifier(random_state=0)
rusboost.fit(X_train, y_train)

2.3.2 EasyEnsembleClassifier, an ensemble of AdaBoost learners

AdaBoost computes the error rate of each weak classifier, assigns larger weights to incorrectly classified samples and smaller weights to correctly classified samples. Any weak classifier whose accuracy is greater than 0.5 can become a member of the final ensemble, and the higher its accuracy, the larger its weight. EasyEnsembleClassifier trains these AdaBoost learners on balanced bootstrap samples obtained by random under-sampling.

from imblearn.ensemble import EasyEnsembleClassifier
eec = EasyEnsembleClassifier(random_state=0)
eec.fit(X_train, y_train)

3. Miscellaneous samplers

3.1 Custom sampler: FunctionSampler

from imblearn import FunctionSampler
def func(X, y):
    # toy resampling function: simply keep the first 10 samples
    return X[:10], y[:10]
sampler = FunctionSampler(func=func)
X_res, y_res = sampler.fit_resample(X, y)
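
FunctionSampler also accepts a kw_args dictionary that is forwarded to the wrapped function. The sketch below is a hypothetical example of that hook, delegating the actual work to RandomUnderSampler; the function name and its arguments are not part of the original text.

from imblearn import FunctionSampler
from imblearn.under_sampling import RandomUnderSampler

def resample_with(X, y, sampling_strategy='auto'):
    # hypothetical helper: delegate the resampling to RandomUnderSampler
    return RandomUnderSampler(sampling_strategy=sampling_strategy,
                              random_state=0).fit_resample(X, y)

sampler = FunctionSampler(func=resample_with,
                          kw_args={'sampling_strategy': 'auto'})
X_res, y_res = sampler.fit_resample(X, y)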

3.2 Custom generators (balanced mini-batches for TensorFlow and Keras)

3.2.1 Tensorflow generator: imblearn.tensorflow.balanced_batch_generator

import numpy as np
X = X.astype(np.float32)
from imblearn.under_sampling import RandomUnderSampler
from imblearn.tensorflow import balanced_batch_generator
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y, sample_weight=None, sampler=RandomUnderSampler(),
    batch_size=10, random_state=42)

#How to use training_generator and steps_per_epoch:
learning_rate, epochs = 0.01, 10
input_size, output_size = X.shape[1], 3
import tensorflow as tf
def init_weights(shape):
     return tf.Variable(tf.random_normal(shape, stddev=0.01))
def accuracy(y_true, y_pred):
     return np.mean(np.argmax(y_pred, axis=1) == y_true)
 # input and output
data = tf.placeholder("float32", shape=[None, input_size])
targets = tf.placeholder("int32", shape=[None])
# build the model and weights
W = init_weights([input_size, output_size])
b = init_weights([output_size])
out_act = tf.nn.sigmoid(tf.matmul(data, W) + b)
# build the loss, predict, and train operator
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
     logits=out_act, labels=targets)
loss = tf.reduce_sum(cross_entropy)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)
predict = tf.nn.softmax(out_act)
# Initialization of all variables in the graph
init = tf.global_variables_initializer()
with tf.Session() as sess:
     print('Starting training')
     sess.run(init)
     for e in range(epochs):
         for i in range(steps_per_epoch):                 # key part: loop over the balanced batches
             X_batch, y_batch = next(training_generator)  # each mini-batch is balanced by the sampler
             sess.run([train_op, loss], feed_dict={data: X_batch, targets: y_batch})
         # at the end of each epoch, evaluate accuracy on the full training set
         predicts_train = sess.run(predict, feed_dict={data: X})
         print("epoch: {} train accuracy: {:.3f}"
               .format(e, accuracy(y, predicts_train)))

3.2.2 Keras generator: imblearn.keras.balanced_batch_generator

##Define a Logistic Regression Model
import keras
y = keras.utils.to_categorical(y, 3)
model = keras.Sequential()
model.add(keras.layers.Dense(y.shape[1], input_dim=X.shape[1],
                              activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy',
               metrics=['accuracy'])
##imblearn.keras.balanced_batch_generator generates balanced mini-batches
from imblearn.keras import balanced_batch_generator
training_generator, steps_per_epoch = balanced_batch_generator(
     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)
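
##The function form returns both the generator and steps_per_epoch; a usage
##sketch (assuming the same old-style Keras fit_generator API used below):
callback_history = model.fit_generator(generator=training_generator,
                                        steps_per_epoch=steps_per_epoch,
                                        epochs=10, verbose=0)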

##Or use imblearn.keras.BalancedBatchGenerator
from imblearn.keras import BalancedBatchGenerator
training_generator = BalancedBatchGenerator(
     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)
callback_history = model.fit_generator(generator=training_generator,
                                        epochs=10, verbose=0)

4. Metrics

Currently, scikit-learn only provides sklearn.metrics.balanced_accuracy_score for evaluating classifiers on imbalanced data.
imblearn.metrics provides two additional groups of metrics for evaluating classifier quality.

4.1 Sensitivity and specificity metrics

  • Sensitivity: the true positive rate, i.e. recall.
  • Specificity: the true negative rate.

Accordingly, three metrics have been added (a short usage sketch follows the list):

  • sensitivity_specificity_support: outputs sensitivity, specificity, and support
  • sensitivity_score
  • specificity_score
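
A short usage sketch of these three functions; the y_true and y_pred arrays below are made up purely for illustration.

from imblearn.metrics import (sensitivity_specificity_support,
                              sensitivity_score, specificity_score)

y_true = [0, 1, 1, 0, 1, 1]   # illustrative ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1]   # illustrative predictions

print(sensitivity_specificity_support(y_true, y_pred))   # per-class (sensitivity, specificity, support)
print(sensitivity_score(y_true, y_pred))                  # recall of the positive class
print(specificity_score(y_true, y_pred))                  # true negative rate of the positive class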

4.2 Additional metrics specific to imbalanced datasets

Metrics designed specifically for imbalanced data:

  • geometric_mean_score: computes the geometric mean (G-mean), the root of the product of the class-wise sensitivities, as described below:

The geometric mean (G-mean) is the root of the product of class-wise sensitivity. This measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. For binary classification, G-mean is the square root of the product of sensitivity and specificity, i.e. G-mean = sqrt(sensitivity × specificity). For multi-class problems it is a higher root of the product of the sensitivity for each class.

  • make_index_balanced_accuracy: wraps any scoring function and weights it by the index balanced accuracy (see the usage sketch below)
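
A minimal usage sketch of both metrics, reusing illustrative y_true/y_pred arrays like those in the previous example; the alpha value shown is simply the library default.

from imblearn.metrics import geometric_mean_score, make_index_balanced_accuracy

y_true = [0, 1, 1, 0, 1, 1]   # illustrative labels
y_pred = [0, 1, 0, 0, 1, 1]   # illustrative predictions

print(geometric_mean_score(y_true, y_pred))   # G-mean over the classes

# make_index_balanced_accuracy returns a decorator that wraps an existing
# scoring function and weights it by the index balanced accuracy
iba_gmean = make_index_balanced_accuracy(alpha=0.1, squared=True)(geometric_mean_score)
print(iba_gmean(y_true, y_pred))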
