1. Combination of over- and under-sampling
The main problem is that the SMOTE algorithm can generate noisy samples; the solution is to clean the space resulting from over-sampling.
The main idea is to over-sample with SMOTE first, and then clean the result with Tomek's links or the edited nearest-neighbours method.
The corresponding classes are SMOTETomek and SMOTEENN.
```python
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)

from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=0)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
```
2. Ensemble of samplers
2.1 Bagging classifier
**Bagging:** generate different subsets of the data by sampling with replacement, then build a classifier on each subset (the classifier type is given as a parameter).
scikit-learn provides the class BaggingClassifier, but for imbalanced data there is no guarantee that each subset is balanced, so the classification results tend to favour the majority classes.
In imblearn, the class BalancedBaggingClassifier resamples each subset before training each classifier. It takes the same parameters as BaggingClassifier in sklearn, plus two additional ones, sampling_strategy and replacement, which control how the random under-sampling is done.
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score
from imblearn.ensemble import BalancedBaggingClassifier

bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
balanced_accuracy_score(y_test, y_pred)  # compute the balanced accuracy
```
2.2 Forest of randomized trees
BalancedRandomForestClassifier uses a balanced bootstrap sample of the data when building each tree.
```python
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
```
2.3 Boosting
Train n weak classifiers on subsets of the dataset, then weight the n weak classifiers and combine them to produce the final classifier.
2.3.1 RUSBoostClassifier
Randomly under-sample the dataset before each boosting iteration.
```python
from imblearn.ensemble import RUSBoostClassifier
rusboost = RUSBoostClassifier(random_state=0)
rusboost.fit(X_train, y_train)
```
2.3.2 EasyEnsembleClassifier, using AdaBoost
AdaBoost computes the error rate of each weak classifier, assigns larger weights to the incorrectly classified samples and smaller weights to the correctly classified ones. As long as a weak classifier's accuracy is greater than 0.5 it can become a member of the final classifier, and the higher its accuracy, the greater its weight. EasyEnsembleClassifier trains such AdaBoost learners on balanced bootstrap samples.
```python
from imblearn.ensemble import EasyEnsembleClassifier
eec = EasyEnsembleClassifier(random_state=0)
eec.fit(X_train, y_train)
```
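To make the weighting scheme above concrete, here is a minimal sketch of a single AdaBoost round with made-up numbers; it is illustrative only, not the internals of EasyEnsembleClassifier:

```python
import numpy as np

# Minimal sketch of one AdaBoost round (binary labels in {-1, +1});
# illustrative only, not the EasyEnsembleClassifier internals.
y_true = np.array([1, 1, -1, -1, 1])       # true labels
y_weak = np.array([1, -1, -1, -1, 1])      # predictions of one weak classifier
w = np.full(len(y_true), 1 / len(y_true))  # current sample weights

err = np.sum(w[y_weak != y_true])          # weighted error rate of this weak classifier
alpha = 0.5 * np.log((1 - err) / err)      # classifier weight: more accurate -> larger weight

# increase the weights of misclassified samples, decrease those of correct ones
w = w * np.exp(-alpha * y_true * y_weak)
w = w / w.sum()                            # renormalize
print(alpha, w)
```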
3. Miscellaneous samplers
3.1 Custom sampler: FunctionSampler
```python
from imblearn import FunctionSampler

def func(X, y):
    # toy resampling function: keep only the first 10 samples
    return X[:10], y[:10]

sampler = FunctionSampler(func=func)
X_res, y_res = sampler.fit_resample(X, y)
```
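The function above is only a toy. As a slightly more realistic sketch (assuming X and y are NumPy arrays; the helper name is made up), a custom function could randomly under-sample every class down to the size of the smallest one:

```python
import numpy as np
from collections import Counter
from imblearn import FunctionSampler

def balance_by_undersampling(X, y):
    # hypothetical helper: keep a random subset of each class equal in size
    # to the smallest class (a simple random under-sampling scheme)
    rng = np.random.RandomState(0)
    counts = Counter(y)
    n_min = min(counts.values())
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_min, replace=False)
        for label in counts
    ])
    keep.sort()
    return X[keep], y[keep]

sampler = FunctionSampler(func=balance_by_undersampling)
X_res, y_res = sampler.fit_resample(X, y)
```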
3.2 Custom generators (balanced mini-batches for TensorFlow and Keras)
3.2.1 TensorFlow generator: imblearn.tensorflow.balanced_batch_generator
```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.tensorflow import balanced_batch_generator

X = X.astype(np.float32)
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y, sample_weight=None, sampler=RandomUnderSampler(),
    batch_size=10, random_state=42)

# How to use training_generator and steps_per_epoch:
learning_rate, epochs = 0.01, 10
input_size, output_size = X.shape[1], 3

import tensorflow as tf

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

def accuracy(y_true, y_pred):
    return np.mean(np.argmax(y_pred, axis=1) == y_true)

# input and output placeholders
data = tf.placeholder("float32", shape=[None, input_size])
targets = tf.placeholder("int32", shape=[None])

# build the model and weights
W = init_weights([input_size, output_size])
b = init_weights([output_size])
out_act = tf.nn.sigmoid(tf.matmul(data, W) + b)

# build the loss, predict, and train operators
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=out_act, labels=targets)
loss = tf.reduce_sum(cross_entropy)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)
predict = tf.nn.softmax(out_act)

# initialize all variables in the graph
init = tf.global_variables_initializer()

with tf.Session() as sess:
    print('Starting training')
    sess.run(init)
    for e in range(epochs):
        for i in range(steps_per_epoch):
            # draw a balanced mini-batch from the generator
            X_batch, y_batch = next(training_generator)
            sess.run([train_op, loss],
                     feed_dict={data: X_batch, targets: y_batch})
        # for each epoch, report the accuracy on the training set
        predicts_train = sess.run(predict, feed_dict={data: X})
        print("epoch: {} train accuracy: {:.3f}"
              .format(e, accuracy(y, predicts_train)))
```
3.2.2 Keras generator
```python
import keras
from imblearn.under_sampling import RandomUnderSampler

# define a logistic regression model
y = keras.utils.to_categorical(y, 3)
model = keras.Sequential()
model.add(keras.layers.Dense(y.shape[1], input_dim=X.shape[1],
                             activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

# keras.balanced_batch_generator generates balanced mini-batches
from imblearn.keras import balanced_batch_generator
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)

# or use keras.BalancedBatchGenerator
from imblearn.keras import BalancedBatchGenerator
training_generator = BalancedBatchGenerator(
    X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42)
callback_history = model.fit_generator(generator=training_generator,
                                       epochs=10, verbose=0)
```
4. Metrics
Currently, scikit-learn only provides sklearn.metrics.balanced_accuracy_score as a metric for imbalanced data.
imblearn.metrics provides two further groups of metrics for evaluating the quality of classifiers.
4.1 Sensitivity and specificity metrics
- Sensitivity: true positive rate, i.e. recall.
- Specificity: true negative rate.
Hence three metrics are provided (see the usage sketch after this list):
- sensitivity_specificity_support: outputs the sensitivity, specificity, and support for each class
- sensitivity_score
- specificity_score
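A short usage sketch of these three functions, assuming y_test and y_pred come from an earlier train/test split:

```python
from imblearn.metrics import (sensitivity_specificity_support,
                              sensitivity_score, specificity_score)

# per-class sensitivity, specificity, and support
sen, spe, sup = sensitivity_specificity_support(y_test, y_pred, average=None)

# single macro-averaged scores
print(sensitivity_score(y_test, y_pred, average='macro'))
print(specificity_score(y_test, y_pred, average='macro'))
```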
4.2 Additional metrics specific to imbalanced datasets
Metrics designed specifically for imbalanced data:
- geometric_mean_score: computes the geometric mean (G-mean), the root of the product of the class-wise sensitivities. This metric tries to maximize the accuracy on each class while keeping these accuracies balanced. For binary classification the G-mean is the square root of the product of sensitivity and specificity; for multi-class problems it is the higher root of the product of the per-class sensitivities.
- make_index_balanced_accuracy: balances any scoring function using the index balanced accuracy.
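For example, a sketch assuming y_test and y_pred from a previous split (the alpha value here is arbitrary):

```python
from imblearn.metrics import geometric_mean_score, make_index_balanced_accuracy

# G-mean of the class-wise sensitivities
print(geometric_mean_score(y_test, y_pred))

# wrap the G-mean with the index balanced accuracy (IBA) weighting
iba_gmean = make_index_balanced_accuracy(alpha=0.1, squared=True)(geometric_mean_score)
print(iba_gmean(y_test, y_pred))
```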