Handling data imbalance in multi-label classification: Multi-Label Synthetic Minority Over-sampling Technique (MLSMOTE)

Keywords: Python Machine Learning Deep Learning

In classification problems, class imbalance is something we run into often, both in experiments and in real-world use. It makes prediction harder and typically leads to poor performance on the minority classes, because most machine learning algorithms implicitly assume that the classes (data) are balanced.

Original link to this article MLSMOTE

Classification is a supervised learning technique that assigns target data to categories defined in advance. Most supervised learning methods rely on a formal setting in which data objects are represented as feature vectors and each object is associated with exactly one label from a set of disjoint class labels (targets). There are four main types of classification task:

  1. Binary classification: an instance corresponds to one of two labels. For example, a person is classified as a boy or a girl;
  2. Multi-class classification: the target variable takes more than two different values. For example, a person's age group can be child, youth, middle-aged, elderly, etc.;
  3. Multi-label classification: the target variable has multiple dimensions, and each dimension is binary, that is, it takes only two different values. For example, when classifying film genres, a single film can be both a comedy and a drama;
  4. Multi-dimensional classification: an extension of multi-class classification in which the target variable has several dimensions and each dimension is non-binary.
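
To make the four settings concrete, the sketch below shows what the target array might look like in each case for four samples. The arrays and the meanings attached to the columns (including the "income bracket" dimension) are made up purely for illustration.

```python
import numpy as np

# Binary: one column, two possible values.
y_binary = np.array([0, 1, 1, 0])

# Multi-class: one column, more than two possible values
# (hypothetical coding: 0=child, 1=youth, 2=middle-aged, 3=elderly).
y_multiclass = np.array([0, 3, 1, 2])

# Multi-label: several columns, each binary
# (hypothetical columns: comedy, drama, action).
y_multilabel = np.array([[1, 1, 0],
                         [0, 1, 0],
                         [1, 0, 1],
                         [0, 0, 1]])

# Multi-dimensional: several columns, each with more than two values
# (hypothetical columns: age group, income bracket).
y_multidim = np.array([[0, 2],
                       [3, 1],
                       [1, 0],
                       [2, 2]])

for name, arr in [("binary", y_binary), ("multi-class", y_multiclass),
                  ("multi-label", y_multilabel), ("multi-dimensional", y_multidim)]:
    print(name, arr.shape)
```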

SMOTE

In some classification problems, the number of instances associated with one class is far smaller than that of another class. This data imbalance can severely hurt the performance of machine learning algorithms. The same problem occurs in multi-label classification when the labels are unevenly distributed. Several methods exist to overcome data imbalance, and data augmentation is one of them. In this article, we discuss a common augmentation method for imbalanced multi-label data: the Multi-Label Synthetic Minority Over-sampling Technique (MLSMOTE).

MLSMOTE is one of the most popular and effective data augmentation techniques for multi-label classification. As the name suggests, it is an extension or variant of SMOTE (Synthetic Minority Over-sampling Technique). If you are reading this article, I assume you are already familiar with SMOTE. Here is a brief recap:

  • Select the data to over-sample (usually the data carrying the minority label);
  • Select an instance from that data;
  • Find the k nearest neighbors of this data point;
  • Randomly choose one of these k neighbors, and create a synthetic data point at a random position on the line segment connecting the two points;
  • Repeat this process until the data is balanced.

For more details about SMOTE, please refer to these two articles:
  • SMOTE: Synthetic Minority Over-sampling Technique
  • SMOTE and ADASYN (Handling Imbalanced Data Set)
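
The SMOTE steps listed above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `smote_sketch`, its parameters, and the random toy data are assumptions made for this example, and it generates a fixed number of points instead of checking for balance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority samples X_min (2-D array)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours: when the query points are the training points,
    # the first neighbour returned is normally the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        ref = rng.integers(len(X_min))   # select a minority instance
        nb = rng.choice(idx[ref, 1:])    # select one of its k neighbours
        lam = rng.random()               # position on the segment, in [0, 1)
        synthetic[i] = X_min[ref] + lam * (X_min[nb] - X_min[ref])
    return synthetic

# Toy minority data: 20 random points in 2-D (hypothetical).
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies on the segment between them and never outside the range already covered by the minority data.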

MLSMOTE

In SMOTE, we supply the data and augment it by generating more samples of the same class as the selected reference point. In the multi-label setting this breaks down, because each instance is associated with several labels. A sample that carries a minority label may also carry a majority label, so we must generate labels for the synthetic data as well. In the multi-label setting, majority labels are called head labels and minority labels are called tail labels. MLSMOTE can be divided into the following steps:

  1. Select the data to augment. In multi-label data there are usually several tail labels, so an appropriate criterion is needed to decide which labels count as minority labels;
  2. Once the samples of all tail labels have been selected, generate new feature vectors from the feature vectors of these samples;
  3. Generate a target label set for each newly generated sample based on the labels associated with the data it was derived from.

Minority Instance Selection:

To generate synthetic instances, we need reference points from which to create data, so we must select the tail-label instances before applying any augmentation. To identify the tail labels, F. Charte et al. introduced two measures:

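  • Imbalance ratio per label (IRPL): computed separately for each label l as the ratio between the number of occurrences of the most frequent label and the number of occurrences of l:

    $$\mathrm{IRPL}(l) = \frac{\max_{l' \in L} \sum_{i=1}^{|N|} h(l', Y_i)}{\sum_{i=1}^{|N|} h(l, Y_i)}, \qquad h(l, Y_i) = \begin{cases} 1 & \text{if } l \in Y_i \\ 0 & \text{otherwise} \end{cases}$$

    where $|L|$ and $|N|$ are the number of labels and the number of instances (samples) respectively, and $Y_i$ is the label set of the i-th instance.
  • Mean imbalance ratio (MIR): the average of IRPL over all labels:

    $$\mathrm{MIR} = \frac{1}{|L|} \sum_{l \in L} \mathrm{IRPL}(l)$$

    Every label with IRPL(l) > MIR is regarded as a tail label, and all instances that carry such a label are regarded as minority instances.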
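
As a small worked example of the two measures, the sketch below follows the same computation as the `get_tail_label` function in the code section (count of the most frequent label divided by the count of each label, then the mean). The label matrix `Y` is hypothetical.

```python
import numpy as np

# Hypothetical label matrix: 8 samples (rows) x 3 labels (columns).
Y = np.array([[1, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 0],
              [1, 1, 0]])

counts = Y.sum(axis=0)           # occurrences of each label: 8, 3, 1
irpl = counts.max() / counts     # 8/8 = 1.0, 8/3 = 2.667, 8/1 = 8.0
mir = irpl.mean()                # (1 + 8/3 + 8) / 3 = 3.889
tail = np.where(irpl > mir)[0]   # indices of labels whose IRPL exceeds MIR

# Rows that carry at least one tail label are the minority instances.
minority_rows = np.where(Y[:, tail].any(axis=1))[0]

print(irpl.round(3), round(mir, 3), tail, minority_rows)
```

Here only the third label (index 2) is a tail label, so the single row that carries it is the only minority instance. The second label is rarer than the first but its IRPL stays below the mean, so it is not selected.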

Feature Vector Generation:

This step explains why the algorithm is named MLSMOTE: it uses the same SMOTE procedure to generate the feature vectors of the new data.

Label Set Generation:

Other augmentation techniques for tail labels in multi-label data sets augment only the feature vector and clone the target labels of the reference data point. This completely ignores the information about label correlation. MLSMOTE proposes three different methods that take advantage of the label information of the neighboring points:

  • Intersection: the synthetic data point gets only the labels that appear in the reference data point and in all of its neighbors.
  • Union: the synthetic data point gets every label that appears in the reference data point or in any of its neighbors.
  • Ranking: count how many times each label appears in the reference data point and its neighbors. The synthetic data point gets only those labels that appear in more than half of the instances considered.

Empirical studies show that the ranking method is the most effective.
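
The three strategies differ only in the threshold applied to the per-label counts, which the following sketch tries to make visible. The helper `generate_labels` and the example neighbourhood are hypothetical and are not part of the code listing below.

```python
import numpy as np

def generate_labels(neighbourhood, method="ranking"):
    """Build a label set from a 0/1 array whose rows are the reference
    point and its neighbours (hypothetical helper for illustration)."""
    neighbourhood = np.asarray(neighbourhood)
    counts = neighbourhood.sum(axis=0)   # how often each label appears
    n = neighbourhood.shape[0]           # number of instances considered
    if method == "intersection":
        keep = counts == n               # present in every instance
    elif method == "union":
        keep = counts > 0                # present in at least one instance
    elif method == "ranking":
        keep = counts > n / 2            # present in more than half
    else:
        raise ValueError(f"unknown method: {method}")
    return keep.astype(int)

# Reference point (first row) plus four neighbours, four labels.
labels = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 1],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [1, 1, 0, 0]])

for method in ("intersection", "union", "ranking"):
    print(method, generate_labels(labels, method))
```

With five instances, the ranking rule keeps a label that appears at least three times, which is the same `val > 2` test used in the `MLSMOTE` function below.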

MLSMOTE code (Python)

# -*- coding: utf-8 -*-
# Importing required Library
import numpy as np
import pandas as pd
import random
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def create_dataset(n_sample=1000):
    '''
    Create an unevenly distributed sample data set for multilabel
    classification using the make_classification function

    args
    n_sample: int, number of samples to be created

    return
    X: pandas.DataFrame, feature vector dataframe with 10 features
    y: pandas.DataFrame, target vector dataframe with 5 labels
    '''
    # Pass n_sample through instead of a hard-coded sample count
    X, y = make_classification(n_classes=5, class_sep=2,
                           weights=[0.1,0.025, 0.205, 0.008, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=10, n_clusters_per_class=1, n_samples=n_sample, random_state=10)
    # dtype=int keeps the indicator columns as 0/1 integers (newer pandas defaults to bool)
    y = pd.get_dummies(y, prefix='class', dtype=int)
    return pd.DataFrame(X), y

def get_tail_label(df):
    """
    Give tail label columns of the given target dataframe

    args
    df: pandas.DataFrame, target label df whose tail labels have to be identified

    return
    tail_label: list, a list containing the column names of all the tail labels
    """
    columns = df.columns
    n = len(columns)
    irpl = np.zeros(n)
    for column in range(n):
        # Number of samples carrying this label (columns are 0/1 indicators);
        # sum() works for both integer and boolean columns
        irpl[column] = df[columns[column]].sum()
    # Imbalance ratio per label: count of the most frequent label / count of this label
    irpl = max(irpl)/irpl
    # Mean imbalance ratio over all labels
    mir = np.average(irpl)
    tail_label = []
    for i in range(n):
        if irpl[i] > mir:
            tail_label.append(columns[i])
    return tail_label

def get_index(df):
    """
    Give the index of all rows that carry a tail label

    args
    df: pandas.DataFrame, target label df from which the index of tail label rows has to be identified

    return
    index: list, a list containing the index of all the tail label rows
    """
    tail_labels = get_tail_label(df)
    index = set()
    for tail_label in tail_labels:
        sub_index = set(df[df[tail_label]==1].index)
        index = index.union(sub_index)
    return list(index)

def get_minority_instace(X, y):
    """
    Give minority dataframe containing all the tail labels
    
    args
    X: pandas.DataFrame, the feature vector dataframe
    y: pandas.DataFrame, the target vector dataframe
    
    return
    X_sub: pandas.DataFrame, the feature vector minority dataframe
    y_sub: pandas.DataFrame, the target vector minority dataframe
    """
    index = get_index(y)
    X_sub = X[X.index.isin(index)].reset_index(drop = True)
    y_sub = y[y.index.isin(index)].reset_index(drop = True)
    return X_sub, y_sub

def nearest_neighbour(X):
    """
    Give the indices of the 5 nearest neighbours of every instance

    args
    X: np.array, array whose nearest neighbours have to be found

    return
    indices: np.array, indices of the 5 NN of each element in X
             (the first one is normally the element itself)
    """
    nbs=NearestNeighbors(n_neighbors=5,metric='euclidean',algorithm='kd_tree').fit(X)
    euclidean,indices= nbs.kneighbors(X)
    return indices

def MLSMOTE(X,y, n_sample):
    """
    Give the augmented data using MLSMOTE algorithm

    args
    X: pandas.DataFrame, feature vector dataframe
    y: pandas.DataFrame, target vector dataframe
    n_sample: int, number of newly generated samples

    return
    new_X: pandas.DataFrame, augmented feature vector data
    target: pandas.DataFrame, augmented target vector data
    """
    indices2 = nearest_neighbour(X)
    n = len(indices2)
    new_X = np.zeros((n_sample, X.shape[1]))
    target = np.zeros((n_sample, y.shape[1]))
    for i in range(n_sample):
        # Pick a random reference point and one of its neighbours
        # (column 0 of indices2 is the reference point itself)
        reference = random.randint(0,n-1)
        neighbour = random.choice(indices2[reference,1:])
        all_point = indices2[reference]
        # Ranking: keep a label if it appears in more than half of the 5 points
        nn_df = y[y.index.isin(all_point)]
        ser = nn_df.sum(axis = 0, skipna = True)
        target[i] = np.array([1 if val>2 else 0 for val in ser])
        # Interpolate on the segment from the reference point towards the neighbour
        ratio = random.random()
        gap = X.loc[neighbour,:] - X.loc[reference,:]
        new_X[i] = np.array(X.loc[reference,:] + ratio * gap)
    new_X = pd.DataFrame(new_X, columns=X.columns)
    target = pd.DataFrame(target, columns=y.columns)
    # ignore_index avoids duplicate index labels in the combined dataframes
    new_X = pd.concat([X, new_X], axis=0, ignore_index=True)
    target = pd.concat([y, target], axis=0, ignore_index=True)
    return new_X, target

if __name__=='__main__':
    # Example usage of MLSMOTE
    X, y = create_dataset()                     # Create the sample dataframes
    X_sub, y_sub = get_minority_instace(X, y)   # Get the minority instances of the dataframes
    X_res,y_res =MLSMOTE(X_sub, y_sub, 100)     # Apply MLSMOTE to generate 100 synthetic samples

Posted by jacksonpt on Tue, 12 Oct 2021 19:04:53 -0700