KFold, StratifiedKFold, StratifiedShuffleSplit and GroupKFold differences, and StratifiedGroupKFold

Keywords: Machine Learning sklearn

5. Differences between KFold, StratifiedKFold, StratifiedShuffleSplit and GroupKFold, and an implementation of StratifiedGroupKFold

In machine learning we usually don't train on the whole dataset directly; instead we use cross validation. The added randomness reduces noise and overfitting, so we can extract more comprehensive information from limited data and obtain a model with strong generalization ability. In sklearn, KFold, StratifiedKFold, StratifiedShuffleSplit and GroupKFold are commonly used. Let's explain the differences one by one, using the simple df shown below (in practice, n_splits is generally 5 or 10).

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold,\
            StratifiedShuffleSplit, GroupKFold, GroupShuffleSplit
          
          
df2 = pd.DataFrame([[6.5, 1, 2],
            [8, 1, 0],
            [61, 2, 1],
            [54, 0, 1],
            [78, 0, 1],
            [119, 2, 2],
            [111, 1, 2],
            [23, 0, 0],
            [31, 2, 0]], columns=['h', 'w', 'class'])
df2
       h  w  class
0    6.5  1      2
1    8.0  1      0
2   61.0  2      1
3   54.0  0      1
4   78.0  0      1
5  119.0  2      2
6  111.0  1      2
7   23.0  0      0
8   31.0  2      0

1. Using KFold

X = df2.drop(['class'], axis=1)
y = df2['class']
folder = KFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in folder.split(X, y):
    print("KFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
    # print(X.iloc[train_idx], y.iloc[train_idx], '\n', X.iloc[test_idx], y.iloc[test_idx])
===================================================================
KFold Splitting:
Train index: [0 1 3 5 6 8] | test index: [2 4 7]
KFold Splitting:
Train index: [0 2 3 4 7 8] | test index: [1 5 6]
KFold Splitting:
Train index: [1 2 4 5 6 7] | test index: [0 3 8]

Note that the split yields indices into the data. Focus on the test indices for now: the indices produced by each split are not evenly distributed across the class column. For example, the first test fold [2, 4, 7] corresponds to classes 1, 1, 0. The same is true of the train indices: [0, 1, 3, 5, 6, 8] corresponds to 2, 0, 1, 2, 2, 0. This does not meet our requirements in many cases, because we often want each train/validation split to preserve a uniform distribution of the target classes.
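We can verify this quickly by printing the class labels that land in each test fold (reusing the folder object from above; the folds are deterministic because random_state is fixed):

for train_idx, test_idx in folder.split(X, y):
    print('test classes:', y.iloc[test_idx].values)
# test classes: [1 1 0]
# test classes: [0 2 2]
# test classes: [2 1 0]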

Interestingly, try n_splits = 8 or 9: the test folds then have different sizes. With n_splits = 8, for instance, the first test fold has n_samples // n_splits + 1 = 2 indices and the remaining folds have 1 each.

The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples.
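A minimal check of this rule on our 9-sample df (9 % 8 = 1 fold gets the extra sample):

sizes = [len(test_idx) for _, test_idx in
         KFold(n_splits=8, shuffle=True, random_state=2020).split(X)]
print(sizes)   # [2, 1, 1, 1, 1, 1, 1, 1]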


Now we know that KFold cannot split evenly according to the target class. What if the dataset must be split evenly by target class? Then use StratifiedKFold.

2. Using StratifiedKFold

sfolder = StratifiedKFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in sfolder.split(X, y):
    print("StratifiedKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
    
======================================================
StratifiedKFold Splitting:
Train index: [0 3 4 5 7 8] | test index: [1 2 6]
StratifiedKFold Splitting:
Train index: [1 2 3 5 6 8] | test index: [0 4 7]
StratifiedKFold Splitting:
Train index: [0 1 2 4 6 7] | test index: [3 5 8]

This time the first test index is [1, 2, 6], and the train indices can be checked as well: the target classes in each split are now uniform. But sometimes there is more structure in the data. For example, if the feature column w in the df also represents a category, we may want samples of the same category to stay together in one group, just like df.groupby. This can be done with GroupKFold.
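A quick check that each test fold now contains one sample of every class (reusing sfolder from above):

for train_idx, test_idx in sfolder.split(X, y):
    print('test classes:', sorted(y.iloc[test_idx].tolist()))
# test classes: [0, 1, 2]
# test classes: [0, 1, 2]
# test classes: [0, 1, 2]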

3. Using GroupKFold

gfolder = GroupKFold(n_splits=3)
for train_idx, test_idx in gfolder.split(X, y, groups=X['w']):
    print("GroupKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
   
========================================================================
GroupKFold Splitting:
Train index: [0 1 3 4 6 7] | test index: [2 5 8]
GroupKFold Splitting:
Train index: [2 3 4 5 7 8] | test index: [0 1 6]
GroupKFold Splitting:
Train index: [0 1 2 5 6 8] | test index: [3 4 7]

Here the first test index is [2 5 8], whose w column values are all 2; for [0 1 6] they are all 1. The data is thus split by group: each group lands in exactly one test fold. You can also try groups=y, as sketched below.
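Since a group never straddles folds, with groups=y each test fold contains exactly one class (a minimal sketch; the fold order may vary):

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=y):
    print('test classes:', y.iloc[test_idx].values)
# test classes: [2 2 2]
# test classes: [1 1 1]
# test classes: [0 0 0]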

4. Using StratifiedShuffleSplit

StratifiedShuffleSplit is a combination of StratifiedKFold and ShuffleSplit. The biggest difference from StratifiedKFold is that samples can be drawn repeatedly across splits: the first test index below is [1 5 4] and the second is [8 0 4], so index 4 appears in both. It is even possible for two splits to produce identical indices; there is no guarantee that all folds differ. A quick overlap check follows the output below.

shuffle_split = StratifiedShuffleSplit(n_splits=3, random_state=2020, test_size=3)  # test_size must be at least the number of classes, otherwise stratification is impossible
for train_idx, test_idx in shuffle_split.split(X, y):
    print("StratifiedShuffleSplit Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
====================================================================
StratifiedShuffleSplit Splitting:
Train index: [8 2 3 0 6 7] | test index: [1 5 4]
StratifiedShuffleSplit Splitting:
Train index: [3 1 6 2 7 5] | test index: [8 0 4]
StratifiedShuffleSplit Splitting:
Train index: [1 8 2 6 0 4] | test index: [7 3 5]
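The overlap is easy to confirm from the splits shown above:

test_sets = [set(test_idx.tolist()) for _, test_idx in shuffle_split.split(X, y)]
print(test_sets[0] & test_sets[1])   # {4}: index 4 appears in both test sets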

5. Implementing StratifiedGroupKFold

Many real datasets are quite imbalanced. When training requires splitting evenly by both a group column and the target column, StratifiedGroupKFold is the answer; it can be regarded as a combination of GroupKFold and StratifiedKFold.

The following code comes from stratifiedgroupkfold, and the dataset is sklearn's iris. We additionally add an ID column and pass groups=df['ID'], so that y in each resulting train/valid split still follows the distribution of the original dataset.

import numpy as np
import pandas as pd
import random
from sklearn.model_selection import GroupKFold
from collections import Counter, defaultdict
from sklearn.datasets import load_iris

def read_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target

    #Define a new ID column
    list_id = ['A', 'B', 'C', 'D', 'E']
    df['ID'] = np.random.choice(list_id, len(df))

    features = iris.feature_names
    return df, features

df, features = read_data()
print(df.sample(6))
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target ID
133                6.3               2.8                5.1               1.5       2  C
21                 5.1               3.7                1.5               0.4       0  A
84                 5.4               3.0                4.5               1.5       1  A
62                 6.0               2.2                4.0               1.0       1  D
5                  5.4               3.9                1.7               0.4       0  B
132                6.4               2.8                5.6               2.2       2  E
     

A decomposed StratifiedGroupKFold implementation:

def count_y(y, groups):
    """Count the number of each y label within each group."""
    unique_num = np.max(y) + 1
    # If a key does not exist, np.zeros(unique_num) is returned by default
    y_counts_per_group = defaultdict(lambda: np.zeros(unique_num))

    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1

    # defaultdict(<function __main__.<lambda>>,
    # {'A': array([5., 9., 8.]),
    #  'B': array([11., 12., 10.]),
    #  'C': array([13., 8., 8.]),
    #  'D': array([9., 11., 11.]),
    #  'E': array([12., 10., 13.])})
    return y_counts_per_group
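A quick sanity check of count_y on the iris df built above (the exact counts depend on the random ID column, so your numbers will differ):

y_counts = count_y(df['target'], df['ID'])
print(dict(y_counts))
# e.g. {'A': array([5., 9., 8.]), 'B': array([11., 12., 10.]), ...}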

def StratifiedGroupKFold(X, y, groups, features, k, seed=None):
    """
    StratifiedGroupKFold: yields the indices of each train/validation split.
    :param X: dataset X
    :param y: target y
    :param groups: group labels that drive the split
    :param features: feature columns
    :param k: n_splits
    :param seed: random seed
    """
    max_y = np.max(y)
    # Dictionary with the count of each y label per group
    y_counts_per_group = count_y(y, groups)
    gf = GroupKFold(n_splits=k)
    for train_idx, val_idx in gf.split(X, y, groups):
        # Obtain the train/val data and the group (ID) categories they contain
        x_train = X.iloc[train_idx, :]
        # Group categories present in the train split
        id_train = x_train['ID'].unique()
        x_train = x_train[features]

        x_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]
        id_val = x_val['ID'].unique()
        x_val = x_val[features]

        # Count the number of each y label in the training and validation splits
        y_counts_train = np.zeros(max_y + 1)
        y_counts_val = np.zeros(max_y + 1)
        for id in id_train:
            y_counts_train += y_counts_per_group[id]
        for id in id_val:
            y_counts_val += y_counts_per_group[id]

        # Ratio of each y label count to the largest label count in the train split
        numratio_train = y_counts_train / np.max(y_counts_train)
        # Target validation counts: the validation count of the train split's most
        # frequent label, scaled by numratio_train and rounded up
        stratified_count = np.ceil(y_counts_val[np.argmax(y_counts_train)] * numratio_train).astype(int)

        val_idx = np.array([])
        np.random.seed(seed)
        for num in range(max_y + 1):
            val_idx = np.append(val_idx, np.random.choice(y_val[y_val == num].index, stratified_count[num]))
        val_idx = val_idx.astype(int)

        yield train_idx, val_idx

Let's see the effect of the split:

def get_distribution(y_vals):
    """Return the proportion of each y category."""
    y_distribut = Counter(y_vals)
    y_vals_sum = sum(y_distribut.values())
    return [f'{y_distribut[i]/y_vals_sum:.2%}' for i in range(np.max(y_vals) + 1)]

X = df.drop('target', axis=1)
y = df['target']
groups = df['ID']

distribution = [get_distribution(y)]
index = ['all dataset']

# Look at the split
for fold, (train_idx, val_idx) in enumerate(StratifiedGroupKFold(X, y, groups, features, k=3, seed=2020)):
    print(f'Train ID - fold {fold:1d}:{groups[train_idx].unique()}\
       Test ID - fold {fold:1d}:{groups[val_idx].unique()}')

    distribution.append(get_distribution(y[train_idx]))
    index.append(f'train set - fold{fold:1d}')
    distribution.append(get_distribution(y[val_idx]))
    index.append(f'valid set - fold{fold:1d}')
print(pd.DataFrame(distribution, index=index, columns=[f'Label {l}' for l in range(np.max(y) + 1)]))
Train ID - fold 0:['B' 'A' 'C' 'D']   Test ID - fold 0:['E']
Train ID - fold 1:['A' 'D' 'E']       Test ID - fold 1:['B' 'C']
Train ID - fold 2:['B' 'C' 'E']       Test ID - fold 2:['A' 'D']
                   Label 0  Label 1  Label 2
all dataset         33.33%   33.33%   33.33%
train set - fold0   32.48%   31.62%   35.90%
valid set - fold0   33.33%   33.33%   33.33%
train set - fold1   34.44%   33.33%   32.22%
valid set - fold1   33.93%   33.93%   32.14%
train set - fold2   33.33%   35.48%   31.18%
valid set - fold2   33.33%   35.42%   31.25%

General implementation:

def stratified_group_k_fold(X, y, groups, k, seed=None):
    labels_num = np.max(y) + 1
    # Per-group counts of each label, plus the overall label distribution
    y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
    y_distr = Counter()
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
        y_distr[label] += 1

    y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
    groups_per_fold = defaultdict(set)

    def eval_y_counts_per_fold(y_counts, fold):
        # Tentatively add the group's counts to the fold and measure how uneven
        # the label distribution becomes across folds (mean std over labels)
        y_counts_per_fold[fold] += y_counts
        std_per_label = []
        for label in range(labels_num):
            label_std = np.std([y_counts_per_fold[i][label] / y_distr[label] for i in range(k)])
            std_per_label.append(label_std)
        y_counts_per_fold[fold] -= y_counts
        return np.mean(std_per_label)

    groups_and_y_counts = list(y_counts_per_group.items())
    random.Random(seed).shuffle(groups_and_y_counts)

    # Assign groups greedily, starting with the groups whose label counts vary the most
    for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_y_counts_per_fold(y_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        y_counts_per_fold[best_fold] += y_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]

        train_indices = [i for i, g in enumerate(groups) if g in train_groups]
        test_indices = [i for i, g in enumerate(groups) if g in test_groups]

        yield train_indices, test_indices
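A usage sketch on the same iris df (X, y, groups as defined above): the groups stay disjoint between train and test, while each test fold keeps a label distribution close to the overall one:

for fold, (train_idx, test_idx) in enumerate(stratified_group_k_fold(X, y, groups, k=3, seed=2020)):
    assert not set(groups[train_idx]) & set(groups[test_idx])  # no group leaks across the split
    print(f'fold {fold}:', get_distribution(y[test_idx]))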

Let's stop here; we will take a look at sampling methods and handling imbalanced data when we have time.

References

[1] StratifiedKFold vs. KFold vs. StratifiedShuffleSplit

[2] sampling

[3] imbalanced-learn
