5. Differences between KFold, StratifiedKFold, StratifiedShuffleSplit and GroupKFold, and an implementation of StratifiedGroupKFold
In machine learning we don't feed the whole dataset directly into training; instead we use cross validation. Splitting the data adds randomness and reduces noise, which limits overfitting, extracts more information from limited data, and gives the model stronger generalization ability. In sklearn, KFold, StratifiedKFold, StratifiedShuffleSplit and GroupKFold are the splitters used most often. Let's explain the differences one by one with a simple DataFrame, shown below. (In practice n_splits is usually 5 or 10.)
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, \
    StratifiedShuffleSplit, GroupKFold, GroupShuffleSplit

df2 = pd.DataFrame([[6.5, 1, 2], [8, 1, 0], [61, 2, 1],
                    [54, 0, 1], [78, 0, 1], [119, 2, 2],
                    [111, 1, 2], [23, 0, 0], [31, 2, 0]],
                   columns=['h', 'w', 'class'])
df2
       h  w  class
0    6.5  1      2
1    8.0  1      0
2   61.0  2      1
3   54.0  0      1
4   78.0  0      1
5  119.0  2      2
6  111.0  1      2
7   23.0  0      0
8   31.0  2      0
1. Using KFold
X = df2.drop(['class'], axis=1)
y = df2['class']
folder = KFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in folder.split(X, y):
    print("KFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
    # print(X.iloc[train_idx], y.iloc[train_idx], '\n', X.iloc[test_idx], y.iloc[test_idx])
===================================================================
KFold Splitting:
Train index: [0 1 3 5 6 8] | test index: [2 4 7]
KFold Splitting:
Train index: [0 2 3 4 7 8] | test index: [1 5 6]
KFold Splitting:
Train index: [1 2 4 5 6 7] | test index: [0 3 8]
Note that the split yields indices into the data. Focus on the test indices for now: each split is not divided evenly with respect to the class column. For example, the first test index [2, 4, 7] corresponds to classes 1, 1, 0. The same is true of the train index [0, 1, 3, 5, 6, 8], whose classes are 2, 0, 1, 2, 2, 0. In many cases this is not what we want, because we usually want each split's train/valid dataset to have a uniform distribution of the target classes. The small check below makes this visible.
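As a quick sanity check, a minimal sketch reusing the folder splitter and X, y defined above:

# Sketch: print the class labels behind each KFold test split
for train_idx, test_idx in folder.split(X, y):
    print('test classes:', y.iloc[test_idx].values)
# With random_state=2020 the first fold prints [1 1 0] --
# the three classes are not evenly represented.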
Interestingly, try n_splits=8 or 9: you will see that the test indexes of different folds have different sizes. For example, with n_splits=8 the test index of the first fold has size n_samples // n_splits + 1 = 2, while the rest have size 1.
The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples.
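A minimal sketch verifying this rule on our 9-sample DataFrame (no assumptions beyond the X defined above):

# Verify the fold-size rule for n_samples=9, n_splits=8
sizes = [len(test_idx) for _, test_idx in KFold(n_splits=8).split(X)]
print(sizes)   # [2, 1, 1, 1, 1, 1, 1, 1]
# 9 % 8 = 1 fold of size 9 // 8 + 1 = 2; the remaining 7 folds have size 9 // 8 = 1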
Now we know that KFold cannot split evenly according to the target class. What if the dataset must be divided evenly by target class? Then use StratifiedKFold.
2. Using StratifiedKFold
sfolder = StratifiedKFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in sfolder.split(X, y):
    print("StratifiedKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
======================================================
StratifiedKFold Splitting:
Train index: [0 3 4 5 7 8] | test index: [1 2 6]
StratifiedKFold Splitting:
Train index: [1 2 3 5 6 8] | test index: [0 4 7]
StratifiedKFold Splitting:
Train index: [0 1 2 4 6 7] | test index: [3 5 8]
This time the first test index is [1, 2, 6], whose classes are 0, 1, 2 (one sample per class), and the train indexes can be verified the same way: the target classes of the split datasets are uniform (see the sketch below). But some data needs more. For example, if the feature column w in the DataFrame also represents a category, we may want samples with the same value of that column to stay together, just like df.groupby. This can be done with GroupKFold.
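A quick check, in the same sketch style as before and reusing sfolder:

# Sketch: each StratifiedKFold test fold now holds one sample per class
for train_idx, test_idx in sfolder.split(X, y):
    print('test classes:', sorted(y.iloc[test_idx]))
# every fold prints [0, 1, 2]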
3. Using GroupKFold
gfolder = GroupKFold(n_splits=3)
for train_idx, test_idx in gfolder.split(X, y, groups=X['w']):
    print("GroupKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
========================================================================
GroupKFold Splitting:
Train index: [0 1 3 4 6 7] | test index: [2 5 8]
GroupKFold Splitting:
Train index: [2 3 4 5 7 8] | test index: [0 1 6]
GroupKFold Splitting:
Train index: [0 1 2 5 6 8] | test index: [3 4 7]
Here the first test index is [2 5 8], whose w column is 2 for all three rows; for [0 1 6] it is 1, and for [3 4 7] it is 0. In other words, the data is split by group: a group never spans both train and test. You can also try groups=y, as sketched below.
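A minimal sketch of the groups=y variant, reusing gfolder: because the class labels themselves become the groups, each test fold ends up holding all samples of exactly one class.

# Sketch: GroupKFold with groups=y puts one whole class into each test fold
for train_idx, test_idx in gfolder.split(X, y, groups=y):
    print('test classes:', y.iloc[test_idx].values)
# each fold prints a single repeated class, e.g. [2 2 2], [1 1 1], [0 0 0]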
4. Using StratifiedShuffleSplit
StratifiedShuffleSplit is a combination of StratifiedKFold and ShuffleSplit. The biggest difference from StratifiedKFold is that samples can be drawn repeatedly: as the output below shows, the first test index is [1 5 4] and the second is [8 0 4], so index 4 appears in two different test sets. It is therefore possible for two folds to share indexes; there is no guarantee that all folds will be different.
shuffle_split = StratifiedShuffleSplit(n_splits=3, random_state=2020, test_size=3)
# test_size must be at least the number of classes
for train_idx, test_idx in shuffle_split.split(X, y):
    print("StratifiedShuffleSplit Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
====================================================================
StratifiedShuffleSplit Splitting:
Train index: [8 2 3 0 6 7] | test index: [1 5 4]
StratifiedShuffleSplit Splitting:
Train index: [3 1 6 2 7 5] | test index: [8 0 4]
StratifiedShuffleSplit Splitting:
Train index: [1 8 2 6 0 4] | test index: [7 3 5]
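To make the overlap explicit, a small sketch that counts how often each index lands in a test split (reusing shuffle_split from above):

from collections import Counter

# Sketch: count test-set membership across the three splits
test_counts = Counter()
for _, test_idx in shuffle_split.split(X, y):
    test_counts.update(test_idx)
print(test_counts)
# With random_state=2020, indexes 4 and 5 each appear in two test splits,
# while indexes 2 and 6 never appear in any test split.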
5. Implementing StratifiedGroupKFold
Many datasets are highly imbalanced. When training requires splitting evenly according to both a feature column (the groups) and the target column, StratifiedGroupKFold comes into play; it can be regarded as a combination of GroupKFold and StratifiedKFold.
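Side note: recent scikit-learn versions (1.0 and later) ship a StratifiedGroupKFold class directly, so if upgrading is an option the hand-rolled versions below are not strictly needed. A minimal sketch, assuming such a version, on the small df2 from earlier:

# Sketch, assuming scikit-learn >= 1.0, which provides StratifiedGroupKFold
import sklearn.model_selection as ms

sgkf = ms.StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=2020)
for train_idx, test_idx in sgkf.split(X, y, groups=X['w']):
    print('Train index: %s | test index: %s' % (train_idx, test_idx))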
The following code comes from stratifiedgroupkfold; the dataset is sklearn's iris. In addition, an ID column is added so that we can set groups=df['ID'], while the distribution of y in the resulting train/valid splits stays the same as in the original dataset.
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import GroupKFold
from collections import Counter, defaultdict
from sklearn.datasets import load_iris

def read_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    # Define a new ID column
    list_id = ['A', 'B', 'C', 'D', 'E']
    df['ID'] = np.random.choice(list_id, len(df))
    features = iris.feature_names
    return df, features

df, features = read_data()
print(df.sample(6))
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target ID
133                6.3               2.8                5.1               1.5       2  C
21                 5.1               3.7                1.5               0.4       0  A
84                 5.4               3.0                4.5               1.5       1  A
62                 6.0               2.2                4.0               1.0       1  D
5                  5.4               3.9                1.7               0.4       0  B
132                6.4               2.8                5.6               2.2       2  E
StratifiedGroupKFold step-by-step implementation:
def count_y(y, groups):
    """Count the number of each y label inside each group."""
    unique_num = np.max(y) + 1
    # If a key does not exist, np.zeros(unique_num) is returned by default
    y_counts_per_group = defaultdict(lambda: np.zeros(unique_num))
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
    # e.g. defaultdict(<function <lambda>>,
    #                  {'A': array([ 5.,  9.,  8.]),
    #                   'B': array([11., 12., 10.]),
    #                   'C': array([13.,  8.,  8.]),
    #                   'D': array([ 9., 11., 11.]),
    #                   'E': array([12., 10., 13.])})
    return y_counts_per_group

def StratifiedGroupKFold(X, y, groups, features, k, seed=None):
    """
    Yield train/validation indices split by group while keeping the
    label distribution of the validation set close to the original.
    :param X: feature DataFrame
    :param y: target Series
    :param groups: the column whose groups must not be split
    :param features: feature column names
    :param k: n_splits
    :param seed: random seed
    """
    max_y = np.max(y)
    # Per-group counts of each y label
    y_counts_per_group = count_y(y, groups)
    gf = GroupKFold(n_splits=k)
    for train_idx, val_idx in gf.split(X, y, groups):
        # Split into train/val and record which ID groups fall on each side
        x_train = X.iloc[train_idx, :]
        id_train = x_train['ID'].unique()
        x_train = x_train[features]
        x_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]
        id_val = x_val['ID'].unique()
        x_val = x_val[features]
        # Count each y label in the train and validation sets
        y_counts_train = np.zeros(max_y + 1)
        y_counts_val = np.zeros(max_y + 1)
        for id in id_train:
            y_counts_train += y_counts_per_group[id]
        for id in id_val:
            y_counts_val += y_counts_per_group[id]
        # Ratio of each label count to the largest label count in the train set
        numratio_train = y_counts_train / np.max(y_counts_train)
        # Validation samples to keep per label: the validation count of the
        # most frequent train label, scaled by numratio_train, rounded up
        stratified_count = np.ceil(
            y_counts_val[np.argmax(y_counts_train)] * numratio_train).astype(int)
        val_idx = np.array([])
        np.random.seed(seed)  # original had np.random.rand(seed), which does not seed
        for num in range(max_y + 1):
            val_idx = np.append(
                val_idx,
                np.random.choice(y_val[y_val == num].index, stratified_count[num]))
        val_idx = val_idx.astype(int)
        yield train_idx, val_idx
Now look at the resulting splits:
def get_distribution(y_vals):
    """Return the proportion of each y class."""
    y_distribut = Counter(y_vals)
    y_vals_sum = sum(y_distribut.values())
    return [f'{y_distribut[i] / y_vals_sum:.2%}' for i in range(np.max(y_vals) + 1)]

X = df.drop('target', axis=1)
y = df['target']
groups = df['ID']
distribution = [get_distribution(y)]
index = ['all dataset']

# Look at the splits
for fold, (train_idx, val_idx) in enumerate(
        StratifiedGroupKFold(X, y, groups, features, k=3, seed=2020)):
    print(f'Train ID - fold {fold:1d}: {groups[train_idx].unique()} '
          f'Test ID - fold {fold:1d}: {groups[val_idx].unique()}')
    distribution.append(get_distribution(y[train_idx]))
    index.append(f'train set - fold{fold:1d}')
    distribution.append(get_distribution(y[val_idx]))
    index.append(f'valid set - fold{fold:1d}')

# Use a list (not a set) for the column names so they line up with the data
print(pd.DataFrame(distribution, index=index,
                   columns=[f'Label{l:2d}' for l in range(np.max(y) + 1)]))
Train ID - fold 0: ['B' 'A' 'C' 'D'] Test ID - fold 0: ['E']
Train ID - fold 1: ['A' 'D' 'E'] Test ID - fold 1: ['B' 'C']
Train ID - fold 2: ['B' 'C' 'E'] Test ID - fold 2: ['A' 'D']
                   Label 0  Label 1  Label 2
all dataset         33.33%   33.33%   33.33%
train set - fold0   32.48%   31.62%   35.90%
valid set - fold0   33.33%   33.33%   33.33%
train set - fold1   34.44%   33.33%   32.22%
valid set - fold1   33.93%   33.93%   32.14%
train set - fold2   33.33%   35.48%   31.18%
valid set - fold2   33.33%   35.42%   31.25%
For comparison, a second implementation takes a greedy approach: it assigns whole groups to folds one at a time, always choosing the fold that keeps the per-label proportions across folds as even as possible:

def stratified_group_k_fold(X, y, groups, k, seed=None):
    labels_num = np.max(y) + 1
    # Per-group label counts and the global label distribution
    y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
    y_distr = Counter()
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
        y_distr[label] += 1

    y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
    groups_per_fold = defaultdict(set)

    def eval_y_counts_per_fold(y_counts, fold):
        # Tentatively add the group to the fold and measure how uneven
        # the per-label proportions across folds would become
        y_counts_per_fold[fold] += y_counts
        std_per_label = []
        for label in range(labels_num):
            label_std = np.std([y_counts_per_fold[i][label] / y_distr[label]
                                for i in range(k)])
            std_per_label.append(label_std)
        y_counts_per_fold[fold] -= y_counts
        return np.mean(std_per_label)

    groups_and_y_counts = list(y_counts_per_group.items())
    random.Random(seed).shuffle(groups_and_y_counts)

    # Greedily place the most skewed groups first
    for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_y_counts_per_fold(y_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        y_counts_per_fold[best_fold] += y_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]
        train_indices = [i for i, g in enumerate(groups) if g in train_groups]
        test_indices = [i for i, g in enumerate(groups) if g in test_groups]
        yield train_indices, test_indices
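A usage sketch on the same iris DataFrame, reusing X, y, groups and get_distribution from above (the exact percentages depend on the random ID column):

# Sketch: check the label distribution produced by stratified_group_k_fold
distribution = [get_distribution(y)]
index = ['all dataset']
for fold, (train_idx, val_idx) in enumerate(
        stratified_group_k_fold(X, y, groups, k=3, seed=2020)):
    distribution.append(get_distribution(y.iloc[val_idx]))
    index.append(f'valid set - fold{fold:1d}')
print(pd.DataFrame(distribution, index=index,
                   columns=[f'Label{l:2d}' for l in range(np.max(y) + 1)]))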
Let's stop here. We'll look at sampling methods and handling imbalanced data when time permits.
References
[1] StratifiedKFold vs. KFold vs. StratifiedShuffleSplit
[2] sampling
[3] imbalanced-learn