Today I am starting to run the experiments for a distributed machine learning paper, so here is an overview of the datasets commonly used in such papers. Because my research field is distributed machine learning, the list below is biased in that direction; readers working in other areas can treat it as a reference only.

## 1. CV dataset

### (1) FEMNIST

Task: handwritten character recognition

Parameter Description: images covering 62 character classes (10 digits, 26 lowercase and 26 uppercase letters). All images are 28 × 28 pixels (you can optionally resize them to 128 × 128), and there are 805,263 samples in total.

Introduction: FEMNIST stands for Federated Extended MNIST. It is one of the datasets in LEAF, a benchmark for federated learning.

Official website: https://leaf.cmu.edu/

Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018

Acquisition method: download with a script

    wget https://s3.amazonaws.com/nist-srd/SD19/by_class.zip
    wget https://s3.amazonaws.com/nist-srd/SD19/by_write.zip

### (2) EMNIST

Task: handwritten character recognition

Parameter Description: with the byclass split, it contains images of 62 character classes (10 digits, 26 lowercase and 26 uppercase letters), with an uneven number of samples per class. All images are 28 × 28 pixels, and there are 814,255 samples in total.

Introduction: EMNIST stands for Extended MNIST; it is an extended version of the MNIST dataset.

Official website: https://www.nist.gov/itl/products-and-services/emnist-dataset

Reference: Cohen G, EMNIST: an extension of MNIST to handwritten letters, 2017

Acquisition method: it can be downloaded with a script

    wget https://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip

It can also be loaded directly from torchvision:

    from torchvision.datasets import EMNIST
    from torchvision.transforms import Compose, ToTensor, Normalize

    RAW_DATA_PATH = './rawdata'

    transform = Compose([
        ToTensor(),
        Normalize((0.1307,), (0.3081,))
    ])

    dataset = EMNIST(
        root=RAW_DATA_PATH,
        split="byclass",
        download=True,
        train=True,
        transform=transform
    )
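The 62 byclass labels are plain integers, so it helps to map them back to characters when inspecting samples. A minimal sketch, assuming the usual byclass ordering (digits first, then uppercase, then lowercase letters); the helper name `label_to_char` is hypothetical, and the ordering should be checked against the mapping file shipped with the dataset before relying on it.

```python
import string

# Assumed byclass label order: 0-9 digits, 10-35 uppercase, 36-61 lowercase.
BYCLASS_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label: int) -> str:
    """Map a byclass label index (0-61) to the character it represents."""
    if not 0 <= label < len(BYCLASS_CHARS):
        raise ValueError(
            f"label must be in [0, {len(BYCLASS_CHARS) - 1}], got {label}")
    return BYCLASS_CHARS[label]
```

For example, `label_to_char(10)` gives the first uppercase letter under this assumed ordering.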

### (3)CIFAR10

Task: image classification

Parameter Description: 10 classes of 32 × 32 color images (airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks), 6,000 images per class, split into 50,000 training images and 10,000 test images.

Introduction: CIFAR-10 is a labeled subset of the 80 Million Tiny Images dataset.

Official website: https://www.cs.toronto.edu/~kriz/cifar.html

Reference: Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009

Acquisition method: available directly from torchvision

    from torchvision.datasets import CIFAR10
    from torchvision.transforms import Compose, ToTensor, Normalize

    RAW_DATA_PATH = './rawdata'

    transform = Compose([
        ToTensor(),
        Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])

    dataset = CIFAR10(
        root=RAW_DATA_PATH,
        download=True,
        train=True,
        transform=transform
    )
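The `Normalize` step above subtracts a per-channel mean and divides by a per-channel standard deviation. A minimal NumPy sketch of that arithmetic, using a random stand-in image instead of a real CIFAR-10 sample; the function name `normalize_chw` is hypothetical and this is an illustration, not torchvision's implementation.

```python
import numpy as np

# Per-channel statistics copied from the transform above (RGB order).
MEAN = np.array([0.4914, 0.4822, 0.4465])
STD = np.array([0.2023, 0.1994, 0.2010])


def normalize_chw(image):
    """Apply (x - mean) / std per channel to a float image of shape (3, H, W)."""
    return (image - MEAN[:, None, None]) / STD[:, None, None]


# Stand-in 32x32 RGB image with values in [0, 1), channel-first layout.
rng = np.random.default_rng(0)
image = rng.random((3, 32, 32))
normalized = normalize_chw(image)
```

A pixel whose value equals the channel mean maps to 0, and the operation can be undone with `normalized * STD[:, None, None] + MEAN[:, None, None]`.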

### (4) CIFAR100

Task: image classification

Parameter Description: 100 classes of 32 × 32 color images (including people, animals, flowers, insects, etc.), 600 images per class, of which 500 are for training and 100 for testing.

Introduction: CIFAR-100 is the sibling of CIFAR-10 and is likewise a labeled subset of the 80 Million Tiny Images dataset.

Official website: https://www.cs.toronto.edu/~kriz/cifar.html

Reference: Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009

Acquisition method: available directly from torchvision

    from torchvision.datasets import CIFAR100
    from torchvision.transforms import Compose, ToTensor, Normalize

    RAW_DATA_PATH = './rawdata'

    transform = Compose([
        ToTensor(),
        # Note: these are the same channel statistics used for CIFAR-10 above
        Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
    ])

    dataset = CIFAR100(
        root=RAW_DATA_PATH,
        download=True,
        train=True,
        transform=transform
    )

## 2. NLP dataset

### (1) Shakespeare

Task: next character prediction

Parameter Description: approximately 4.2 million samples in total (listed as 4,226,158 in the LEAF paper)

Introduction: like FEMNIST, it is one of the datasets in LEAF, the benchmark dedicated to federated learning.

Official website: https://leaf.cmu.edu/

Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018

Acquisition method: download with a script

    wget http://www.gutenberg.org/files/100/old/1994-01-100.zip
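Next-character prediction means the model reads a fixed-length window of text and predicts the character that follows it. A minimal sketch of how such (context, target) pairs could be built from raw text; the function name `make_next_char_pairs` is hypothetical, and the default window of 80 characters is an assumption made for illustration.

```python
def make_next_char_pairs(text, seq_len=80):
    """Slide a window over text and return (context, next_char) pairs."""
    pairs = []
    for start in range(len(text) - seq_len):
        context = text[start:start + seq_len]   # seq_len input characters
        target = text[start + seq_len]          # the character to predict
        pairs.append((context, target))
    return pairs


# Short demo with a 10-character window so the pairs are easy to inspect.
sample = "To be, or not to be, that is the question."
pairs = make_next_char_pairs(sample, seq_len=10)
first_context, first_target = pairs[0]
```

Each pair would still need to be encoded as integer indices over a character vocabulary before being fed to a model.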

## 3. General regression / classification

### (1)Synthetic

Task: classification (binary by default)

Parameter Description: the number of distributed nodes, classes, and dimensions can be customized

Introduction: this dataset provides a way to generate artificial but challenging federated learning datasets. The goal is to make the models on the distributed nodes as independent of each other as possible. The generation process is described in detail. Like FEMNIST, it is one of the datasets in LEAF, the benchmark dedicated to federated learning.

Official website: https://leaf.cmu.edu/

Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018

Acquisition method: the dataset has to be generated manually with the following Python code

    import numpy as np
    from scipy.special import softmax

    NUM_DIM = 10


    class SyntheticDataset:

        def __init__(
                self,
                num_classes=2,
                seed=931231,
                num_dim=NUM_DIM,
                prob_clusters=[0.5, 0.5]):

            np.random.seed(seed)

            self.num_classes = num_classes
            self.num_dim = num_dim
            self.num_clusters = len(prob_clusters)
            self.prob_clusters = prob_clusters

            self.side_info_dim = self.num_clusters

            # Shared tensor that maps cluster-level model info to weights
            self.Q = np.random.normal(
                loc=0.0, scale=1.0,
                size=(self.num_dim + 1, self.num_classes, self.side_info_dim))

            # Diagonal feature covariance with decaying variances
            self.Sigma = np.zeros((self.num_dim, self.num_dim))
            for i in range(self.num_dim):
                self.Sigma[i, i] = (i + 1)**(-1.2)

            self.means = self._generate_clusters()

        def get_task(self, num_samples):
            # Pick a cluster, then generate one task (one node's data) from it
            cluster_idx = np.random.choice(
                range(self.num_clusters), size=None, replace=True,
                p=self.prob_clusters)
            new_task = self._generate_task(
                self.means[cluster_idx], cluster_idx, num_samples)
            return new_task

        def _generate_clusters(self):
            means = []
            for i in range(self.num_clusters):
                loc = np.random.normal(loc=0, scale=1., size=None)
                mu = np.random.normal(loc=loc, scale=1., size=self.side_info_dim)
                means.append(mu)
            return means

        def _generate_x(self, num_samples):
            B = np.random.normal(loc=0.0, scale=1.0, size=None)
            loc = np.random.normal(loc=B, scale=1.0, size=self.num_dim)

            # First column is a constant 1 that acts as the bias feature
            samples = np.ones((num_samples, self.num_dim + 1))
            samples[:, 1:] = np.random.multivariate_normal(
                mean=loc, cov=self.Sigma, size=num_samples)

            return samples

        def _generate_y(self, x, cluster_mean):
            model_info = np.random.normal(
                loc=cluster_mean, scale=0.1, size=cluster_mean.shape)
            w = np.matmul(self.Q, model_info)

            num_samples = x.shape[0]
            noise = np.random.normal(
                loc=0., scale=0.1, size=(num_samples, self.num_classes))
            prob = softmax(np.matmul(x, w) + noise, axis=1)

            y = np.argmax(prob, axis=1)
            return y, w, model_info

        def _generate_task(self, cluster_mean, cluster_id, num_samples):
            x = self._generate_x(num_samples)
            y, w, model_info = self._generate_y(x, cluster_mean)

            # now that we have y, we can remove the bias coeff
            x = x[:, 1:]

            return {'x': x, 'y': y, 'w': w,
                    'model_info': model_info, 'cluster': cluster_id}
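Reading the code above, the generation procedure can be summarised as follows. The symbols are notation introduced here for illustration and are not taken from the LEAF paper: $d$ is `num_dim`, $C$ is `num_classes`, $K$ is the number of clusters, and $p$ is `prob_clusters`. The product $Q\,m$ contracts the last index of $Q$, i.e. $W_{jc} = \sum_k Q_{jck}\, m_k$.

```latex
\begin{aligned}
&\text{Shared across tasks:} && Q \in \mathbb{R}^{(d+1)\times C\times K},\quad Q_{jck}\sim\mathcal{N}(0,1),\quad \Sigma=\operatorname{diag}\!\left(1^{-1.2},\,2^{-1.2},\,\dots,\,d^{-1.2}\right)\\
&\text{Cluster means:} && \ell_k\sim\mathcal{N}(0,1),\quad \mu_k\sim\mathcal{N}(\ell_k\mathbf{1}_K,\ I_K),\quad k=1,\dots,K\\
&\text{Per task (node):} && k\sim\operatorname{Categorical}(p),\quad m\sim\mathcal{N}(\mu_k,\ 0.1^2 I_K),\quad W = Q\,m\in\mathbb{R}^{(d+1)\times C}\\
&\text{Features:} && B\sim\mathcal{N}(0,1),\quad v\sim\mathcal{N}(B\,\mathbf{1}_d,\ I_d),\quad x_i\sim\mathcal{N}(v,\ \Sigma),\quad \tilde{x}_i=(1,\ x_i)\\
&\text{Labels:} && \varepsilon_i\sim\mathcal{N}(0,\ 0.1^2 I_C),\quad y_i=\arg\max_{c}\ \operatorname{softmax}\!\left(W^{\top}\tilde{x}_i+\varepsilon_i\right)_c
\end{aligned}
```

Each call to `get_task` draws a fresh $k$, $m$, $B$, and $v$, so every node gets its own feature distribution and its own weight matrix $W$, while $Q$ and $\Sigma$ stay fixed.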