Today I am starting to run experiments for distributed machine learning papers, so I will first introduce the datasets that commonly appear in them. (Since my research field is distributed machine learning, the list below is biased toward that area; readers working in other directions should treat it as a reference only.)
1. CV dataset
(1)FEMNIST
Task: handwritten character recognition
Parameter Description: images of 62 different character classes (10 digits, 26 lowercase, and 26 uppercase letters); all images are 28 × 28 pixels (optionally convertible to 128 × 128); 805,263 samples in total.
Introduction: FEMNIST is short for Federated Extended MNIST; it is one of the members of LEAF, a benchmark suite for federated learning.
Official website: https://leaf.cmu.edu/
Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018
Acquisition method: download via script
wget https://s3.amazonaws.com/nist-srd/SD19/by_class.zip
wget https://s3.amazonaws.com/nist-srd/SD19/by_write.zip
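These archives are the raw NIST data; LEAF's preprocessing scripts then partition them into per-user JSON files. Once that step has run, a split can be loaded roughly as follows (a minimal sketch; the file path is an assumption, since LEAF's exact output names depend on the preprocessing options):
import json
import numpy as np

# Assumed path: LEAF's preprocessing writes JSON shards under data/train/
# and data/test/; the exact file name depends on the sampling options used.
with open('./data/train/femnist_train.json') as f:
    data = json.load(f)

# LEAF JSON layout: 'users' lists client ids, and 'user_data' maps each id
# to a dict {'x': [...], 'y': [...]} holding that client's samples.
first_user = data['users'][0]
x = np.array(data['user_data'][first_user]['x'])  # flattened 28*28 = 784 pixels
y = np.array(data['user_data'][first_user]['y'])
print(first_user, x.shape, y.shape)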
(2)EMNIST
Task: handwritten character recognition
Parameter Description: with the byclass split, there are 62 different character classes (10 digits, 26 lowercase, and 26 uppercase letters), and the number of samples per class is unbalanced; all images are 28 × 28 pixels; 814,255 samples in total.
Introduction: EMNIST is short for Extended MNIST; it is an extended version of the MNIST dataset.
Official website: https://www.nist.gov/itl/products-and-services/emnist-dataset
Reference: G. Cohen et al., EMNIST: an extension of MNIST to handwritten letters, 2017
Acquisition method: download via script
wget https://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip
It can also be loaded out of the box via torchvision:
from torchvision.datasets import EMNIST
from torchvision.transforms import Compose, ToTensor, Normalize

RAW_DATA_PATH = './rawdata'

# Standard MNIST-style (mean, std) normalization for single-channel images
transform = Compose([
    ToTensor(),
    Normalize((0.1307,), (0.3081,))
])
dataset = EMNIST(
    root=RAW_DATA_PATH,
    split="byclass",
    download=True,
    train=True,
    transform=transform
)
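From there the dataset drops straight into a standard DataLoader; a quick sanity check (the batch size is an arbitrary choice):
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(images.shape)  # expected: torch.Size([64, 1, 28, 28])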
(3)CIFAR10
Task: image classification
Parameter Description: 32 × 32 color images in 10 classes (animals and vehicles, e.g. birds, cats, ships, and trucks), 6,000 images per class; 50,000 training images and 10,000 test images in total.
Introduction: CIFAR-10 is a labeled subset of the 80 Million Tiny Images dataset.
Official website: https://www.cs.toronto.edu/~kriz/cifar.html
Reference: Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009
Acquisition method:
Load out of the box via torchvision:
from torchvision.datasets import CIFAR10
from torchvision.transforms import Compose, ToTensor, Normalize

RAW_DATA_PATH = './rawdata'

# Commonly used CIFAR-10 per-channel normalization statistics
transform = Compose([
    ToTensor(),
    Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
dataset = CIFAR10(
    root=RAW_DATA_PATH,
    download=True,
    train=True,
    transform=transform
)
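For distributed experiments the training set usually has to be split across clients first. A minimal IID partition sketch (the number of clients here is an arbitrary assumption):
from torch.utils.data import random_split

NUM_CLIENTS = 10  # assumed number of distributed nodes
# 50,000 training images divide evenly into 10 shards of 5,000
client_sets = random_split(dataset, [len(dataset) // NUM_CLIENTS] * NUM_CLIENTS)
print([len(s) for s in client_sets])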
(4)CIFAR100
Task: image classification
Parameter Description: 32 × 32 color images in 100 classes (including people, animals, flowers, insects, etc.), 600 images per class: 500 training and 100 test images per class.
Introduction: CIFAR-10's sibling; it is likewise a labeled subset of the 80 Million Tiny Images dataset.
Official website: https://www.cs.toronto.edu/~kriz/cifar.html
Reference: Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009
Acquisition method:
Load out of the box via torchvision:
from torchvision.datasets import CIFAR100
from torchvision.transforms import Compose, ToTensor, Normalize

RAW_DATA_PATH = './rawdata'

# Note: these are the CIFAR-10 normalization statistics, reused here;
# CIFAR-100's own per-channel statistics are approximately
# (0.5071, 0.4865, 0.4409) / (0.2673, 0.2564, 0.2762).
transform = Compose([
    ToTensor(),
    Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
dataset = CIFAR100(
    root=RAW_DATA_PATH,
    download=True,
    train=True,
    transform=transform
)
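Federated papers often need a non-IID split rather than an IID one. A common recipe is label-skewed partitioning with a Dirichlet prior; the helper below is a hypothetical sketch of that idea (the function name, client count, and alpha are all assumptions; smaller alpha gives a more skewed split):
import numpy as np

def dirichlet_partition(labels, num_clients=10, alpha=0.5, seed=0):
    # Hypothetical helper: for each class, split its sample indices
    # across clients in proportions drawn from Dirichlet(alpha).
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(num_clients)]
    for c in range(labels.max() + 1):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx

# torchvision's CIFAR datasets expose integer labels via .targets
parts = dirichlet_partition(np.array(dataset.targets))
print([len(p) for p in parts])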
2. NLP dataset
(1)Shakespeare
Task: next character prediction
Parameter Description: 4,226,158 samples in total
Introduction: like FEMNIST, it is one of the members of LEAF, the benchmark suite dedicated to federated learning. The raw text is The Complete Works of William Shakespeare, with each speaking role treated as a separate client.
Official website: https://leaf.cmu.edu/
Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018
Acquisition method:
Download via script:
wget http://www.gutenberg.org/files/100/old/1994-01-100.zip
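The zip contains the complete works as a single plain-text file (the file name below is an assumption). LEAF further splits the text by speaking role; the sketch below skips that step and only shows the basic next-character framing, using a sequence length of 80 as in LEAF:
SEQ_LEN = 80  # sequence length used by LEAF for this task

with open('100.txt', encoding='utf-8') as f:  # assumed name of the extracted file
    text = f.read()

# Map each distinct character to an integer id
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}

# Each sample: 80 character ids as input, the following character as the label
xs, ys = [], []
for i in range(0, len(text) - SEQ_LEN, SEQ_LEN):
    xs.append([stoi[c] for c in text[i:i + SEQ_LEN]])
    ys.append(stoi[text[i + SEQ_LEN]])
print(len(xs), len(vocab))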
3. General regression / classification
(1)Synthetic
Task: binary classification (the number of classes is configurable)
Parameter Description: the number of distributed nodes, the number of classes, and the feature dimension can all be customized
Introduction: this dataset provides a procedure for generating artificial yet challenging federated learning datasets. The goal is to make the true models on different distributed nodes as distinct from one another as possible, and the generation process is specified in full detail. Like FEMNIST, it is one of the members of LEAF, the benchmark suite dedicated to federated learning.
Official website: https://leaf.cmu.edu/
Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018
Acquisition method:
The dataset is generated manually with the following Python code:
import numpy as np
from scipy.special import softmax

NUM_DIM = 10

class SyntheticDataset:

    def __init__(
            self,
            num_classes=2,
            seed=931231,
            num_dim=NUM_DIM,
            prob_clusters=[0.5, 0.5]):
        np.random.seed(seed)
        self.num_classes = num_classes
        self.num_dim = num_dim
        self.num_clusters = len(prob_clusters)
        self.prob_clusters = prob_clusters
        self.side_info_dim = self.num_clusters

        # Shared projection from cluster side information to model weights
        self.Q = np.random.normal(
            loc=0.0, scale=1.0,
            size=(self.num_dim + 1, self.num_classes, self.side_info_dim))

        # Diagonal covariance with polynomially decaying variances
        self.Sigma = np.zeros((self.num_dim, self.num_dim))
        for i in range(self.num_dim):
            self.Sigma[i, i] = (i + 1) ** (-1.2)

        self.means = self._generate_clusters()

    def get_task(self, num_samples):
        # Sample a cluster, then generate one node's (x, y) data from it
        cluster_idx = np.random.choice(
            range(self.num_clusters), size=None, replace=True,
            p=self.prob_clusters)
        new_task = self._generate_task(
            self.means[cluster_idx], cluster_idx, num_samples)
        return new_task

    def _generate_clusters(self):
        means = []
        for i in range(self.num_clusters):
            loc = np.random.normal(loc=0, scale=1., size=None)
            mu = np.random.normal(loc=loc, scale=1., size=self.side_info_dim)
            means.append(mu)
        return means

    def _generate_x(self, num_samples):
        B = np.random.normal(loc=0.0, scale=1.0, size=None)
        loc = np.random.normal(loc=B, scale=1.0, size=self.num_dim)
        samples = np.ones((num_samples, self.num_dim + 1))  # first column is the bias term
        samples[:, 1:] = np.random.multivariate_normal(
            mean=loc, cov=self.Sigma, size=num_samples)
        return samples

    def _generate_y(self, x, cluster_mean):
        # Each node's weight matrix w is derived from its cluster's side information
        model_info = np.random.normal(
            loc=cluster_mean, scale=0.1, size=cluster_mean.shape)
        w = np.matmul(self.Q, model_info)
        num_samples = x.shape[0]
        prob = softmax(
            np.matmul(x, w) + np.random.normal(
                loc=0., scale=0.1, size=(num_samples, self.num_classes)),
            axis=1)
        y = np.argmax(prob, axis=1)
        return y, w, model_info

    def _generate_task(self, cluster_mean, cluster_id, num_samples):
        x = self._generate_x(num_samples)
        y, w, model_info = self._generate_y(x, cluster_mean)
        # now that we have y, we can remove the bias coeff
        x = x[:, 1:]
        return {'x': x, 'y': y, 'w': w,
                'model_info': model_info, 'cluster': cluster_id}
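A usage example (the node count and per-node sample counts are arbitrary choices for illustration): each call to get_task draws one node's data from a randomly chosen cluster.
generator = SyntheticDataset(num_classes=2, num_dim=NUM_DIM)
tasks = [generator.get_task(num_samples=100) for _ in range(5)]
for t in tasks:
    print(t['cluster'], t['x'].shape, t['y'].shape)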