Common datasets for distributed machine learning

Keywords: Machine Learning

Today I am starting to run experiments for distributed machine learning papers, so here I will introduce the datasets that appear most often in this literature. (Since my research field is distributed machine learning, the list below is biased toward that area; readers in other fields should treat it as a rough reference only.)

1. CV dataset

(1)FEMNIST

Task: handwritten character recognition
Parameter Description: images of 62 character classes (10 digits, 26 lowercase and 26 uppercase letters). All images are 28×28 pixels (optionally convertible to 128×128); 805,263 samples in total.
Introduction: FEMNIST stands for Federated Extended MNIST; it is one of the members of LEAF, the benchmark suite for federated learning.
Official website: https://leaf.cmu.edu/
Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018
Acquisition method: download with the following script

wget https://s3.amazonaws.com/nist-srd/SD19/by_class.zip
wget https://s3.amazonaws.com/nist-srd/SD19/by_write.zip
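These scripts fetch the raw NIST source data; LEAF's preprocessing pipeline then writes per-user JSON shards. Below is a minimal loading sketch; the directory layout and JSON keys follow LEAF's conventions, and the path is illustrative:

import json
import os

import numpy as np

# Illustrative path: LEAF's preprocessing writes per-split JSON shards
# under data/femnist/data/{train,test}.
TRAIN_DIR = './data/femnist/data/train'

def load_leaf_json(data_dir):
    # Merge every JSON shard in data_dir into one dict of users.
    users, user_data = [], {}
    for fname in sorted(os.listdir(data_dir)):
        if not fname.endswith('.json'):
            continue
        with open(os.path.join(data_dir, fname)) as f:
            shard = json.load(f)
        users.extend(shard['users'])
        user_data.update(shard['user_data'])
    return users, user_data

users, user_data = load_leaf_json(TRAIN_DIR)
# Each user is one "device": x holds flattened 28x28 images,
# y holds integer labels in [0, 62).
first = users[0]
x = np.array(user_data[first]['x']).reshape(-1, 28, 28)
y = np.array(user_data[first]['y'])
print(first, x.shape, y.shape)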

(2)EMNIST

Task: handwritten character recognition
Parameter Description: under the byclass split, 62 character classes (10 digits, 26 lowercase and 26 uppercase letters), with an uneven number of samples per class. All images are 28×28 pixels; 814,255 samples in total.
Introduction: EMNIST stands for Extended MNIST; it is an extended version of the MNIST dataset.
Official website: https://www.nist.gov/itl/products-and-services/emnist-dataset
Reference: G. Cohen et al., EMNIST: an extension of MNIST to handwritten letters, 2017
Acquisition method: download with the following script

wget https://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip

It is also available out of the box from torchvision:

from torchvision.datasets import EMNIST
from torchvision.transforms import Compose, ToTensor, Normalize

RAW_DATA_PATH = './rawdata'

# Standard MNIST-style normalization (mean 0.1307, std 0.3081).
transform = Compose([
    ToTensor(),
    Normalize((0.1307,), (0.3081,))
])

# split="byclass" selects the 62-class split described above.
dataset = EMNIST(
    root=RAW_DATA_PATH,
    split="byclass",
    download=True,
    train=True,
    transform=transform
)
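The returned dataset plugs straight into a standard PyTorch DataLoader, e.g. as a quick sanity check:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 1, 28, 28])
print(labels[:10])   # integer labels in [0, 62) for the byclass split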

(3)CIFAR10

Task: image classification
Parameter Description: 32×32 color images in 10 classes (airplanes, birds, cats, ships, etc.), 6,000 images per class; 50,000 training images and 10,000 test images in total
Introduction: CIFAR-10 is a labeled subset of the 80 Million Tiny Images dataset.
Official website: https://www.cs.toronto.edu/~kriz/cifar.html
Reference: Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009
Acquisition method:
Available out of the box from torchvision

from torchvision.datasets import CIFAR10
from torchvision.transforms import Compose, ToTensor, Normalize

RAW_DATA_PATH = './rawdata'

# Commonly used per-channel mean/std statistics for CIFAR-10.
transform = Compose([
    ToTensor(),
    Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

dataset = CIFAR10(
    root=RAW_DATA_PATH,
    download=True,
    train=True,
    transform=transform
)
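In distributed and federated experiments, CIFAR-10 is usually partitioned across simulated nodes rather than used whole. Below is a minimal sketch of the widely used Dirichlet-based non-IID split; NUM_CLIENTS and ALPHA are illustrative choices, not part of torchvision:

import numpy as np
from torch.utils.data import Subset

NUM_CLIENTS = 10   # illustrative number of simulated nodes
ALPHA = 0.5        # smaller alpha -> more skewed label distributions

labels = np.array(dataset.targets)
rng = np.random.default_rng(0)
client_indices = [[] for _ in range(NUM_CLIENTS)]
for c in range(10):  # CIFAR-10 has 10 classes
    idx = rng.permutation(np.where(labels == c)[0])
    # Give each client a Dirichlet-distributed share of this class.
    proportions = rng.dirichlet(ALPHA * np.ones(NUM_CLIENTS))
    cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
    for i, part in enumerate(np.split(idx, cut_points)):
        client_indices[i].extend(part.tolist())

client_datasets = [Subset(dataset, idx) for idx in client_indices]
print([len(d) for d in client_datasets])  # uneven, label-skewed shards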

(4)CIFAR100

Task: image classification
Parameter Description: 32×32 color images in 100 classes (including people, animals, flowers, insects, etc.), 600 images per class: 500 for training and 100 for testing
Introduction: the sibling of CIFAR-10, and likewise a labeled subset of the 80 Million Tiny Images dataset.
Official website: https://www.cs.toronto.edu/~kriz/cifar.html
Reference: Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009
Acquisition method:
Available out of the box from torchvision

from torchvision.datasets import CIFAR100
from torchvision.transforms import Compose, ToTensor, Normalize

RAW_DATA_PATH = './rawdata'

# Note: these are the CIFAR-10 statistics; CIFAR-100's own per-channel
# stats are roughly (0.5071, 0.4865, 0.4409) / (0.2673, 0.2564, 0.2762).
transform = Compose([
    ToTensor(),
    Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

dataset = CIFAR100(
    root=RAW_DATA_PATH,
    download=True,
    train=True,
    transform=transform
)

2. NLP dataset

(1)Shakespeare

Task: next character prediction
Parameter Description: 4,226,158 samples in total
Introduction: like FEMNIST, it is one of the members of LEAF, the benchmark suite dedicated to federated learning; the text is drawn from the complete works of William Shakespeare, with each speaking role treated as one device.
Official website: https://leaf.cmu.edu/
Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018
Acquisition method: download with the following script

wget http://www.gutenberg.org/files/100/old/1994-01-100.zip
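This only fetches the raw text of the complete works; LEAF's preprocessing scripts then split it by speaking role into per-device data. If you just want next-character training pairs from the raw corpus, a minimal sketch follows (the 80-character context window matches the FedAvg/LEAF setup; the extracted file name is an assumption):

# The file name inside the zip is an assumption; adjust after extracting.
RAW_FILE = './100.txt'
SEQ_LEN = 80  # context window length used in the FedAvg / LEAF setups

with open(RAW_FILE, encoding='utf-8', errors='ignore') as f:
    text = f.read()

# Integer-encode the corpus over its character vocabulary.
vocab = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(vocab)}
encoded = [char2idx[ch] for ch in text]

# Each sample: SEQ_LEN characters of context -> the next character.
def next_char_pairs(seq, seq_len=SEQ_LEN):
    for i in range(len(seq) - seq_len):
        yield seq[i:i + seq_len], seq[i + seq_len]

x0, y0 = next(next_char_pairs(encoded))
print(len(vocab), len(x0), y0)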

3. General regression / classification

(1)Synthetic

Task: binary classification (the number of classes is configurable)
Parameter Description: the number of distributed nodes, the number of classes, and the feature dimension can all be customized
Introduction: this dataset provides a procedure for generating artificial but challenging federated learning data; the goal is to make the true models on different distributed nodes as independent of one another as possible. The generation process is given in full below. Like FEMNIST, it is a member of LEAF, the benchmark suite dedicated to federated learning.
Official website: https://leaf.cmu.edu/
Reference: S. Caldas et al., LEAF: A Benchmark for Federated Settings, 2018
Acquisition method: generate it yourself with the following Python code

import numpy as np
from scipy.special import softmax

NUM_DIM = 10
class SyntheticDataset:

    def __init__(
            self,
            num_classes=2,
            seed=931231,
            num_dim=NUM_DIM,
            prob_clusters=[0.5, 0.5]):

        np.random.seed(seed)

        self.num_classes = num_classes
        self.num_dim = num_dim
        self.num_clusters = len(prob_clusters)
        self.prob_clusters = prob_clusters

        self.side_info_dim = self.num_clusters

        # Q maps a cluster's side information to per-class weight vectors.
        self.Q = np.random.normal(
            loc=0.0, scale=1.0, size=(self.num_dim + 1, self.num_classes, self.side_info_dim))

        # Diagonal covariance with power-law decay across feature dimensions.
        self.Sigma = np.zeros((self.num_dim, self.num_dim))
        for i in range(self.num_dim):
            self.Sigma[i, i] = (i + 1)**(-1.2)

        self.means = self._generate_clusters()

    def get_task(self, num_samples):
        # Sample a cluster, then generate one node's local dataset from it.
        cluster_idx = np.random.choice(
            range(self.num_clusters), size=None, replace=True, p=self.prob_clusters)
        new_task = self._generate_task(self.means[cluster_idx], cluster_idx, num_samples)
        return new_task

    def _generate_clusters(self):
        means = []
        for i in range(self.num_clusters):
            loc = np.random.normal(loc=0, scale=1., size=None)
            mu = np.random.normal(loc=loc, scale=1., size=self.side_info_dim)
            means.append(mu)
        return means

    def _generate_x(self, num_samples):
        B = np.random.normal(loc=0.0, scale=1.0, size=None)
        loc = np.random.normal(loc=B, scale=1.0, size=self.num_dim)

        # First column is a constant 1 acting as the bias term.
        samples = np.ones((num_samples, self.num_dim + 1))
        samples[:, 1:] = np.random.multivariate_normal(
            mean=loc, cov=self.Sigma, size=num_samples)

        return samples

    def _generate_y(self, x, cluster_mean):
        # Perturb the cluster mean to get this node's model, then label by argmax.
        model_info = np.random.normal(loc=cluster_mean, scale=0.1, size=cluster_mean.shape)
        w = np.matmul(self.Q, model_info)

        num_samples = x.shape[0]
        noise = np.random.normal(loc=0., scale=0.1, size=(num_samples, self.num_classes))
        prob = softmax(np.matmul(x, w) + noise, axis=1)

        y = np.argmax(prob, axis=1)
        return y, w, model_info

    def _generate_task(self, cluster_mean, cluster_id, num_samples):
        x = self._generate_x(num_samples)
        y, w, model_info = self._generate_y(x, cluster_mean)

        # now that we have y, we can remove the bias coeff
        x = x[:, 1:]

        return {'x': x, 'y': y, 'w': w, 'model_info': model_info, 'cluster': cluster_id}
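A small usage sketch for reference; the node count and sample sizes are illustrative. One SyntheticDataset object plays the role of the meta-distribution, and each get_task call draws one distributed node's local dataset:

dataset = SyntheticDataset(num_classes=2, num_dim=NUM_DIM)

# Draw local datasets for 5 illustrative nodes, 100 samples each.
tasks = [dataset.get_task(num_samples=100) for _ in range(5)]
for i, t in enumerate(tasks):
    print(f"node {i}: x {t['x'].shape}, y {t['y'].shape}, cluster {t['cluster']}")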
