Learning From Non-IID data
What does IID mean?
Informally, identically distributed means that there are no overall trends: the distribution doesn’t fluctuate, and all items in the sample are drawn from the same probability distribution. Independent means that the sample items are all independent events. In other words, they aren’t connected to each other in any way, so observing one item tells you nothing about another.
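A more technical definition of an IID sample $x_1, \dots, x_n$ is:
Each $x_i \sim D$ for the same distribution $D$ (Identically Distributed)
$p(x_i, x_j) = p(x_i)\,p(x_j)$ for all $i \neq j$ (Independently Distributed)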
Non-IID data in Federated Learning
A statistical model for federated learning involves two levels of sampling: accessing a datapoint requires first sampling a client $i \sim Q$, where $Q$ is the distribution over available clients, and then drawing an example $(x, y) \sim P_i(x, y)$ from that client's local data distribution $P_i$, where $x$ is the features and $y$ is the label.
Non-IID data in federated learning typically refers to differences between $P_i$ and $P_j$ for different clients $i$ and $j$.
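The two-level sampling described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `Q` stands for the distribution over clients, each entry of `clients` stands for one client's local distribution, and the helper names, the Gaussian features, and all numeric values are hypothetical choices made for this example.

```python
import numpy as np

def make_client_sampler(means, label_probs):
    """Hypothetical local distribution: y ~ label_probs, then x ~ N(means[y], 1)."""
    def sampler(rng):
        y = int(rng.choice(len(label_probs), p=label_probs))
        x = float(rng.normal(loc=means[y], scale=1.0))
        return x, y
    return sampler

def sample_federated_example(rng, client_probs, client_samplers):
    """Two-level sampling: first pick a client i ~ Q, then draw (x, y) from that client."""
    i = int(rng.choice(len(client_probs), p=client_probs))
    x, y = client_samplers[i](rng)
    return i, x, y

rng = np.random.default_rng(0)

# Assumed distribution over three available clients.
Q = np.array([0.5, 0.3, 0.2])

# Assumed local distributions; they differ in label proportions (and, for the
# last client, in feature means), so the clients are non-IID with respect to each other.
clients = [
    make_client_sampler(means=[0.0, 3.0], label_probs=[0.9, 0.1]),
    make_client_sampler(means=[0.0, 3.0], label_probs=[0.5, 0.5]),
    make_client_sampler(means=[1.0, 4.0], label_probs=[0.1, 0.9]),
]

draws = [sample_federated_example(rng, Q, clients) for _ in range(5)]
for i, x, y in draws:
    print(f"client={i} x={x:.2f} y={y}")
```

Because each client has its own sampler, two examples drawn from different clients generally follow different distributions, which is exactly the non-IID situation.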
IID sampling of the training data is important because it ensures that the stochastic gradient is an unbiased estimate of the full gradient. Worded differently, having IID data at the clients means that each mini-batch of data used for a client's local update is statistically identical to a sample drawn uniformly (with replacement) from the entire training dataset, which is the union of all local datasets at the clients. In practice, it is unrealistic to assume that the local data on each edge device is always IID. More specifically:
Violations of Independence: If the data are processed in an insufficiently random order (e.g. ordered by collection of devices and/or by time), then independence is violated. Moreover, devices within the same geolocation are likely to have correlated data.
Violations of Identicalness: Because devices are tied to particular geo-regions, the distribution of labels varies across partitions. In addition, different devices (partitions) can hold vastly different amounts of data.
Thus, federated data typically has the following properties:
The data on each node is generated by a distinct distribution.
The number of data points on each node may also vary significantly.
There may be an underlying structure present that captures the relationship amongst nodes and their associated distributions.
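The earlier point about unbiased stochastic gradients can be checked numerically. The sketch below is a toy illustration, not a result from any cited work: it assumes a one-parameter least-squares model and three hypothetical clients whose feature distributions, slopes, and dataset sizes are all invented for the example. It compares the full gradient with the average of IID mini-batch gradients and with each client's purely local gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clients: (feature mean, true slope, number of points).
# Distinct distributions and very different sizes, as in the list above.
client_specs = [(-2.0, 1.0, 200), (0.0, 2.0, 50), (2.0, 3.0, 1000)]
client_data = []
for shift, slope, n in client_specs:
    x = rng.normal(loc=shift, scale=1.0, size=n)
    y = slope * x + rng.normal(scale=0.1, size=n)
    client_data.append((x, y))

def grad(w, x, y):
    """Gradient of 0.5 * mean((w * x - y) ** 2) with respect to the scalar w."""
    return float(np.mean((w * x - y) * x))

w = 0.0  # evaluate all gradients at the same model parameter
x_all = np.concatenate([x for x, _ in client_data])
y_all = np.concatenate([y for _, y in client_data])
full_grad = grad(w, x_all, y_all)

# IID mini-batches: uniform draws with replacement from the union of all data.
batch_size = 32
iid_grads = []
for _ in range(5000):
    idx = rng.integers(0, len(x_all), size=batch_size)
    iid_grads.append(grad(w, x_all[idx], y_all[idx]))
iid_mean = float(np.mean(iid_grads))

# Non-IID: each client computes a gradient on its own data only.
local_grads = [grad(w, x, y) for x, y in client_data]

print("full gradient:       ", full_grad)
print("mean IID mini-batch: ", iid_mean)
print("client-local:        ", local_grads)
```

In this setup the IID mini-batch average lands close to the full gradient, while each client-local gradient is systematically offset from it, which is the bias that non-IID local data introduces into local updates.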
Most empirical work on synthetic non-IID datasets has focused on label distribution skew, where a non-IID dataset is formed by partitioning an existing "flat" dataset based on the labels.
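One common way to build such a label-skewed partition is to sort the examples by label, cut the sorted order into shards, and hand each client a few shards. The sketch below assumes that scheme. The function name `partition_by_label_shards`, its parameters, and the random stand-in labels are hypothetical, and other schemes (for example, sampling per-client label proportions) are equally valid.

```python
import numpy as np

def partition_by_label_shards(labels, num_clients, shards_per_client, rng):
    """Return one index array per client, skewed by label.

    Indices are sorted by label, split into contiguous shards, and each
    client receives `shards_per_client` randomly chosen shards.
    """
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)
    shard_ids = rng.permutation(num_shards)
    partitions = []
    for c in range(num_clients):
        chosen = shard_ids[c * shards_per_client:(c + 1) * shards_per_client]
        partitions.append(np.concatenate([shards[s] for s in chosen]))
    return partitions

rng = np.random.default_rng(0)

# Stand-in labels for a "flat" 10-class dataset (assumed, not real data).
labels = rng.integers(0, 10, size=6000)

parts = partition_by_label_shards(labels, num_clients=10, shards_per_client=2, rng=rng)
classes_per_client = [int(np.unique(labels[p]).size) for p in parts]
print("classes seen by each client:", classes_per_client)
```

Each client ends up with only a handful of the ten classes, whereas a uniformly random split of the same data would give every client roughly all of them.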
Experiment
Existing Works
Some thoughts
References: