Learning From Non-IID data

What does IID mean?

Informally, Identically Distributed means that there are no overall trends–the distribution doesn’t fluctuate and all items in the sample are taken from the same probability distribution. Independent means that the sample items are all independent events. In other words, they aren’t connected to each other in any way.

A more technical definition of an IID statistics is:

  • Each x(i)Dx^{(i)} \sim \mathcal{D} (Identically Distributed)

  • ijp(x(i),x(j))=p(x(i))p(x(j))\forall i \ne j \, \, \, p(x^{(i)}, x^{(j)}) = p(x^{(i)})p(x^{(j)}) (Independently Distributed)

Non-IID data in Federated Learning

A statistical model for federated learning involves two levels of sampling: accessing a datapoint requires first sampling a client iQi \sim \mathcal{Q}, the distribution over available clients, and then drawing an example (x,y)Pi(x,y)(x, y) \sim \mathcal{P_i}(x, y) from that client's local data distribution, where x is the features and y is the label.

Non-IID data in federated learning typically means the differences between Pi\mathcal{P_i} and Pj\mathcal{P_j} for different clients i and j.

The IID sampling of the training data is important to ensure that the stochastic gradient is an unbiased estimate of the full gradient. Worded differently, having IID data at the clients means that each mini-batch of data used for a client's local update is statistically identical to a uniformly drawn sample(with replacement) from the entire training dataset, which is the union of all local datasets at the clients). In practice, it is unrealistic to assume that the local data on each edge device is always IID. More specifically:

  • Violations of Independence: If the data are processed in an insufficiently-random order. (e.g. ordered by collection of devices and/or by time, then independence is violated. Moreover, devices within the same geolocation are likely to have correlated data.

  • Violations of Identicalness: Because devices are tied to particular geo-regions, the distribution of labels varies across partitions. Besides, different devices(partitions) can hold vastly different amounts of data.

Thus,

  • Data on each node being generated by a distinct distribution xtPtx_t \sim P_t

  • The number of data points on each node, ntn_t , may also vary significantly

  • There may be an underlying structure present that captures the relationship amongst nodes and their associated distributions.

Most empirical work on synthetic non-IID datasets have focused on label distribution skew, where a non-IID dataset is formed by partitioning a "flat" existing dataset based on the labels.

Note: 1. It is also important to note that the distribution Q\mathcal{Q} and Pi\mathcal{P_i} may change over time, introducing another dimension of "non-IIDness" 2. For a more detailed classification, please refer to section 3.1 of this recent survey paper.

Experiment

Some recent works show that mose decentralized learning algorithms suffer from major model quality loss (or even divergence) when run on non-IID data partitions. However, it is interesting to note that BSP is robust to Non-IID data.

It is shown that the accuracy may be affected by the exact data distribution, i.e. the skewness of data distribution. More specifically, the skewness can be roughly interpreted as the distance between the data distribution on each client and the population distribution. In addition, such distance can be evaluated with the earth mover's distance(EMD) between distributions. Based on experiment on real-world dataset, the test accuracy falls sharply with respect to EMD beyond certain threshold.

Existing Works

Although several solutions have been proposed to deal with highly skewed non-IID data(e.g. data-sharing and model traveling), they are both somewhat unsatisfactory. For example, some existing works[1, 2] proposes heuristic-based approaches by sharing local device data or create some server-side proxy data. However, these methods may be unrealistic: in addition to imposing burdens on network bandwidth, sending local data to the server violates the key privacy assumption of federated learning, and sending globally-shared proxy data to all devices requires effort to carefully generate or collect such auxiliary data.

Some thoughts

If the edge devices are equipped with the capability to run training on local data, is training a single global model the optimal objective? Of course, a single global model has its benefits. For example, it can provide a model to clients with no data, or to allow manual validation and quality assurance before deployment. Nevertheless, since local training is possible, it becomes feasible for each client to have a customized model. The authors of this paper argue that "Training a customized model can turn the non-IID problem from a bug to a feature, almost literally — since each client has its own model, the client’s identity effectively parameterizes the model, rendering some pathological but degenerate non-IID distributions trivial." However, this approach suffers from the problem of overfitting. Thus, I think that local fine tuning is the most promising technique. It begins with the federated training of a single model, and then deploy that model to all clients, where it is personalized by additional training on the local dataset before inference. (There is one paper from SEC' 19, which worked on similar directions.)

References:

Last updated