Learning From Non-IID data

What does IID mean?

Informally, Identically Distributed means that there are no overall trends: the distribution doesn't fluctuate, and all items in the sample are drawn from the same probability distribution. Independent means that the sample items are all independent events; in other words, they aren't connected to each other in any way.

A more technical definition of an IID sample is:

  • Each $x^{(i)} \sim \mathcal{D}$ (Identically Distributed)

  • $\forall i \ne j,\; p(x^{(i)}, x^{(j)}) = p(x^{(i)})\,p(x^{(j)})$ (Independently Distributed)
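To make the two conditions concrete, here is a small NumPy sketch (illustrative only, not from the original notes) that draws one IID sample and two samples that each violate one of the conditions:

```python
import numpy as np

rng = np.random.default_rng(0)

# IID: every sample comes from the same distribution (identical),
# and each draw does not depend on previous draws (independent).
iid = rng.normal(loc=0.0, scale=1.0, size=1000)

# Violating "identical": the mean drifts over time, so early and
# late samples come from different distributions.
drift = rng.normal(loc=np.linspace(0.0, 3.0, 1000), scale=1.0)

# Violating "independent": each sample is correlated with the
# previous one (an AR(1) process), even though the marginal
# distribution is stationary.
ar1 = np.empty(1000)
ar1[0] = rng.normal()
for t in range(1, 1000):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal(scale=np.sqrt(1 - 0.9**2))

# Compare first/second-half means to see the drift.
print(iid[:500].mean(), iid[500:].mean())      # both ~0
print(drift[:500].mean(), drift[500:].mean())  # ~0.75 vs. ~2.25
```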

Non-IID data in Federated Learning

A statistical model for federated learning involves two levels of sampling: accessing a datapoint requires first sampling a client $i \sim \mathcal{Q}$, the distribution over available clients, and then drawing an example $(x, y) \sim \mathcal{P}_i(x, y)$ from that client's local data distribution, where $x$ is the features and $y$ is the label.

Non-IID data in federated learning typically refers to the differences between $\mathcal{P}_i$ and $\mathcal{P}_j$ for different clients $i$ and $j$.
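A minimal sketch of this two-level sampling model, with made-up client distributions just to ground the notation ($\mathcal{Q}$ uniform over clients, each $\mathcal{P}_i$ a client-specific label skew; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

n_clients = 100
# Q: the distribution over available clients (here, uniform).
client_probs = np.full(n_clients, 1.0 / n_clients)

# P_i: each client's local data distribution. As a stand-in, give
# every client its own label distribution over 10 classes, so that
# P_i != P_j in general (the non-IID case).
label_dists = rng.dirichlet(alpha=np.full(10, 0.5), size=n_clients)

def sample_datapoint():
    # First sample a client i ~ Q ...
    i = rng.choice(n_clients, p=client_probs)
    # ... then draw an example (x, y) ~ P_i. Here x is a dummy
    # feature vector and y is drawn from client i's label skew.
    y = rng.choice(10, p=label_dists[i])
    x = rng.normal(loc=y, scale=1.0, size=4)
    return i, x, y

i, x, y = sample_datapoint()
print(f"client {i} produced label {y}")
```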

IID sampling of the training data is important to ensure that the stochastic gradient is an unbiased estimate of the full gradient. Worded differently, having IID data at the clients means that each mini-batch of data used for a client's local update is statistically identical to a uniformly drawn sample (with replacement) from the entire training dataset, which is the union of all the local datasets at the clients. In practice, it is unrealistic to assume that the local data on each edge device is always IID. More specifically:

  • Violations of Independence: If the data are processed in an insufficiently random order (e.g., ordered by collecting device and/or by time), then independence is violated. Moreover, devices within the same geolocation are likely to have correlated data.

  • Violations of Identicalness: Because devices are tied to particular geo-regions, the distribution of labels varies across partitions. Besides, different devices (partitions) can hold vastly different amounts of data.

Thus,

  • Data on each node is generated by a distinct distribution $x_t \sim P_t$

  • The number of data points on each node, $n_t$, may also vary significantly

  • There may be an underlying structure present that captures the relationship among nodes and their associated distributions.

Most empirical work on synthetic non-IID datasets has focused on label distribution skew, where a non-IID dataset is formed by partitioning a "flat" existing dataset based on the labels.
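For instance, the common sort-and-shard recipe gives each client a couple of label-sorted shards, so every client ends up with only a few distinct labels. A sketch of that recipe (function and parameter names are my own):

```python
import numpy as np

def label_skew_partition(labels, n_clients, shards_per_client=2, seed=0):
    """Partition a 'flat' dataset into non-IID client shards by label.

    Sort example indices by label, cut them into equal-size shards,
    and deal each client a few shards, so most clients end up with
    only one or two distinct labels.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                     # group indices by label
    shards = np.array_split(order, n_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))       # deal shards randomly
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(n_clients)
    ]

# Toy usage: 10-class labels, 100 clients, 2 shards each.
labels = np.random.default_rng(0).integers(0, 10, size=60_000)
parts = label_skew_partition(labels, n_clients=100)
print(len(parts), np.unique(labels[parts[0]]))  # each client sees few labels
```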



Note: (1) It is also important to note that the distributions $\mathcal{Q}$ and $\mathcal{P}_i$ may change over time, introducing another dimension of "non-IIDness". (2) For a more detailed classification, please refer to Section 3.1 of a recent survey paper.

Experiment

Some recent works show that most decentralized learning algorithms suffer from major model quality loss (or even divergence) when run on non-IID data partitions. However, it is interesting to note that BSP (bulk synchronous parallel) is robust to non-IID data.

It has been shown that the accuracy may be affected by the exact data distribution, i.e., the skewness of the data distribution. More specifically, the skewness can be roughly interpreted as the distance between the data distribution on each client and the population distribution. In addition, such distance can be evaluated with the earth mover's distance (EMD) between distributions. Based on experiments on real-world datasets, the test accuracy falls sharply with respect to EMD beyond a certain threshold.
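As a rough illustration of how that skewness could be measured, the sketch below computes the EMD between one client's label histogram and the population's, treating class indices as points on the real line (an illustrative computation, not necessarily the paper's exact setup):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def label_emd(client_labels, all_labels, n_classes=10):
    """EMD between a client's label distribution and the population's."""
    support = np.arange(n_classes)
    client_hist = np.bincount(client_labels, minlength=n_classes) / len(client_labels)
    global_hist = np.bincount(all_labels, minlength=n_classes) / len(all_labels)
    # 1-D Wasserstein distance between the two discrete distributions.
    return wasserstein_distance(support, support, client_hist, global_hist)

all_labels = np.random.default_rng(0).integers(0, 10, size=10_000)
skewed_client = all_labels[all_labels < 2]   # client only holds labels {0, 1}
print(label_emd(skewed_client, all_labels))  # large EMD => highly skewed
```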

Existing Works

Although several solutions have been proposed to deal with highly skewed non-IID data (e.g., data-sharing and model traveling), they are both somewhat unsatisfactory. For example, some existing works [1, 2] propose heuristic-based approaches that share local device data or create some server-side proxy data. However, these methods may be unrealistic: in addition to imposing burdens on network bandwidth, sending local data to the server violates the key privacy assumption of federated learning, and sending globally shared proxy data to all devices requires effort to carefully generate or collect such auxiliary data.
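For concreteness, the data-sharing heuristic boils down to something like the following sketch, where each client's local shard is augmented with a slice of a server-side shared pool (the function name and the alpha parameter are illustrative, not from the papers):

```python
import numpy as np

def mix_in_shared_data(client_indices, shared_pool, alpha=0.1, seed=0):
    """Augment each client's local shard with a slice of a globally
    shared proxy dataset, as in the data-sharing heuristic.

    alpha controls how much shared data each client receives,
    relative to its local shard size.
    """
    rng = np.random.default_rng(seed)
    augmented = []
    for local in client_indices:
        n_shared = int(alpha * len(local))
        extra = rng.choice(shared_pool, size=n_shared, replace=False)
        augmented.append(np.concatenate([local, extra]))
    return augmented

# Toy usage: two clients plus a 500-example pool held on the server.
clients = [np.arange(100), np.arange(100, 300)]
pool = np.arange(59_500, 60_000)
print([len(c) for c in mix_in_shared_data(clients, pool)])  # [110, 220]
```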

Some thoughts

If the edge devices are equipped with the capability to run training on local data, is training a single global model the optimal objective? Of course, a single global model has its benefits. For example, it can provide a model to clients with no data, or allow manual validation and quality assurance before deployment. Nevertheless, since local training is possible, it becomes feasible for each client to have a customized model. The authors of this paper argue that "training a customized model can turn the non-IID problem from a bug to a feature, almost literally: since each client has its own model, the client's identity effectively parameterizes the model, rendering some pathological but degenerate non-IID distributions trivial." However, this approach suffers from the problem of overfitting. Thus, I think that local fine tuning is the most promising technique. It begins with the federated training of a single model, which is then deployed to all clients, where it is personalized by additional training on the local dataset before inference. (There is one paper from SEC '19 that worked in a similar direction.)
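A minimal sketch of that local fine-tuning recipe, with a toy logistic-regression model standing in for the federated global model (everything here is illustrative):

```python
import numpy as np

def sgd_steps(w, X, y, lr=0.1, steps=20):
    """A few steps of logistic-regression SGD (binary labels in {0, 1})."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
w_global = rng.normal(size=5)        # stand-in for the federated global model

# Each client copies the global model, then fine-tunes on its own
# (skewed) local data before running inference.
personalized = {}
for cid in range(3):
    X = rng.normal(loc=cid, size=(50, 5))        # client-specific features
    y = (X.sum(axis=1) > 5 * cid).astype(float)  # client-specific concept
    personalized[cid] = sgd_steps(w_global.copy(), X, y)
```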

References:

- [1] Zhao et al., 2018. Federated Learning with Non-IID Data.
- Hsieh et al., 2019. The Non-IID Data Quagmire of Decentralized Machine Learning.
- [2] Jeong et al., 2018. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data.
- Li et al., 2019. Federated Learning: Challenges, Methods, and Future Directions.
- https://arxiv.org/pdf/1910.00189.pdf