Federated Learning at Scale - Part I


Summary

Federated learning (FL) is a machine learning setting where many clients (e.g., mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g., a service provider), while keeping the training data decentralized. FL enables smarter models, lower latency, less bandwidth usage, and lower power consumption, all while preserving user privacy and the user experience.

Nowadays, much of the data is born decentralized. For example, billions of phones and IoT devices constantly generate data, and that data can be used to build better products. Informally, instead of bringing all the data to a centralized server and training a model there, FL brings the model to where the data lives, trains it locally, and uploads only the update to the server. Local data storage and processing with global coordination is made possible by the emerging technology of mobile edge computing (MEC), where edge nodes, such as sensors, home gateways, micro servers, and small cells, are equipped with storage and computation capability.

Note: FL is a term coined by researchers at Google; research projects at Berkeley use the term shared learning instead.

The Lifecycle of a Model in Federated Learning

The FL process is typically driven by a model engineer developing a model for a particular application. At a high level, a typical workflow is:

  • Problem identification: The model engineer identifies a problem to be solved with FL.

  • Simulation prototyping (optional): The model engineer may prototype model architectures and test hyperparameters in an FL simulation using a proxy dataset or in a real setting.

  • Federated model training: Multiple federated training tasks are started to train different variations of the model or to use different optimization hyperparameters.

  • (Federated) model evaluation: After the tasks have trained sufficiently, the models need to be analyzed. This analysis may include metrics computed on standard datasets in the datacenter, or federated evaluation wherein the models are pushed to held-out clients for evaluation on local client data.

  • Deployment: Finally, once a good model is selected, it goes through a standard model launch process.

Google's System design for FL

We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices which are coordinated by a central server.

The goal is to build a system that can train a deep neural network on data stored on the phone, where the data never leaves the device. The weights are combined in the cloud with Federated Averaging, constructing a global model which is pushed back to the phone for inference.

Device-server protocol

Each round of the protocol contains three phases.

  • Selection: The server (i.e., the Coordinator actor in the server) picks a subset of available devices to work on a specific FL task[1].

  • Configuration: The server sends the FL plan (execution plan) and the FL checkpoint (i.e., a serialized state of a TensorFlow session) with the global model to each of the chosen devices. When a device receives an FL task, its FL runtime is responsible for performing local training.

  • Reporting: The server waits for the participating devices to report updates. The round is considered successful if enough devices report in time. (The update to the model is often sent to the server using encrypted communication.)

As an analogy, we can think of the FL server as the reducer and the FL devices as mappers.

[1] An FL task is a specific computation for an FL population (a learning problem/application), such as training to be performed with given hyperparameters.
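To make the three phases concrete, here is a toy, self-contained simulation of a single round. This is not Google's implementation; the numbers, the scalar "model", and the helper names are invented for illustration. It also shows the success condition that enough devices must report in time.

```python
import random

def run_round(devices, global_model, target=10, min_reports=8):
    """Toy simulation of one protocol round: selection, configuration, reporting.
    Each "device" is a callable that takes the global model and either returns a
    locally updated model or None (simulating a device that fails to report)."""
    # Selection: the coordinator picks a subset of the available devices.
    participants = random.sample(devices, min(target, len(devices)))

    # Configuration: each chosen device receives the plan and the current global
    # model and runs local training (the callable stands in for that work).
    reports = [device(global_model) for device in participants]

    # Reporting: keep only devices that reported back in time; the round only
    # succeeds if enough of them did.
    updates = [u for u in reports if u is not None]
    if len(updates) < min_reports:
        return None                          # round failed; the server retries later
    return sum(updates) / len(updates)       # aggregate, e.g., by averaging

# Toy usage: scalar "models"; each device pulls the model toward its own local
# optimum, and roughly 10% of devices drop out of any given round.
def make_device(local_opt):
    return lambda w: None if random.random() < 0.1 else w + 0.5 * (local_opt - w)

devices = [make_device(random.gauss(3.0, 1.0)) for _ in range(100)]
model = 0.0
for _ in range(20):
    result = run_round(devices, model)
    model = result if result is not None else model
print("global model after 20 rounds:", round(model, 3))
```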

Device

The device maintains a repository of locally collected data for model training and evaluation. Applications are responsible for making their data available to the FL runtime as an example store (e.g., an SQLite database recording action suggestions shown to the user and whether or not these suggestions were accepted). When a task arrives at the device, the FL runtime accesses the appropriate example store to compute model updates.

Two things to note here: 1. We need to avoid any negative impact on the user experience, so the FL runtime only starts a task when the phone is idle, connected to WiFi, charging, etc. 2. FL plans are not specialized to training; they can also encode evaluation tasks.

For other details, such as multi-tenancy and attestation, please refer to the paper.
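As a concrete illustration of the example store mentioned above, here is a minimal sketch using SQLite. The table name, schema, and queries are assumptions for illustration, not the schema used by Google's FL runtime.

```python
import sqlite3

# Applications log (suggestion shown, accepted?) events into a local example
# store; the FL runtime later reads them back as labeled training examples.
conn = sqlite3.connect("example_store.db")
conn.execute("""CREATE TABLE IF NOT EXISTS suggestion_events (
                    ts INTEGER, suggestion TEXT, accepted INTEGER)""")

def log_event(ts, suggestion, accepted):
    # Called by the application whenever it shows a suggestion to the user.
    conn.execute("INSERT INTO suggestion_events VALUES (?, ?, ?)",
                 (ts, suggestion, int(accepted)))
    conn.commit()

def training_examples(limit=1000):
    # Called by the FL runtime when a training task arrives on the device.
    rows = conn.execute(
        "SELECT suggestion, accepted FROM suggestion_events "
        "ORDER BY ts DESC LIMIT ?", (limit,)).fetchall()
    return [(text, bool(label)) for text, label in rows]

log_event(1, "reply: 'sounds good!'", accepted=True)
print(training_examples())
```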

Server

The FL server is designed around the Actor Programming model. The main actors include:

  • Coordinators are top-level actors (one per population) that enable global synchronization and advance rounds in lockstep. The Coordinator receives information about how many devices are connected to each Selector and instructs them how many devices to accept for participation, based on which FL tasks are scheduled.

  • Selectors are responsible for accepting and forwarding device connections. After the Master Aggregator and a set of Aggregators are spawned, the Coordinator instructs the Selectors to forward a subset of their connected devices to the Aggregators, allowing the Coordinator to efficiently allocate devices to FL tasks regardless of how many devices are available.

  • Master Aggregators manage the rounds of each FL task. In order to scale with the number of devices and update size, they make dynamic decisions to spawn one or more Aggregators to which work is delegated.
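A toy sketch of how these roles fit together, using plain Python objects rather than a real actor framework; the class and method names, numbers, and scalar "model" are made up for illustration.

```python
import random

class Selector:
    """Accepts device connections and forwards a subset when instructed."""
    def __init__(self):
        self.connected = []
    def accept(self, device):
        self.connected.append(device)
    def forward(self, n):
        chosen, self.connected = self.connected[:n], self.connected[n:]
        return chosen

class Aggregator:
    """Aggregates the updates from one slice of devices."""
    def run(self, devices, global_model):
        updates = [device(global_model) for device in devices]
        return sum(updates) / len(updates)

class MasterAggregator:
    """Manages one round of an FL task, delegating work to spawned Aggregators."""
    def run_round(self, device_slices, global_model):
        partials = [Aggregator().run(s, global_model) for s in device_slices if s]
        return sum(partials) / len(partials)

class Coordinator:
    """Top-level actor (one per population): decides how many devices each
    Selector should contribute and drives rounds in lockstep."""
    def __init__(self, selectors):
        self.selectors = selectors
    def run_round(self, global_model, per_selector=5):
        slices = [sel.forward(per_selector) for sel in self.selectors]
        return MasterAggregator().run_round(slices, global_model)

# Toy usage: scalar "devices" that nudge the model toward their local value.
selectors = [Selector() for _ in range(3)]
for sel in selectors:
    for _ in range(10):
        local = random.gauss(2.0, 0.5)
        sel.accept(lambda w, local=local: w + 0.3 * (local - w))
print("one round from w=0:", round(Coordinator(selectors).run_round(0.0), 3))
```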

Federated Averaging

FederatedAveraging is a variation of the traditional stochastic gradient descent (SGD) algorithm that combines local SGD on each client with a server that performs model averaging.

At the beginning of each round, a random fraction C of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters). We only select a fraction of clients for efficiency, as the experiments show diminishing returns for adding more clients beyond a certain point. Each selected client then performs local computation based on the global state and its local dataset, and sends an update to the server. The server then applies these updates to its global state, and the process repeats.

In other words, the high-level protocol is:

  1. Workers pull the latest model from the server

  2. Workers compute an update based on their local data

  3. Workers send their local update to the server

  4. The server aggregates these updates (by averaging) to construct the new global model

The amount of computation is controlled by three key parameters: C, the fraction of clients that perform computation on each round; E, the number of training passes each client makes over its local dataset on each round; and B, the local minibatch size used for the client updates.
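A minimal, self-contained sketch of FederatedAveraging on a toy linear-regression problem. Only the structure follows the algorithm described above (select a fraction C of clients, run E local epochs of minibatch SGD with batch size B, then average the returned models weighted by local dataset size); the data, model, and hyperparameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 20 clients, each with slightly shifted (non-identical) local data.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(20):
    X = rng.normal(size=(50, 2)) + rng.normal(scale=1.0, size=2)
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

def client_update(w, X, y, E=5, B=10, lr=0.01):
    """E local epochs of minibatch SGD on one client's data, starting from the
    current global model w. Returns the locally trained model and the data size."""
    w = w.copy()
    n = len(y)
    for _ in range(E):
        for i in range(0, n, B):
            Xb, yb = X[i:i + B], y[i:i + B]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)
            w -= lr * grad
    return w, n

w_global = np.zeros(2)
C = 0.3                                     # fraction of clients per round
for _ in range(100):
    k = max(1, int(C * len(clients)))
    selected = rng.choice(len(clients), size=k, replace=False)
    results = [client_update(w_global, *clients[i]) for i in selected]
    # Server step: weighted average of the returned models (FedAvg aggregation).
    total = sum(n for _, n in results)
    w_global = sum(n * w for w, n in results) / total

print("learned:", np.round(w_global, 3), "target:", true_w)
```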

However, the paper does not provide any theoretical convergence guarantees, and the experiments were not conducted in a real network setting.

Comparison between Parameter Server and FL:

The Federated Learning protocol is very similar to the traditional parameter-server protocol. The differences are:

  1. In the datacenter setting, shared storage is usually used: the worker machines do not keep persistent data on their own, and they fetch data from the shared storage at the beginning of each iteration.

  2. In FL, the data (and thus the loss function) on different clients may be very heterogeneous and far from representative of the joint data (e.g., the data stored on each client may be highly non-IID).

  3. In FL, the server never keeps track of any individual client information and only uses aggregates to ensure privacy.

  4. Because of the high churn in the FL setting, only a small subset of the devices is selected by the server in each round.

Applications

Federated Learning applies best in situations where the on-device data is more relevant than the data that exists on servers (e.g., the devices generate the data in the first place), is privacy-sensitive, or is otherwise undesirable or infeasible to transmit to servers. Google, for example, has applied FL to on-device item ranking, content suggestions for on-device keyboards, and next word prediction.

  • Risk modeling

  • Smart Devices

    • Detecting burglaries within smart homes

  • Healthcare

    • Predicting health events like low blood sugar or heart attack risk from wearable devices

  • Browsers

In general, FL is most appropriate when:

  • On-device data is more relevant than server-side proxy data

  • On-device data is privacy sensitive or large

  • Labels can be inferred naturally from user interaction

Security and Privacy

While FL enables model training on decentralized data, with only the updates sent to the server, an attacker may still be able to extract some information from these updates.

The mitigations for this kind of attack include homomorphic encryption and secure aggregation: encrypting the model updates in such a way that the server can still perform the algebraic operations necessary to combine them, while the updates are never sent in plaintext.

The other mitigation is differential privacy. The basic idea is to add random noise to the individual updates. These updates are going to be accumulated, so the noise should roughly cancel with all the noise other people have added. For example, if we want to compute the average height of people in a class without asking each individual's height, which might be sensitive, we could ask everyone in the class to add a random number (e.g., normally distributed with a mean of 0) to their height and tell us the result. Then, we can compute the average height without learning anyone's individual height.
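A tiny NumPy sketch of both intuitions using the height example (all numbers invented): pairwise masks cancel exactly in the sum, which is the idea behind secure aggregation, and independent zero-mean noise mostly cancels in the average, which is the idea behind the differential-privacy example above. Real secure-aggregation protocols and DP mechanisms are considerably more involved than this.

```python
import numpy as np

rng = np.random.default_rng(1)
heights = rng.normal(170, 10, size=1000)      # the sensitive per-person values

# Secure-aggregation intuition: each adjacent pair of clients agrees on a random
# mask; one adds it, the other subtracts it. Individual masked reports look
# random, but the masks cancel exactly in the sum.
masked = heights.copy()
for i in range(len(heights) - 1):
    m = rng.normal(0, 1000)                   # shared only by clients i and i+1
    masked[i] += m
    masked[i + 1] -= m
print("true sum:", round(heights.sum(), 2), "| masked sum:", round(masked.sum(), 2))

# Differential-privacy intuition: each client adds independent zero-mean noise to
# its own report. Any single report is noisy, but the noise averages out.
noisy = heights + rng.normal(0, 25, size=len(heights))
print("true mean:", round(heights.mean(), 2), "| noisy mean:", round(noisy.mean(), 2))
```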

Advantages

  • Highly efficient use of network bandwidth

    • Less information is required to be transmitted to the cloud.

  • Privacy

    • As described above, the raw data of users need not be sent to the cloud.

    • With stronger privacy guarantees, more users will be willing to take part in collaborative model training, and so better models can be built.

  • Low latency

    • Latency is much lower than when decisions are made in the cloud and then transmitted back to the end devices. This is vital for time-critical applications, such as self-driving cars, in which even slight delays can be life-threatening.

Challenges and Limitations

  • Does it work? And if so, why?

    • We can prove that FL works for linear models and a few other special cases, but for more complicated models such as neural networks we can only validate it empirically, e.g., by training the same model in a non-federated way and checking that the federated model achieves almost the same performance.

  • Security

    • A recent study (Deep Models Under the GAN, Hitaj et al., 2017) shows that a malicious participant may exist in FL and can infer information about other participants from the shared parameters. As such, privacy and security issues in FL need to be considered.

  • Statistical and System heterogeneity

    • In a large and complex mobile edge network, the heterogeneity of participating devices in terms of data quality, computation power, and willingness to participate has to be well managed from the resource-allocation perspective.

    • Problem of non-IID data: In datacenters, workers usually use shared storage; the worker machines do not keep persistent data on their own and fetch it from the shared storage at the beginning of the learning process (or of each iteration). As a result, it is possible to guarantee that the data samples obtained by different workers are IID. In Federated Learning, we cannot make this assumption: the data is typically not IID; that is, a device's local data cannot be regarded as samples drawn from the overall distribution. (A toy illustration follows this list.)

    • See the next post for more detail.

  • Slow, unstable and limited communication:

    • Due to the high dimensionality of model updates and the limited communication bandwidth of participating mobile devices, communication costs remain an issue.

    • If we use mobile phones for Federated Learning, uploads are typically much slower than downloads, and network latency may be high as well. Even worse, the FL system has no control over users' devices: when a phone is turned off or WiFi access is unavailable, the central server loses its connection to that device.

    • One potential mitigation is to select N devices in each round and proceed once the first K responses arrive (K ≤ N).

  • May not work for tree-based algorithms (e.g., decision trees)
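A toy illustration of the non-IID point under "Statistical and System heterogeneity" above (the data and the label-sorted sharding scheme are invented for illustration): when each client holds only a narrow slice of the label space, its local distribution looks nothing like the overall one.

```python
import numpy as np

rng = np.random.default_rng(2)
labels = rng.integers(0, 10, size=10000)      # overall data: 10 balanced classes

# IID partition: shuffle the data, then split it evenly across 20 clients.
iid_clients = np.array_split(rng.permutation(labels), 20)

# Non-IID partition (a common simulation trick): sort by label and hand out
# contiguous shards, so each client sees only one or two classes.
non_iid_clients = np.array_split(np.sort(labels), 20)

def label_dist(client_labels):
    return np.round(np.bincount(client_labels, minlength=10) / len(client_labels), 2)

print("overall distribution: ", label_dist(labels))
print("IID client 0:         ", label_dist(iid_clients[0]))
print("non-IID client 0:     ", label_dist(non_iid_clients[0]))
```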

Related Links:

  • Advances and Open Problems in Federated Learning: https://arxiv.org/pdf/1912.04977.pdf

  • Tensorflow Federated Github Repo

  • Google's blog on Federated Learning (a great introduction)

  • PySyft - a Python library for secure, private Deep Learning

  • Federated Learning for Firefox

  • Damien Desfontaines's blog post on differential privacy

  • Frank McSherry's blog on deep learning and differential privacy

  • IID Statistics: Independent and Identically Distributed Definition and Examples

  • Deep Models Under the GAN - Hitaj et al., 2017