IEEE INFOCOM 2021

Session E-7

Federated Learning 3

Conference

10:00 AM — 11:30 AM EDT

Local

May 13 Thu, 7:00 AM — 8:30 AM PDT

Device Sampling for Heterogeneous Federated Learning: Theory, Algorithms, and Implementation

Su Wang (Purdue University, USA); Mengyuan Lee (Zhejiang University, China); Seyyedali Hosseinalipour (Purdue University, USA); Roberto Morabito (Ericsson Research & Princeton University, Finland); Mung Chiang (Purdue University, USA); Christopher G. Brinton (Purdue University & Zoomi Inc., USA)

0

The conventional federated learning (FedL) architecture distributes machine learning (ML) across worker devices by having them train local models that are periodically aggregated by a server. FedL ignores two important characteristics of contemporary wireless networks, however: (i) the network may contain heterogeneous communication/computation resources, while (ii) there may be significant overlaps in devices' local data distributions. In this work, we develop a novel optimization methodology that jointly accounts for these factors via intelligent device sampling complemented by device-to-device (D2D) offloading. Our optimization aims to select the best combination of sampled nodes and data offloading configuration to maximize FedL training accuracy subject to realistic constraints on the network topology and device capabilities. Theoretical analysis of the D2D offloading subproblem leads to new FedL convergence bounds and an efficient sequential convex optimizer. Using this result, we develop a sampling methodology based on graph convolutional networks (GCNs) which learns the relationship between network attributes, sampled nodes, and resulting offloading that maximizes FedL accuracy. Through evaluation on real-world datasets and network measurements from our IoT testbed, we find that our methodology while sampling less than 5% of all devices outperforms conventional FedL substantially both in terms of trained model accuracy and required resource utilization.

Sample-level Data Selection for Federated Learning

Anran Li, Lan Zhang, Juntao Tan, Yaxuan Qin, Junhao Wang and Xiang-Yang Li (University of Science and Technology of China, China)

0

Federated learning (FL) enables participants to collaboratively construct a global machine learning model without sharing their local training data to the remote server. In FL systems, the selection of training samples has a significant impact on model performances, e.g., selecting participants whose datasets have erroneous samples, skewed categorical distributions, and low content diversity would result in low accuracy and unstable models. In this work, we aim to solve the exigent optimization problem that selects a collection of high-quality training samples for a given FL task under a monetary budget in a privacy-preserving way, which is extremely challenging without visibility to participants' local data and training process. We provide a systematic analysis of important data related factors affecting the model performance and propose a holistic design to privately and efficiently select high-quality data samples considering all these factors. We verify the merits of our proposed solution with extensive experiments on a real AIoT system with 50 clients, including 20 edge computers, 20 laptops, and 10 desktops. The experimental results validates that our solution achieves accurate and efficient selection of high-quality data samples, and consequently an FL model with a faster convergence speed and higher accuracy than that achieved by existing solutions.

An Incentive Mechanism for Cross-Silo Federated Learning: A Public Goods Perspective

Ming Tang and Vincent W.S. Wong (University of British Columbia, Canada)

2

In cross-silo federated learning (FL), organizations cooperatively train a global model with their local data. The organizations, however, may be heterogeneous in terms of their valuation on the precision of the trained global model and their training cost. Meanwhile, the computational and communication resources of the organizations are non-excludable public goods. That is, even if an organization does not perform any local training, other organizations cannot prevent that organization from using the outcome of their resources (i.e., the trained global model). To address the organization heterogeneity and the public goods feature, in this paper, we formulate a social welfare maximization problem and propose an incentive mechanism for cross-silo FL. With the proposed mechanism, organizations can achieve not only social welfare maximization but also individual rationality and budget balance. Moreover, we propose a distributed algorithm that enables organizations to maximize the social welfare without knowing the valuation and cost of each other. Our simulations with MNIST dataset show that the proposed algorithm converges faster than a benchmark method. Furthermore, when organizations have higher valuation on precision, the proposed mechanism and algorithm are more beneficial in the sense that the organizations can achieve higher social welfare through participating in cross-silo FL.

Learning-Driven Decentralized Machine Learning in Resource-Constrained Wireless Edge Computing

Zeyu Meng, Hongli Xu and Min Chen (University of Science and Technology of China, China); Yang Xu (University of Science and Technology of China & School of Computer Science and Technology, China); Yangming Zhao and Chunming Qiao (University at Buffalo, USA)

2

Data generated at the network edge can be processed locally by leveraging the paradigm of edge computing. To fully utilize the widely distributed data, we concentrate on a wireless edge computing system that conducts model training using decentralized peer-to-peer (P2P) methods. However, there are two major challenges on the way towards efficient P2P model training: limited resources (e.g., network bandwidth and battery life of mobile edge devices) and time-varying network connectivity due to device mobility or wireless channel dynamics, which have received less attention in recent years. To address these two challenges, this paper adaptively constructs a dynamic and efficient P2P topology, where model aggregation occurs at the edge devices. In a nutshell, we first formulate the topology construction for P2P learning (TCPL) problem with resource constraints as an integer programming problem. Then a learning-driven method is proposed to adaptively construct a topology at each training epoch. We further give the convergence analysis on training machine learning models even with non-convex loss functions. Extensive simulation results show that our proposed method can improve the model training efficiency by about 11% with resource constraints and reduce the communication cost by about 30% under the same accuracy requirement compared to the benchmarks.

Session Chair

Chuan Wu (The University of Hong Kong)

Session E-8

Distributed ML

Conference

12:00 PM — 1:30 PM EDT

Local

May 13 Thu, 9:00 AM — 10:30 AM PDT

Live Gradient Compensation for Evading Stragglers in Distributed Learning

Jian Xu (Tsinghua University, China); Shao-Lun Huang (Tsinghua-Berkeley Shenzhen Institute, China); Linqi Song (City University of Hong Kong, Hong Kong); Tian Lan (George Washington University, USA)

1

The training efficiency of distributed learning systems is vulnerable to stragglers, namely, those slow worker nodes. A naive strategy is performing the distributed learning by incorporating the fastest K workers and ignoring these stragglers, which may induce high deviation for non-IID data. To tackle this, we develop a Live Gradient Compensation (LGC) strategy to incorporate the one-step delayed gradients from stragglers, aiming to accelerate learning process and utilize the stragglers simultaneously. In LGC framework, mini-batch data are divided into smaller blocks and processed separately, which makes the gradient computed based on partial work accessible. In addition, we provide theoretical convergence analysis of our algorithm for non-convex optimization problem under non-IID training data to show that LGC-SGD has almost the same convergence error as full synchronous SGD. The theoretical results also allow us to quantify a novel tradeoff in minimizing training time and error by selecting the optimal straggler threshold. Finally, extensive simulation experiments of image classification on CIFAR-10 dataset are conducted, and the numerical results demonstrate the effectiveness of our proposed strategy.

Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning

Shaohuai Shi (The Hong Kong University of Science and Technology, Hong Kong); Xiaowen Chu (Hong Kong Baptist University, Hong Kong); Bo Li (Hong Kong University of Science and Technology, Hong Kong)

1

Synchronous stochastic gradient descent (S-SGD) with data parallelism is widely used for training deep learning (DL) models in distributed systems. A pipelined schedule of the computing and communication tasks of a DL training job is an effective scheme to hide some communication costs. In such pipelined S-SGD, tensor fusion (i.e., merging some consecutive layers' gradients for a single communication) is a key ingredient to improve communication efficiency. However, existing tensor fusion techniques schedule the communication tasks sequentially, which overlooks their independence nature. In this paper, we expand the design space of scheduling by exploiting simultaneous All-Reduce communications. Through theoretical analysis and experiments, we show that simultaneous All-Reduce communications can effectively improve the communication efficiency of small tensors. We formulate an optimization problem of minimizing the training iteration time, in which both tensor fusion and simultaneous communications are allowed. We develop an efficient optimal scheduling solution and implement the distributed training algorithm ASC-WFBP with Horovod and PyTorch. We conduct real-world experiments on an 8-node GPU cluster of 32 GPUs with 10Gbps Ethernet. Experimental results on four modern DNNs show that ASC-WFBP can achieve about 1.09 ⇥ 2.48⇥ speedup over the baseline without tensor fusion, and 1.15⇥ 1.35⇥ speedup over the state-of-the-art tensor fusion solution.

Low Sample and Communication Complexities in Decentralized Learning: A Triple Hybrid Approach

Xin Zhang (Iowa State University, USA); Jia Liu (The Ohio State University, USA); Zhengyuan Zhu (Iowa State University, USA); Elizabeth Serena Bentley (AFRL, USA)

3

Network-consensus-based decentralized learning optimization algorithms have attracted a significant amount of attention in recent years due to their rapidly growing applications. However, most of the existing decentralized learning algorithms could not achieve low sample and communication complexities simultaneously - two important metrics in evaluating the trade-off between computation and communication costs of decentralized learning. To overcome these limitations, in this paper, we propose a triple hybrid decentralized stochastic gradient descent (TH-DSGD) algorithm for efficiently solving non-convex network-consensus optimization problems for decentralized learning. We show that to reach an ɛ 2 -stationary solution, the total sample complexity of TH-DSGD is O(ɛ −-3 ) and the communication complexity is O(ɛ −-3 ), both of which are independent of dataset sizes and significantly improve the sample and communication complexities of the existing works. We conduct extensive experiments with a variety of learning models to verify our theoretical findings. We also show that our TH-DSGD algorithm is stable as the network topology gets sparse and enjoys better convergence in the large-system regime.

DC2: Delay-aware Compression Control for Distributed Machine Learning

Ahmed M. Abdelmoniem and Marco Canini (KAUST, Saudi Arabia)

1

Distributed training performs data-parallel training of DNN models which is a necessity for increasingly complex models and large datasets. Recent works are identifying major communication bottlenecks in distributed training. These works seek possible opportunities to speed-up the training in systems supporting distributed ML workloads. As communication reduction, compression techniques are proposed to speed-up this communication phase. However, compression comes at the cost of reduced model accuracy, especially when compression is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control and network delays in applying compression adaptively. DC2 not only compensates for network variations but can also strike a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module to the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. Our evaluation of different popular CNN models and datasets shows that DC2 improves training speed-ups of up to 41⇥ and 5.3⇥ over baselines with no-compression and uniform compression, respectively.

Session Chair

Zhichao Cao (Michigan State University)

Session E-9

Sensing and Learning

Conference

2:30 PM — 4:00 PM EDT

Local

May 13 Thu, 11:30 AM — 1:00 PM PDT

DeepSense: Fast Wideband Spectrum Sensing Through Real-Time In-the-Loop Deep Learning

Daniel Uvaydov, Salvatore D'Oro, Francesco Restuccia and Tommaso Melodia (Northeastern University, USA)

0

Spectrum sharing will be a key technology to tackle spectrum scarcity in the sub-6 GHz bands. To fairly access the shared bandwidth, wireless users will necessarily need to quickly sense large portions of spectrum and opportunistically access unutilized bands. The key unaddressed challenges of spectrum sensing are that (i) it has to be performed with extremely low latency over large bandwidths to detect tiny spectrum holes and to guarantee strict real-time digital signal processing (DSP) constraints; (ii) its underlying algorithms need to be extremely accurate, and flexible enough to work with different wireless bands and protocols to find application in real-world settings. To the best of our knowledge, the literature lacks spectrum sensing techniques able to accomplish both requirements. In this paper, we propose DeepSense, a software/hardware framework for real-time wideband spectrum sensing that relies on real-time deep learning tightly integrated into the transceiver's baseband processing logic to detect and exploit unutilized spectrum bands. DeepSense uses a convolutional neural network (CNN) implemented in the wireless platform's hardware fabric to analyze a small portion of the unprocessed baseband waveform to automatically extract the maximum amount of information with the least amount of I/Q samples. We extensively validate the accuracy, latency and generality performance of DeepSense with (i) a 400 GB dataset containing hundreds of thousands of WiFi transmissions collected "in the wild" with different Signal-to-Noise-Ratio (SNR) conditions and over different days; (ii) a dataset of transmissions collected using our own software-defined radio testbed; and (iii) a synthetic dataset of LTE transmissions under controlled SNR conditions. We also measure the real-time latency of the CNNs trained on the three datasets with an FPGA implementation, and compare our approach with a fixed energy threshold mechanism. Results show that our learning-based approach can deliver a precision and recall of 98% and 97% respectively and a latency as low as 0.61ms. For reproducibility and benchmarking purposes, we pledge to share the code and the datasets used in this paper to the community.

Bayesian Online Learning for Energy-Aware Resource Orchestration in Virtualized RANs

Jose A. Ayala-Romero (Trinity College Dublin, Ireland); Andres Garcia-Saavedra (NEC Labs Europe, Germany); Xavier Costa-Perez (NEC Laboratories Europe, Germany); George Iosifidis (Delft University of Technology, The Netherlands)

0

Radio Access Network Virtualization (vRAN) will spearhead the quest towards supple radio stacks that adapt to heterogeneous infrastructure: from energy-constrained platforms deploying cells-on-wheels (e.g., drones) or battery-powered cells to green edge clouds. We perform an in-depth experimental analysis of the energy consumption of virtualized Base Stations (vBSs) and render two conclusions: (i) characterizing performance and power consumption is intricate as it depends on human behavior such as network load or user mobility; and (ii) there are many control policies and some of them have non-linear and monotonic relations with power and throughput. Driven by our experimental insights, we argue that machine learning holds the key for vBS control. We formulate two problems and two algorithms: (i) BP-vRAN, which uses Bayesian online learning to balance performance and energy consumption, and (ii) SBP-vRAN, which augments our Bayesian optimization approach with safe controls that maximize performance while respecting hard power constraints. We show that our approaches are data-efficient and have provably performance, which is paramount for carrier-grade vRANs. We demonstrate the convergence and flexibility of our approach and assess its performance using an experimental prototype.

Multi-Agent Reinforcement Learning for Urban Crowd Sensing with For-Hire Vehicles

Rong Ding (Shanghai Jiao Tong University, China); Zhaoxing Yang, Yifei Wei and Haiming Jin (Shanghai Jiao Tong University, China); Xinbing Wang (Shanghai Jiaotong University, China)

2

Recently, vehicular crowd sensing (VCS) that leverages sensor-equipped urban vehicles to collect city-scale sensory data has emerged as a promising paradigm for urban sensing. Nowadays, a wide spectrum of VCS tasks are carried out by for-hire vehicles (FHVs) due to various hardware and software constraints that are difficult for private vehicles to satisfy. However, such FHV-enabled VCS systems face a fundamental yet unsolved problem of striking a balance between the order-serving and sensing outcomes. To address this problem, we propose a novel graph convolutional cooperative multi-agent reinforcement learning (GCC-MARL) framework, which helps FHVs make distributed routing decisions that cooperatively optimize the system-wide global objective. Specifically, GCC-MARL meticulously assigns credits to agents in the training process to effectively stimulate cooperation, represents agents' actions by a carefully chosen statistics to cope with the variable agent scales, and integrates graph convolution to capture useful spatial features from complex large-scale urban road networks. We conduct extensive experiments with a real-world dataset collected in Shenzhen, China, containing around 1 million trajectories and 50 thousand orders of 553 taxis per-day from June 1st to 30th, 2017. Our experiment results show that GCC-MARL outperforms state-of-the-art baseline methods in order-serving revenue, as well as sensing coverage and quality.

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

Zhe Zhang and Chuan Wu (The University of Hong Kong, Hong Kong); Zongpeng Li (Wuhan University & University of Calgary, China)

0

Distributed machine learning with multiple concurrent workers has been widely adopted to train large deep neural networks (DNNs). Parameter synchronization is a key component in each iteration of distributed training, where workers exchange locally computed gradients through an AllReduce operation or parameter servers, for global parameter updates. Parameter synchronization often constitutes a significant portion of the training time; minimizing the communication time contributes substantially to DNN training speed-up. Standard ring-based AllReduce or PS architecture work efficiently mostly with homogeneous inter-worker connectivity. However, available bandwidth among workers in real-world clusters is often heterogeneous, due to different hardware configurations, switching topologies, and contention with concurrent jobs. This work investigates the best parameter synchronization topology and schedule among workers for most expedited communication in distributed DNN training. We show that the optimal parameter synchronization topology should be comprised of trees with different workers as roots, each for aggregating or broadcasting a partition of gradients/parameters. We identify near-optimal forest packing to maximally utilize available bandwidth and overlap aggregation and broadcast stages to minimize communication time. We provide theoretical analysis of the performance bound, and show that our scheme outperforms state-of-the-art parameter synchronization schemes by up to 18.3 times with extensive evaluation under various settings.

Session Chair

Bo Ji (Virginia Tech)

Session E-10

Learning Networks

Conference

4:30 PM — 6:00 PM EDT

Local

May 13 Thu, 1:30 PM — 3:00 PM PDT

Analyzing Learning-Based Networked Systems with Formal Verification

Arnaud Dethise and Marco Canini (KAUST, Saudi Arabia); Nina Narodytska (VMware Research Group, USA)

0

As more applications of (deep) neural networks emerge in the computer networking domain, the correctness and predictability of a neural agent's behavior for corner case inputs are becoming crucial. Enabling the formal analysis of agents with nontrivial properties, we bridge between specifying intended high-level behavior and expressing low-level statements directly encoded into an efficient verification framework. Our results support that within minutes, one can establish the resilience of a neural network to adversarial attacks on its inputs, as well as formally prove properties that were previously relying on educated guesses. Finally, we also show how formal verification can help create an accurate visual representation of an agent behavior to perform visual inspection and improve its trustworthiness.

Bringing Fairness to Actor-Critic Reinforcement Learning for Network Utility Optimization

Jingdi Chen and Yimeng Wang (The George Washington University, USA); Tian Lan (George Washington University, USA)

0

Fairness is a crucial design objective in virtually all network optimization problems, where limited system resources are shared by multiple agents. Recently, reinforcement learning has been successfully applied to autonomous online decision making in many network design and optimization problems. However, most of them try to maximize the long-term (discounted) reward of all agents, without taking fairness into account. In this paper, we propose a family of algorithms that bring fairness to actor-critic reinforcement learning for optimizing general fairness utility functions. In particular, we present a novel method for adjusting the rewards in standard reinforcement learning by a multiplicative weight depending on both the shape of fairness utility and some statistics of past rewards. It is shown that for proper choice of the adjusted rewards, a policy gradient update converges to at least a stationary point of general α-fairness utility optimization. It inspires the design of fairness optimization algorithms in actor-critic reinforcement learning. Evaluations show that the proposed algorithm can be easily deployed in real-world network optimization problems, such as wireless scheduling and video QoE optimization, and can significantly improve the fairness utility value over previous heuristics and learning algorithms.

Incentive Mechanism Design for Distributed Coded Machine Learning

Ningning Ding (The Chinese University of Hong Kong, Hong Kong); Zhixuan Fang (Tsinghua University, China); Lingjie Duan (Singapore University of Technology and Design (SUTD), Singapore); Jianwei Huang (The Chinese University of Hong Kong, Shenzhen, China)

1

A distributed machine learning platform needs to recruit many heterogeneous worker nodes to finish computation simultaneously. As a result, the overall performance may be degraded due to straggling workers. By introducing redundancy into computation, coded machine learning can effectively improve the runtime performance by recovering the final computation result through the first k (out of the total n) workers who finish computation. While existing studies focus on designing efficient coding schemes, the issue of designing proper incentives to encourage worker participation is still under-explored. This paper studies the platform's optimal incentive mechanism for motivating proper workers' participation in coded machine learning, despite the incomplete information about heterogeneous workers' computation performances and costs. A key contribution of this work is to summarize workers' multi-dimensional heterogeneity as a one-dimensional metric, which guides the platform's efficient selection of workers under incomplete information with a linear computation complexity. Moreover, we prove that the optimal recovery threshold k is linearly proportional to the participator number n if we use the widely adopted MDS codes for data encoding. We also show that the platform's increased cost due to incomplete information disappears when worker number is sufficiently large, but it does not monotonically decrease in worker number.

Efficient Learning-based Scheduling for Information Freshness in Wireless Networks

Bin Li (University of Rhode Island, USA)

1

Motivated by the recent trend of integrating artificial intelligence into the Internet-of-Things (IoT), we consider the problem of scheduling packets from multiple sensing sources to a central controller over a wireless network. Here, packets from different sensing sources have different values or degrees of importance to the central controller for intelligent decision making. In such a setup, it is critical to provide timely and valuable information for the central controller. In this paper, we develop a parameterized maximum-weight type scheduling policy that combines both the AoI metrics and Upper Confidence Bound (UCB) estimates in its weight measure with parameter η. Here, UCB estimates balance the tradeoff between exploration and exploitation in learning and are critical for yielding a small cumulative regret. We show that our proposed algorithm yields the running average total age at most by O(N 2 η). We also prove that our proposed algorithm achieves the cumulative regret over time horizon T at most by O(NT/η+ √ NT log T ). This reveals a tradeoff between the cumulative regret and the running average total age: when increasing η, the cumulative regret becomes smaller, but is at the cost of increasing running average total age. Simulation results are provided to evaluate the efficiency of our proposed algorithm.

Session Chair

WenZhan Song (University of Georgia)

Program at a Glance