Session B-1

## Collaborative Learning

Tue, May 3, 2:00 PM — 3:30 PM EDT

### ComAI: Enabling Lightweight, Collaborative Intelligence by Retrofitting Vision DNNs

Kasthuri Jayarajah (University of Maryland Baltimore County, USA); Dhanuja Wanniarachchige (Singapore Management University, Singapore); Tarek Abdelzaher (University of Illinois, Urbana Champaign, USA); Archan Misra (Singapore Management University, Singapore)

While Deep Neural Network (DNN) models have transformed machine vision capabilities, their extremely high computational complexity and model sizes present a formidable deployment roadblock for AIoT applications. We show that the complexity-vs-accuracy-vs-communication tradeoffs for such DNN models can be significantly improved via a novel, lightweight form of "collaborative machine intelligence" that requires only runtime changes to the inference process. In our proposed approach, called ComAI, the DNN pipelines of different vision sensors share intermediate processing state with one another, effectively providing hints about objects located within their mutually-overlapping Fields-of-View (FoVs). ComAI uses two novel techniques: (a) a secondary shallow ML model that uses features from early layers of a peer DNN to predict the object confidence values for selected anchor boxes in the collaborator DNN's image, and (b) a pipelined sharing of such confidence values, by collaborators, which is then used to *bias* the confidence values at the predictor layers of a reference DNN. We demonstrate that ComAI (a) can boost the accuracy (recall) of DNN inference by 20-50%, (b) works across heterogeneous DNN models and deployments, and (c) incurs negligible processing and bandwidth overheads compared to non-collaborative baselines.
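The biasing step described above can be sketched as a simple combination rule over anchor-box confidences. The additive form and the `alpha` weight below are illustrative assumptions; the abstract only states that peer-predicted confidences bias the reference DNN's predictor layers, not the exact function used:

```python
def bias_confidences(ref_conf, peer_hints, alpha=0.5):
    """Bias a reference DNN's anchor-box confidences with peer-predicted
    hints for the same boxes. The additive rule and alpha are hypothetical
    stand-ins for ComAI's actual biasing function."""
    return {box: min(1.0, conf + alpha * peer_hints.get(box, 0.0))
            for box, conf in ref_conf.items()}

# An anchor box the peer also sees gets its confidence boosted;
# boxes without a peer hint are unchanged.
biased = bias_confidences({"box7": 0.30, "box9": 0.80}, {"box7": 0.60})
```
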

### Dual-track Protocol Reverse Analysis Based on Share Learning

Weiyao Zhang, Xuying Meng and Yujun Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)

Private protocols, whose specifications are not publicly known, are widely used in the Industrial Internet. While providing customized service, their opaque nature also raises serious security concerns. Protocol Reverse Analysis (PRA) techniques have been developed to infer the specifications of private protocols. However, conventional PRA techniques fall short for two reasons: (i) Error propagation: canonical solutions strictly follow a serial "keyword extraction, then message clustering" structure, which ignores the interplay between the sub-tasks and lets errors flow and accumulate through the sequential workflow. (ii) Increasing diversity: as protocols' characteristics grow more diverse, tailoring solutions to specific types of protocols becomes infeasible. To address these issues, we design SPRA, a novel dual-track framework, and propose Share Learning, a new concept for protocol reverse analysis. In particular, on top of a shared protocol-learning layer, SPRA builds a parallel workflow that co-optimizes a generative model for keyword extraction and a probability-based model for message clustering, delivering automatic and robust syntax inference across diverse protocols and greatly improving performance. Experiments on five real-world datasets demonstrate that SPRA outperforms state-of-the-art PRA methods.

### FedFPM: A Unified Federated Analytics Framework for Collaborative Frequent Pattern Mining

Zibo Wang and Yifei Zhu (Shanghai Jiao Tong University, China); Dan Wang (The Hong Kong Polytechnic University, Hong Kong); Zhu Han (University of Houston, USA)

Frequent pattern mining is an important class of knowledge discovery problems. It aims at finding high-frequency items or structures (e.g., itemsets, sequences) in a database, and plays an essential role in deriving other interesting patterns, like association rules. The traditional approach of gathering data on a central server for analysis is no longer viable due to growing awareness of user privacy and newly established data-protection laws. Previous privacy-preserving frequent pattern mining approaches each target a particular problem and suffer great utility loss when handling complex structures. In this paper, we take the first initiative to propose a unified federated analytics framework (FedFPM) for a variety of frequent pattern mining problems, including item, itemset, and sequence mining. FedFPM achieves high data utility and guarantees local differential privacy without uploading raw data. Specifically, FedFPM adopts an interactive query-response approach between clients and a server. The server meticulously employs the Apriori property and Hoeffding's inequality to generate informed queries. The clients randomize their responses in the reduced space to realize local differential privacy. Experiments on three different frequent pattern mining tasks demonstrate that FedFPM achieves better performance than state-of-the-art specialized benchmarks, with a much smaller computation overhead.
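Two of the ingredients named in the abstract are standard and can be sketched concretely: the Hoeffding deviation bound a server can use to set query thresholds from noisy client reports, and the classic binary randomized-response mechanism clients can use for ε-local differential privacy. Both are generic textbook forms, not FedFPM's exact protocol:

```python
import math
import random

def hoeffding_deviation(n, delta):
    """Hoeffding's inequality: with probability >= 1 - delta, the empirical
    frequency over n independent client reports lies within this bound of
    the true frequency."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def randomized_response(truth, epsilon, rng=random.random):
    """Report the true bit with probability e^eps / (1 + e^eps), otherwise
    flip it; this mechanism satisfies epsilon-local differential privacy."""
    p_true = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng() < p_true else not truth
```

More clients (larger `n`) shrink the deviation bound, which is how an interactive server can prune candidate patterns whose observed frequency is provably below the support threshold.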

### Layer-aware Collaborative Microservice Deployment toward Maximal Edge Throughput

Lin Gu, Zirui Chen and Honghao Xu (Huazhong University of Science and Technology, China); Deze Zeng (China University of Geosciences, China); Bo Li (Hong Kong University of Science and Technology, Hong Kong); Hai Jin (Huazhong University of Science and Technology, China)

Lightweight container-based microservices have been widely advocated to promote the elasticity of edge clouds. The inherent layered structure of containers offers a compelling way to cope with the resource scarcity of edge servers through layer sharing, which can significantly increase storage utilization and improve edge throughput. Recent studies show that layers can be shared not only within the same server but also between servers, and microservice deployment can take full advantage of this. In this paper, we investigate how to collaboratively deploy microservices by incorporating both intra-server and inter-server layer sharing to maximize edge throughput. We formulate this problem as an integer linear program and prove it is NP-hard. We propose a randomized-rounding-based heuristic algorithm and formally analyze its guaranteed approximation ratio. Through extensive experiments, we verify the efficiency of our proposed algorithm; the results demonstrate that it can deploy 6x and 12x more microservice instances and improve edge throughput by 27.74% and 38.46% in comparison with state-of-the-art strategies.
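The core rounding step of such a heuristic is standard and can be sketched as follows. This is the generic randomized-rounding primitive applied to a fractional LP solution, not the paper's full algorithm, which must also respect per-server storage capacities and layer-sharing constraints:

```python
import random

def randomized_round(fractional_x, seed=0):
    """Round each fractional LP solution value x_i in [0, 1] to 1 with
    probability x_i, so the expected rounded objective matches the value
    of the LP relaxation."""
    rng = random.Random(seed)
    return [1 if rng.random() < x else 0 for x in fractional_x]

# Deploy decisions for four candidate microservice instances, given a
# (hypothetical) fractional solution of the relaxed ILP.
decisions = randomized_round([1.0, 0.0, 0.7, 0.3])
```

Variables at 0 or 1 in the LP solution are rounded deterministically; only the genuinely fractional ones are randomized, which is what the approximation-ratio analysis of such schemes typically exploits.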

###### Session Chair

Huaiyu Dai (NC State University)

Session B-2

## Distributed ML

Tue, May 3, 4:00 PM — 5:30 PM EDT

### Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training

Weiyan Wang (Hong Kong University of Science and Technology, Hong Kong); Cengguang Zhang (Hong Kong University of Science and Technology, China); Liu Yang (Hong Kong University of Science and Technology, Hong Kong); Kai Chen (Hong Kong University of Science and Technology, China); Kun Tan (Huawei, China)

BSP is the de-facto paradigm for distributed DNN training in today's production clusters. However, due to its global synchronization nature, its performance may be significantly degraded by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contention. Existing solutions, whether system-level optimizations that strengthen BSP (e.g., Ring or hierarchical All-reduce) or algorithmic optimizations that replace BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy.

In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, by taking network bottlenecks into account, DS-Sync improves communication efficiency by dividing workers into non-overlapping groups of different sizes that synchronize independently in a bottleneck-free manner. Meanwhile, it maintains convergence accuracy by iteratively shuffling workers among groups to reach a global consensus. We theoretically prove that DS-Sync converges properly under the non-convex, smooth conditions typical of DNNs. We further implement DS-Sync and integrate it with PyTorch, and our testbed experiments show that DS-Sync can achieve up to 94% improvement in end-to-end training over existing solutions while maintaining the same accuracy.
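The divide-and-shuffle idea can be illustrated on scalar "parameters": each round, workers average only within their group, and group membership is reshuffled so that repeated rounds drive all workers toward the global mean. This toy sketch omits gradients, bottleneck-aware group sizing, and the paper's convergence machinery:

```python
import random

def divide_sync(params, groups):
    """Divide step: each group averages its members' parameters
    independently, with no cross-group communication."""
    synced = dict(params)
    for group in groups:
        avg = sum(params[w] for w in group) / len(group)
        for w in group:
            synced[w] = avg
    return synced

def shuffle_groups(workers, sizes, rng):
    """Shuffle step: re-partition workers into groups of the given sizes."""
    order = list(workers)
    rng.shuffle(order)
    groups, start = [], 0
    for size in sizes:
        groups.append(order[start:start + size])
        start += size
    return groups

# Four workers, two groups of two; repeated divide-and-shuffle rounds
# contract everyone toward the global mean (2.5) without any global barrier.
params = {"w0": 1.0, "w1": 2.0, "w2": 3.0, "w3": 4.0}
rng = random.Random(0)
for _ in range(10):
    groups = shuffle_groups(params, [2, 2], rng)
    params = divide_sync(params, groups)
```

Note that equal-size group averaging preserves the global mean exactly, which is the invariant that lets the shuffled local consensus stand in for a global one.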

### Distributed Inference with Deep Learning Models across Heterogeneous Edge Devices

Chenghao Hu and Baochun Li (University of Toronto, Canada)

Recent years have witnessed increasing research attention on deploying deep learning models on edge devices for inference. Due to limited capabilities and power constraints, it may be necessary to distribute the inference workload across multiple devices. Existing mechanisms divide the model across edge devices under the assumption that deep learning models are constructed as a chain of layers. In reality, however, modern deep learning models are more complex, forming a directed acyclic graph (DAG) rather than a chain of layers.

In this paper, we present EdgeFlow, a new distributed inference mechanism designed for general DAG-structured deep learning models. Specifically, EdgeFlow partitions model layers into independent execution units with a new progressive model partitioning algorithm. By producing near-optimal model partitions, the algorithm improves the run-time performance of distributed inference as these partitions are distributed across edge devices. During inference, EdgeFlow orchestrates the intermediate results flowing through these units to fulfill the complicated layer dependencies. We have implemented EdgeFlow based on PyTorch and evaluated it with state-of-the-art deep learning models of different structures. The results show that EdgeFlow reduces inference latency by up to 40.2% compared with other approaches, demonstrating the effectiveness of our design.
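As a toy illustration of splitting a DAG-structured model into execution units, layers can be cut into contiguous blocks along a topological order, so that every unit depends only on earlier units. EdgeFlow's actual progressive algorithm optimizes the cut points against device capabilities; the even split below is only a placeholder for that optimization:

```python
from graphlib import TopologicalSorter

def partition_dag(preds, num_units):
    """Split a layer DAG into contiguous execution units along a
    topological order. `preds` maps each layer to the set of its
    predecessor layers; even splitting stands in for an optimized cut."""
    order = list(TopologicalSorter(preds).static_order())
    unit_size = -(-len(order) // num_units)  # ceiling division
    return [order[i:i + unit_size] for i in range(0, len(order), unit_size)]

# A small hypothetical DAG: conv1 feeds two parallel branches that merge.
preds = {"conv1": set(), "branch_a": {"conv1"}, "branch_b": {"conv1"},
         "concat": {"branch_a", "branch_b"}, "fc": {"concat"}}
units = partition_dag(preds, 2)
```

Because units follow a topological order, intermediate results only ever flow forward between units, which is the property a runtime needs to orchestrate the layer dependencies across devices.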

### Efficient Pipeline Planning for Expedited Distributed DNN Training

Ziyue Luo and Xiaodong Yi (The University of Hong Kong, Hong Kong); Long Guoping (Institute of Computing Technology, Chinese Academy of Sciences, China); Shiqing Fan (Alibaba Group, China); Chuan Wu (The University of Hong Kong, Hong Kong); Jun Yang and Wei Lin (Alibaba Group, China)

To train modern large DNN models, pipeline parallelism has recently emerged; it distributes the model across GPUs and enables different devices to process different microbatches in a pipeline. Earlier pipeline designs allow multiple versions of model parameters to co-exist (similar to asynchronous training) and cannot ensure the same model convergence and accuracy as training without pipelining. Synchronous pipelining has recently been proposed, which preserves model performance by enforcing a synchronization barrier between training iterations. Nonetheless, the synchronization barrier requires waiting for gradient aggregation from all microbatches and thus delays training progress. Optimized pipeline planning is needed to minimize this wait, and hence the training time, but it has not been well studied in the literature. This paper designs efficient, near-optimal algorithms for expediting synchronous pipeline-parallel training of modern large DNNs over arbitrary inter-GPU connectivity. Our algorithmic framework comprises two components: a pipeline partition and device mapping algorithm, and a pipeline scheduler that decides the processing order of microbatches over the partitions; together they minimize the per-iteration training time. We conduct thorough theoretical analysis, extensive testbed experiments and trace-driven simulation, and demonstrate that our scheme can accelerate training by up to 157% compared with state-of-the-art designs.

### Mercury: A Simple Transport Layer Scheduler to Accelerate Distributed DNN Training

Qingyang Duan, Zeqin Wang and Yuedong Xu (Fudan University, China); Shaoteng Liu (Huawei Corp., China); Jun Wu (Fudan University, China)

Communication scheduling is crucial to improving the efficiency of training large deep learning models with data parallelism, in which the transmission order of layer-wise deep neural network (DNN) tensors is determined for a better computation-communication overlap. Prior approaches adopt tensor partitioning to enhance priority scheduling at a finer granularity. However, a startup time slot inserted before each tensor partition neutralizes this scheduling gain. Tuning the optimal partition size is difficult, and application-layer solutions cannot eliminate the partitioning overhead. In this paper, we propose Mercury, a simple transport-layer scheduler that does not partition tensors but instead moves priority scheduling to the transport layer at packet granularity. The packets with the highest priority in the Mercury buffer are transmitted first. Mercury achieves near-optimal overlap between communication and computation. It leverages immediate aggregation at the transport layer to enable coincident gradient push and parameter pull. We implement Mercury in MXNet and conduct comprehensive experiments on five DNN models in an 8-node cluster with 10 Gbps Ethernet. Experimental results show that Mercury achieves about a 1.18∼2.18× speedup over vanilla MXNet, and a 1.08∼2.04× speedup over the state-of-the-art tensor partitioning solution.
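The packet-granularity scheduling policy can be sketched as a priority queue over buffered packets, where a packet's priority comes from its tensor's layer (earlier layers, needed sooner by the next forward pass, go first) and ties are served FIFO. The class below is an illustrative user-space sketch, not Mercury's actual transport-layer implementation:

```python
import heapq
import itertools

class PriorityPacketBuffer:
    """Buffer that always transmits the highest-priority (lowest layer
    index) packet first, serving packets of equal priority in FIFO order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving FIFO order

    def push(self, layer_priority, packet):
        heapq.heappush(self._heap, (layer_priority, next(self._counter), packet))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

# Packets from layer 1 preempt an already-buffered layer-3 packet,
# with no tensor partitioning and hence no per-partition startup slot.
buf = PriorityPacketBuffer()
buf.push(3, "grad[layer3].pkt0")
buf.push(1, "grad[layer1].pkt0")
buf.push(1, "grad[layer1].pkt1")
```

Because preemption happens per packet rather than per tensor partition, there is no partition size to tune; that is the key contrast with application-layer tensor partitioning drawn in the abstract.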

###### Session Chair

Ning Wang (Rowan University)