Session 3-B

Scheduling II

9:00 AM — 10:30 AM EDT
Jul 8 Wed, 9:00 AM — 10:30 AM EDT

Computation Scheduling for Wireless Powered Mobile Edge Computing Networks

Tongxin Zhu and Jianzhong Li (Harbin Institute of Technology, China); Zhipeng Cai and Yingshu Li (Georgia State University, USA); Hong Gao (University of Harbin Institute Technology, China)

Mobile Edge Computing (MEC) and Wireless Power Transfer (WPT) are envisioned as two promising techniques to satisfy the increasing energy and computation requirements of latency-sensitive and computation-intensive applications installed on mobile devices. The integration of MEC and WPT introduces a novel paradigm named Wireless Powered Mobile Edge Computing (WP-MEC). In WP-MEC networks, edge devices located at the edge of radio access networks, such as access points and base stations, transmit radio frequency signals to power mobile devices and mobile devices can offload their intensive computation workloads to edge devices. In this paper, we study the Computation Completion Ratio Maximization Scheduling problem for WP-MEC networks with multiple edge devices, which is proved to be NP-hard. We jointly optimize the WPT time allocation and computation scheduling for mobile devices in a WP-MEC network to maximize the computation completion ratio of the WP-MEC network and propose approximation algorithms. The approximation ratio and computation complexity of the proposed algorithms are theoretically analyzed. Extensive simulations are conducted to verify the performance of the proposed algorithms.

Distributed and Optimal RDMA Resource Scheduling in Shared Data Center Networks

Dian Shen, Luo Junzhou, Fang Dong, Xiaolin Guo and Kai Wang (Southeast University, China); John Chi Shing Lui (Chinese University of Hong Kong, Hong Kong)

Remote Dynamic Memory Access (RDMA) suffers from unfairness issues and performance degradation when multiple applications share RDMA network resources. Hence, an efficient resource scheduling mechanism is urged to optimally allocates RDMA resources among applications. However, traditional Network Utility Maximization (NUM) based solutions are inadequate for RDMA due to three challenges: 1) The standard NUM-oriented algorithm cannot deal with coupling variables introduced by multiple dependent RDMA operations; 2) The stringently RDMA on-board resources constraint complicates the standard NUM by introducing extra optimization dimensions; 3) Naively applying traditional algorithms for NUM suffers from scalability issues in solving a large-scale RDMA resource scheduling problem.

In this paper, we present a distributed and optimal resource scheduling for RDMA networks to tackle the aforementioned challenges. First, we propose DRUM to model the RDMA resource scheduling problem as a new variation of the NUM problem. Second, we present a distributed algorithm based on the alternating directional method of multipliers (ADMM), which has the property of convergence guarantee. Third, we implement our proposed algorithm in the real-world RDMA environment, and extensively evaluate it through large scale simulations and testbed experiments. Experimental results show that our method significantly improves applications' performance under resource contention.

Injection Time Planning: Making CQF Practical in Time-Sensitive Networking

Jinli Yan, Wei Quan and Xuyan Jiang (National University of Defense Technology, China); Zhigang Sun (National Unversity of Defense Technology, China)

Time-Aware Shaper (TAS) is a core mechanism to guarantee the deterministic transmission for periodic time- sensitive flows in Time-Sensitive Networking (TSN). The generic TAS requires complex configurations for the Gate Control List (GCL) attached to each queue in a switch. To simplify the design of TSN switch, a Ping-Pong queue-based model named Cyclic Queuing and Forwarding (CQF) was proposed in IEEE 802.1 Qch by assigning fixed configurations to TAS. However, IEEE 802.1 Qch only defines the queue model and workflow of CQF. A global planning mechanism which maps the time-sensitive flows onto the underlying resources both temporally and spatially is urgently needed to make CQF practical. In this paper, we propose an Injection Time Planning (ITP) mechanism to optimize the network throughput of time-sensitive flows based on the observation that the start time when the packets are injected into the network has an important influence on the utilization of CQF queue resources. ITP provides a global temporal and spatial resource abstraction to make the implementation details transparent to algorithm designers. Based on our ITP mechanism, a novel heuristic algorithm named Tabu- ITP with domain-specific optimizing strategies is designed and evaluated under three typical network topologies in industrial control scenarios.

Preemptive All-reduce Scheduling for Expediting Distributed DNN Training

Yixin Bao, Yanghua Peng, Yangrui Chen and Chuan Wu (The University of Hong Kong, Hong Kong)

Data-parallel training is widely used for scaling DNN training over large datasets, using the parameter server or all-reduce architecture. Communication scheduling has been promising to accelerate distributed DNN training, which aims to overlap communication with computation by scheduling the order of communication operations. We identify two limitations of previous communication scheduling work. First, layer-wise computation graph has been a common assumption, while modern machine learning frameworks (e.g., TensorFlow) use a sophisticated directed acyclic graph (DAG) representation as the execution model. Second, the default sizes of tensors are often less than optimal for transmission scheduling and bandwidth utilization. We propose PACE, a communication scheduler that preemptively schedules (potentially fused) all-reduce tensors based on the DAG of DNN training, guaranteeing maximal overlapping of communication with computation and high bandwidth utilization. The scheduler contains two integrated modules: given a DAG, we identify the best tensor-preemptive communication schedule that minimizes the training time; exploiting the optimal communication scheduling as an oracle, a dynamic programming approach is developed for generating a good DAG, which merges small communication tensors for efficient bandwidth utilization. Experiments in a GPU testbed show that PACE accelerates training with representative system configurations, achieving up to 36% speed-up compared with state-of-the-art solutions.

Session Chair

Haipeng Dai (Nanjing University)

Session 5-B

Network Optimization II

2:00 PM — 3:30 PM EDT
Jul 8 Wed, 2:00 PM — 3:30 PM EDT

SAFCast: Smart Inter-Datacenter Multicast Transfer with Deadline Guarantee by Store-And-Forwarding

Hsueh-Hong Kang, Chi-Hsiang Hung and Charles H.-P. Wen (National Chiao Tung University, Taiwan)

With the increasing demand of network services, many researches employ a Software Defined Networking architecture to manage large-scale inter-datacenter networks. Some of existing services such as backup, data migration and update need to replicate data to multiple datacenters by multicast before deadline. Recent works set up minimum-weight Steiner tree as routing paths for multicasting transfers to reduce bandwidth waste; meanwhile, using deadline-aware scheduling guarantees deadlines of requests. However, there is an issue of bandwidth competition among those works. SAFCast as a new algorithm is proposed for multicasting transfers and deadline-aware scheduling. In SAFCast, we develop a tree pruning process and make datacenters employ the store-and-forwarding mechanism to improve the issue of bandwidth competition. Meanwhile, more requests can be accepted by SAFCast. Our experimental result shows that SAFCast outperforms DDCCast in the acceptance rate by 16.5%. In addition, given a volume-to-price function for revenue, SAFCast can achieve 10% more profit than DDCCast. As a result, SAFCast is a better choice for cloud providers with the effective deadline guarantee and more efficient multicasting transfers.

Scheduling for Weighted Flow and Completion Times in Reconfigurable Networks

Michael Dinitz (Johns Hopkins University, USA); Benjamin Moseley (Carnegie Mellon University, USA)

New optical technologies offer the ability to reconfigure network topologies dynamically, rather than setting them once and for all. This is true in both optical wide area networks and in datacenters, despite the many differences between these two settings. Jia et al.~[INFOCOM '17] designed online scheduling algorithms for dynamically reconfigurable topologies for both the makespan and sum of completion times objectives. In this paper, we work in the same setting but study an objective that is more meaningful in an online setting: the sum of flow times. The flow time of a job is the total amount of time that it spends in the system, which may be considerably smaller than its completion time if it is released late. We provide competitive algorithms for the online setting with speed augmentation, and also give a lower bound proving that speed augmentation is in fact necessary. As a side effect of our techniques, we also improve and generalize the results of Jia et al. on completion times by giving an $O(1)$-competitive algorithm for the arbitrary sizes and release times setting even when different nodes have different degree bounds, and moreover allow for the weighted sum of completion times (or flow times).

Scheduling Placement-Sensitive BSP Jobs with Inaccurate Execution Time Estimation

Zhenhua Han and Haisheng Tan (University of Science and Technology of China, China); Shaofeng H.-C. Jiang (Weizmann Institute of Science, Israel); Xiaoming Fu (University of Goettingen, Germany); Wanli Cao (University of Science and Technology of China, China); Francis C.M. Lau (The University of Hong Kong, Hong Kong)

The Bulk Synchronous Parallel (BSP) paradigm is gaining tremendous importance recently because of the popularity of computations such as distributed machine learning and graph computation. In a typical BSP job, multiple workers concurrently conduct iterative computations, where frequent synchronization is required. Therefore, the workers should be scheduled simultaneously and their placement on different computing devices could significantly affect the performance. Simply retrofitting a traditional scheduling discipline will likely not yield the desired performance due to the unique characteristics of BSP jobs. In this work, we derive SPIN, a novel scheduling designed for BSP jobs with placement-sensitive execution to minimize the makespan of all jobs. We first prove the problem approximation hardness and then present how SPIN solves it with a rounding-based randomized approximation approach. Our analysis indicates SPIN achieves a good performance guarantee efficiently. Moreover, SPIN is robust against misestimation of job execution time by theoretically bounding its negative impact. We implement SPIN on a production-trace driven testbed with 40 GPUs. Our extensive experiments show that SPIN can reduce the job makespan and the average job completion time by up to 3x and 4.68x, respectively. Our approach also demonstrates better robustness to execution time misestimation compared with heuristic baselines.

Tiny Tasks - A Remedy for Synchronization Constraints in Multi-Server Systems

Markus Fidler and Brenton Walker (Leibniz Universität Hannover, Germany); Stefan Bora (Universität Hannover, Germany)

Queuing models of parallel processing systems typically assume that one has \(l\) servers and jobs are split into and equal number of \(k=l\) tasks. This seemingly simple approximation has surprisingly large consequences for the resulting stability and performance bounds. In reality, best practices for modern map-reduce systems indicate that a job's partitioning factor should be much larger than the number of servers available, with some researchers going to far as to advocate for a "tiny-tasks" regime, where jobs are split into over 10,000 tasks. In this paper we use recent advances in stochastic network calculus to fundamentally understand the effects of task granularity on parallel systems' scaling, stability, and performance. For the split-merge model, we show that when one allows for tiny tasks, the stability region is actually much better than had previously been concluded. For the single-queue fork-join model, we show that sojourn times quickly approach the optimal case when \(l\) "big tasks" are sub-divided into \(k >= l\) "tiny tasks". Our results are validated using extensive simulations, and the applicability of the models used is validated by experiments on an Apache Spark cluster.

Session Chair

Gong Zhang (Huawei Technologies Co. Ltd.)

Session 6-B

Network Optimization III

4:00 PM — 5:30 PM EDT
Jul 8 Wed, 4:00 PM — 5:30 PM EDT

Clustering-preserving Network Flow Sketching

Yongquan Fu, Dongsheng Li, Siqi Shen and Yiming Zhang (National University of Defense Technology, China); Kai Chen (Hong Kong University of Science and Technology, China)

Network monitoring is vital in modern clouds and data center networks that need diverse traffic statistics ranging from flow size distributions to heavy hitters. To cope with increasing network rates and massive traffic volumes, sketch based approximate measurement has been extensively studied to trade the accuracy for memory and computation cost, which unfortunately, is sensitive to hash collisions.

This paper presents a clustering-preserving sketch method to be resilient to hash collisions. We provide an equivalence analysis of the sketch in terms of the K-means clustering. Based on the analysis result, we cluster similar network flows to the same bucket array to reduce the estimation variance and uses the average to obtain unbiased estimation. Testbed shows that the framework adapts to line rates and provides accurate query results. Real-world trace-driven simulations show that LSS remains stable performance under wide ranges of parameters and dramatically outperforms state-of-the-art sketching structures, with over \(10^3\) to \(10^5\) times reduction in relative errors for per-flow queries as the ratio of the number of buckets to the number of network flows reduces from 10% to 0.1%.

Efficient Coflow Transmission for Distributed Stream Processing

Wenxin Li (Hong Kong University of Science & Technology, Hong Kong); Xu Yuan (University of Louisiana at Lafayette, USA); Wenyu Qu (Tianjin University, China); Heng Qi (Dalian University of Technology, China); Xiaobo Zhou, Sheng Chen and Renhai Xu (Tianjin University, China)

Distributed streaming applications require the underlying network flows to transmit packets continuously to keep their output results fresh. These results will become stale if no updates come, and their staleness is determined by the slowest flow. At this point, coflows can be semantically comprised. Hence, efficient coflow transmission is critical for streaming applications. However, prior coflow-based solutions have significant limitations. They use a one-shot performance metric---CCT (coflow completion time), which cannot continuously reflect the staleness of the output results for streaming applications. To this end, we propose a new performance metric---coflow age (CA), which tracks the longest time-since-last-service among all flows in a coflow. We consider a datacenter network with multiple coflows that continuously transmit packets between their source-destination pairs, and address the problem of minimizing the average long-term CA while simultaneously satisfying the throughput constraints from the coflows. We design a randomized algorithm and a drift-plus-age algorithm and show that they can make the average long-term CA to achieve nearly two times and arbitrarily close to the optimal value, respectively. Extensive simulations demonstrate that both of the proposed algorithms can significantly reduce the CA of coflows, without violating the throughput requirement of any coflow, compared to the state-of-the-art solution.

Online Network Flow Optimization for Multi-Grade Service Chains

Victor Valls (Yale University, USA); George Iosifidis (Trinity College Dublin, Ireland); Geeth Ranmal de Mel (IBM Research, United Kingdom (Great Britain)); Leandros Tassiulas (Yale University, USA)

We study the problem of in-network execution of data analytic services using multi-grade VNF chains. The nodes host VNFs offering different and possibly time-varying gains for each stage of the chain, and our goal is to maximize the analytics performance while minimizing the data transfer and processing costs. The VNFs' performance is revealed only after their execution, since it is data-dependent or controlled by third-parties, while the service requests and network costs might also vary with time. We devise an operation algorithm that learns, on the fly, the optimal routing policy and the composition and length of each chain. Our algorithm combines a lightweight sampling technique and a Lagrange-based primal-dual iteration, allowing it to be scalable and attain provable optimality guarantees. We demonstrate the performance of the proposed algorithm using a video analytics service, and explore how it is affected by different system parameters. Our model and optimization framework is readily extensible to different types of networks and services.

SketchFlow: Per-Flow Systematic Sampling Using Sketch Saturation Event

RhongHo Jang (Inha University, Korea (South) & University of Central Florida, USA); DaeHong Min and SeongKwang Moon (Inha University, Korea (South)); David Mohaisen (University of Central Florida, USA); Daehun Nyang (Ewha Womans University & TheVaulters Company, Korea (South))

Random sampling is a versatile tool to reduce the processing overhead in various systems. NetFlow uses a local table for counting records per flow, and sFlow sends out periodically collected packet headers to a collecting server over the network. Any measurement system falls into either one of these two models. To reduce the burden on the table or on the network, sampled packets are given to those systems. However, if the sampling rate is more than the available resource capacity, sampled packets will be dropped, which obviously degrades measurement quality. In this paper, we introduce a new concept of per-flow systematic sampling, and provide a concrete sampling method called SketchFlow using a sketch saturation event without any application-specific information to measure accurately per-flow spectral density on large volume of data in real-time. SketchFlow shows a new direction to the sampling framework of sketch saturation event-based sampling. We demonstrate SketchFlow's performance in terms of stable sampling rate, accuracy, and overhead using real-world dataset such as backbone network trace, hard disk I/O trace, and Twitter dataset.

Session Chair

Sergey I Nikolenko (Harbour Space University)

Made with in Toronto · Privacy Policy · © 2021 Duetone Corp.