Workshops

Workshop on Network Optimization for Large AI Models

Session WNOLAM-S1

Workshop on Network Optimization for Large AI Models — Session 1

Conference time: 2:00 PM — 3:30 PM PDT
Local time: May 20 Mon, 5:00 PM — 6:30 PM EDT
Location: Georgia B

Network Bursts and Bottlenecks: Challenges of Distributed DNN Training Traffic

Jorg Liebeherr (University of Toronto)

Network traffic from distributed training of deep neural network (DNN) models has unique and peculiar properties. On the one hand, since the overall traffic pattern is repeated in each round of training, the traffic is highly predictable. On the other hand, the transmissions of large tensors create a high degree of traffic burstiness, which may result in microbursts and congestion events. This presentation uses traffic measurements of distributed training of a DNN to characterize the network traffic and analyze its properties. We show that synchronization barriers and application orchestration can be effective in preventing network congestion even with a large number of worker nodes.
Speaker: Jorg Liebeherr (University of Toronto)
Professor, Department of Electrical and Computer Engineering, University of Toronto
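As a rough illustration of the burstiness discussed in the abstract above, the toy model below estimates the peak load a group of workers offers to the fabric when they all push a gradient tensor at the start of a round, versus when an orchestration layer staggers their start times. The worker count, tensor size, NIC speed, and stagger interval are illustrative assumptions, not measurements from the talk.

```python
# Toy model (not the measurement setup from the talk): estimate the peak
# aggregate load that N workers offer to a shared fabric when they all push
# a gradient tensor at the start of a round, versus when their start times
# are staggered. All numbers are illustrative.

def peak_load_gbps(num_workers, tensor_mb, nic_gbps, stagger_ms, window_ms=10.0):
    """Peak aggregate offered load (Gbps) over any `window_ms` window.

    A worker counts as active in a window if its transfer overlaps it, and each
    active worker is assumed to drive its NIC at line rate while sending.
    """
    tensor_gbit = tensor_mb * 8 / 1000.0
    send_ms = tensor_gbit / nic_gbps * 1000.0          # time to push one tensor
    starts = [i * stagger_ms for i in range(num_workers)]
    horizon = starts[-1] + send_ms
    peak, t = 0.0, 0.0
    while t <= horizon:
        active = sum(1 for s in starts if s < t + window_ms and s + send_ms > t)
        peak = max(peak, active * nic_gbps)
        t += window_ms / 2
    return peak

if __name__ == "__main__":
    # 64 workers, 100 MB tensors, 100 Gbps NICs -- hypothetical values.
    print("no staggering:", peak_load_gbps(64, 100, 100, stagger_ms=0.0), "Gbps")
    print("2 ms stagger :", peak_load_gbps(64, 100, 100, stagger_ms=2.0), "Gbps")
```

Even this crude sketch shows the synchronized case offering the fabric an order of magnitude more load in a short window than the staggered case, which is the kind of microburst behavior the talk examines with real measurements.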

Communication-Efficient Online Distributed Optimization for Federated Learning

Ben Liang (University of Toronto)

We consider federated learning where multiple devices collaboratively train a global model with the assistance of a coordinating server. This is naturally captured by an online distributed optimization problem, where the sequence of objective functions varies unpredictably over time. Since practical machine learning often involves large datasets and high-dimensional parameters, the required information exchange between the devices and the server, repeated over many steps of the online optimization algorithm, can impose overwhelming strain on the communication capacity. In this talk, we present several recent case studies on communication-efficient online distributed optimization for federated learning. We first discuss a joint computation-communication optimization framework that encourages temporal similarity in a device’s local-model sequences to reduce the communication overhead. We then consider dynamic resource allocation among all devices to alleviate the impact of stragglers on learning latency. We finally study over-the-air computation where the devices adaptively update their local-model sequences considering the learning accuracy, the time-varying channel states, and the transmit power budget.
Speaker: Ben Liang (University of Toronto)
Professor, Department of Electrical and Computer Engineering, University of Toronto
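One simple way to picture the kind of communication savings the abstract above targets: a device skips an upload whenever its local model has drifted only slightly since its last transmission, exploiting temporal similarity in the local-model sequence. The rule, the drift model, and the threshold below are a toy sketch, not the algorithms presented in the talk.

```python
import numpy as np

# Toy illustration (not the algorithm from the talk): a device uploads its
# local model only when it has drifted enough from the last uploaded copy.

def lazy_upload(local_models, threshold):
    """Return the indices of rounds in which the device would transmit."""
    sent_rounds, last_sent = [], None
    for t, w in enumerate(local_models):
        if last_sent is None or np.linalg.norm(w - last_sent) > threshold:
            sent_rounds.append(t)
            last_sent = w.copy()
    return sent_rounds

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, trajectory = np.zeros(1000), []
    for _ in range(50):                     # 50 online rounds, slowly drifting model
        w = w + 0.01 * rng.standard_normal(1000)
        trajectory.append(w.copy())
    sent = lazy_upload(trajectory, threshold=0.5)
    print(f"uploads: {len(sent)}/50 rounds -> {len(sent)/50:.0%} of the communication")
```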

Optimizing Network Communications of Large-Scale AI Workloads Using Datacenter Multicast

Mohamed Hefeeda (Simon Fraser University)

Machine learning models are continually getting bigger. For example, recent large language models (LLMs) such as GPT-4 and Gemini were reported to have more than a trillion parameters each. Training such models involves iteratively exchanging massive amounts of data among thousands of computing nodes. In many cases, especially for large clusters, the network becomes the bottleneck, slowing down the entire training process and wasting precious GPU resources. In this talk, we will discuss the communication requirements of training large-scale machine learning models and how we can potentially realize substantial savings in network and processing resources using multicast. Current datacenter multicast systems, however, do not scale, as they impose considerable state and communication overheads. We will then present a new multicast system, called Orca, that addresses the challenges of multicast in datacenter networks. Orca divides the state and tasks of the data plane among switches and servers, and it partially offloads the management of multicast sessions to servers. It significantly reduces the state at switches, minimizes the bandwidth overhead, incurs small and constant processing overhead, and does not limit the size of multicast sessions. Through implementation in a testbed and large-scale simulations, we show that Orca substantially outperforms the closest work in the literature. For example, Orca reduces the switch state by up to two orders of magnitude and the communication overhead by up to 19X compared to the state of the art. We also show that Orca can considerably reduce the communication time of training tasks by optimizing the data transfer among computing nodes.
Speaker: Mohamed Hefeeda (Simon Fraser University)
Professor, School of Computing Science, Simon Fraser University
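For intuition about the potential savings the abstract above points to, the back-of-the-envelope sketch below compares the bytes a sender injects into the fabric when broadcasting model parameters to N workers via repeated unicast versus a single multicast delivery. The model size and worker count are assumed values; this is not Orca itself.

```python
# Back-of-the-envelope comparison (not Orca itself): bytes injected into the
# fabric when a server broadcasts model parameters to N workers with repeated
# unicast versus one multicast delivery. Values are illustrative.

def broadcast_bytes(model_gb, num_workers, multicast=False):
    """Total data (GB) the sender injects to reach all workers once."""
    return model_gb if multicast else model_gb * num_workers

if __name__ == "__main__":
    model_gb, workers = 2.0, 1024           # e.g., 2 GB of parameters, 1024 workers
    uni = broadcast_bytes(model_gb, workers, multicast=False)
    multi = broadcast_bytes(model_gb, workers, multicast=True)
    print(f"unicast  : {uni:,.0f} GB injected per broadcast")
    print(f"multicast: {multi:,.0f} GB injected per broadcast ({uni / multi:.0f}x less)")
```

The sketch ignores in-network duplication costs and switch state, which is precisely the overhead that a scalable datacenter multicast design such as the one presented here must keep small.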

Session WNOLAM-S2

Workshop on Network Optimization for Large AI Models — Session 2

Conference time: 4:00 PM — 6:00 PM PDT
Local time: May 20 Mon, 7:00 PM — 9:00 PM EDT
Location: Georgia B

Computer to Data Centers: Data and Computation Placement for Learning-Centric Applications

Jianping Pan (University of Victoria)

Learning-centric applications (LCAs), fueled by data mining, machine learning, and artificial intelligence (AI) and their applications in various domains including large language models, have imposed new challenges on the data centers where massive training and large-scale inference are performed nowadays. With the emerging AI clusters, the traditional network protocols adopted in most current data centers cannot fully leverage the greatly improved computation, storage, and acceleration capability and capacity to fulfill the needs of many demanding LCAs. In this talk, we will examine the problem and offer some insights from the viewpoint of network topology and data and computation placement, and we hope to attract attention from, and foster collaboration among, researchers and practitioners in computer architecture, operating systems, computer networks, high-performance computing, and LCA operation.
Speaker: Jianping Pan (University of Victoria)
Professor, Computer Science, University of Victoria
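To make the placement viewpoint in the abstract above concrete, the small sketch below greedily co-locates each training job with the rack that already holds most of its input data, so the bulk of reads stays rack-local. The jobs, racks, data sizes, and per-rack job capacity are all hypothetical; this illustrates the flavor of the problem, not the approach taken in the talk.

```python
# Toy placement sketch (not from the talk): greedily co-locate each training
# job with the rack that already holds most of its input data. Jobs, racks,
# and sizes are made up; rack_capacity limits the number of jobs per rack.

def place_jobs(jobs, rack_capacity):
    """jobs: {job: {rack: GB of its data on that rack}}. Returns job -> rack."""
    load = {r: 0 for racks in jobs.values() for r in racks}
    placement = {}
    # Place data-heavy jobs first so they are most likely to get their preferred rack.
    for job, data in sorted(jobs.items(), key=lambda kv: -sum(kv[1].values())):
        # Prefer racks with the most local data; break ties by current load.
        for rack in sorted(data, key=lambda r: (-data[r], load[r])):
            if load[rack] < rack_capacity:
                placement[job] = rack
                load[rack] += 1
                break
    return placement

if __name__ == "__main__":
    jobs = {
        "llm-pretrain": {"rack1": 800, "rack2": 100},
        "finetune-a":   {"rack1": 50,  "rack2": 300},
        "inference-b":  {"rack1": 20,  "rack2": 30},
    }
    print(place_jobs(jobs, rack_capacity=2))
```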

Panel: Network Optimization for Large AI Models

Moderator: Baochun Li (University of Toronto)

1. “Network for AI” and “AI for network” are two recent hotspots in both industry and academia. How do you envision their near-future development and potential interplay?

2. How has the rapid growth in AI model sizes impacted the design of large-scale distributed clusters? What specific challenges arise when scaling up clusters?

3. Can you share insights on the interplay between network optimization and other optimization techniques such as model parallelism, data parallelism, and pipelining in distributed AI training?

4. Looking ahead, what are the most pressing research directions for innovation in network optimization for large-scale AI clusters?
Moderator: Baochun Li (University of Toronto)
Panelists: Walter Willinger (NIKSUN, Inc.), Jorg Liebeherr (University of Toronto), Ben Liang (University of Toronto), Mohamed Hefeeda (Simon Fraser University), Jianping Pan (University of Victoria)
