Workshops

Workshop on Network Optimization for Large AI Models

Session WNOLAM-S1

Workshop on Network Optimization for Large AI Models — Session 1

Conference time: 2:00 PM — 3:30 PM PDT
Local time: May 20 Mon, 5:00 PM — 6:30 PM EDT
Location: Georgia B

Network Bursts and Bottlenecks: Challenges of Distributed DNN Training Traffic

Jorg Liebeherr (University of Toronto)

Network traffic from distributed training of deep neural network (DNN) models has unique and peculiar properties. On the one hand, since the overall traffic pattern is repeated in each round of training, the traffic is highly predictable. On the other hand, the transmissions of large tensors create a high degree of traffic burstiness, which may result in microbursts and congestion events. This presentation uses traffic measurements of distributed training of a DNN to characterize the network traffic and analyze its properties. We show that synchronization barriers and application orchestration can be effective in preventing network congestion even with a large number of worker nodes.
Speaker: Jorg Liebeherr (University of Toronto)
Professor, Department of Electrical and Computer Engineering, University of Toronto
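As a rough illustration of the burstiness discussed in the abstract above, the toy model below estimates the peak load a group of workers offers to the fabric when they all push a gradient tensor at the start of a round, versus when an orchestration layer staggers their start times. The worker count, tensor size, NIC speed, and stagger interval are illustrative assumptions, not measurements from the talk.

```python
# Toy model (not the measurement setup from the talk): estimate the peak
# aggregate load that N workers offer to a shared fabric when they all push
# a gradient tensor at the start of a round, versus when their start times
# are staggered. All numbers are illustrative.

def peak_load_gbps(num_workers, tensor_mb, nic_gbps, stagger_ms, window_ms=10.0):
    """Peak aggregate offered load (Gbps) over any `window_ms` window.

    A worker counts as active in a window if its transfer overlaps it, and each
    active worker is assumed to drive its NIC at line rate while sending.
    """
    tensor_gbit = tensor_mb * 8 / 1000.0
    send_ms = tensor_gbit / nic_gbps * 1000.0          # time to push one tensor
    starts = [i * stagger_ms for i in range(num_workers)]
    horizon = starts[-1] + send_ms
    peak, t = 0.0, 0.0
    while t <= horizon:
        active = sum(1 for s in starts if s < t + window_ms and s + send_ms > t)
        peak = max(peak, active * nic_gbps)
        t += window_ms / 2
    return peak

if __name__ == "__main__":
    # 64 workers, 100 MB tensors, 100 Gbps NICs -- hypothetical values.
    print("no staggering:", peak_load_gbps(64, 100, 100, stagger_ms=0.0), "Gbps")
    print("2 ms stagger :", peak_load_gbps(64, 100, 100, stagger_ms=2.0), "Gbps")
```

Even this crude sketch shows the synchronized case offering the fabric an order of magnitude more load in a short window than the staggered case, which is the kind of microburst behavior the talk examines with real measurements.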

Communication-Efficient Online Distributed Optimization for Federated Learning

Ben Liang (University of Toronto)

We consider federated learning where multiple devices collaboratively train a global model with the assistance of a coordinating server. This is naturally captured by an online distributed optimization problem, where the sequence of objective functions varies unpredictably over time. Since practical machine learning often involves large datasets and high-dimensional parameters, the required information exchange between the devices and the server, repeated over many steps of the online optimization algorithm, can impose overwhelming strain on the communication capacity. In this talk, we present several recent case studies on communication-efficient online distributed optimization for federated learning. We first discuss a joint computation-communication optimization framework that encourages temporal similarity in a device’s local-model sequences to reduce the communication overhead. We then consider dynamic resource allocation among all devices to alleviate the impact of stragglers on learning latency. We finally study over-the-air computation where the devices adaptively update their local-model sequences considering the learning accuracy, the time-varying channel states, and the transmit power budget.
Speaker: Ben Liang (University of Toronto)
Professor, Department of Electrical and Computer Engineering, University of Toronto
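One simple way to picture the kind of communication savings the abstract above targets: a device skips an upload whenever its local model has drifted only slightly since its last transmission, exploiting temporal similarity in the local-model sequence. The rule, the drift model, and the threshold below are a toy sketch, not the algorithms presented in the talk.

```python
import numpy as np

# Toy illustration (not the algorithm from the talk): a device uploads its
# local model only when it has drifted enough from the last uploaded copy.

def lazy_upload(local_models, threshold):
    """Return the indices of rounds in which the device would transmit."""
    sent_rounds, last_sent = [], None
    for t, w in enumerate(local_models):
        if last_sent is None or np.linalg.norm(w - last_sent) > threshold:
            sent_rounds.append(t)
            last_sent = w.copy()
    return sent_rounds

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, trajectory = np.zeros(1000), []
    for _ in range(50):                     # 50 online rounds, slowly drifting model
        w = w + 0.01 * rng.standard_normal(1000)
        trajectory.append(w.copy())
    sent = lazy_upload(trajectory, threshold=0.5)
    print(f"uploads: {len(sent)}/50 rounds -> {len(sent)/50:.0%} of the communication")
```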

Optimizing Network Communications of Large-Scale AI Workloads Using Datacenter Multicast

Mohamed Hefeeda (Simon Fraser University)

Machine learning models are continually getting bigger. For example, recent large language models (LLMs) such as GPT-4 and Gemini were reported to have more than a trillion parameters each. Training such models involves iteratively exchanging massive amounts of data among thousands of computing nodes. In many cases, especially for large clusters, the network becomes the bottleneck, slowing down the entire training process and wasting precious GPU resources. In this talk, we will discuss the communication requirements of training large-scale machine learning models and how we can potentially realize substantial savings in network and processing resources using multicast. Current datacenter multicast systems, however, do not scale, as they impose considerable state and communication overheads. We will then present a new multicast system, called Orca, that addresses the challenges of multicast in datacenter networks. Orca divides the state and tasks of the data plane among switches and servers, and it partially offloads the management of multicast sessions to servers. It significantly reduces the state at switches, minimizes the bandwidth overhead, incurs small and constant processing overhead, and does not limit the size of multicast sessions. Through implementation in a testbed and large-scale simulations, we show that Orca substantially outperforms the closest work in the literature. For example, Orca reduces the switch state by up to two orders of magnitude and the communication overhead by up to 19X compared to the state of the art. We also show that Orca can considerably reduce the communication time of training tasks by optimizing the data transfer among computing nodes.
Speaker: Mohamed Hefeeda (Simon Fraser University)
Professor, School of Computing Science, Simon Fraser University
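For intuition about the potential savings the abstract above points to, the back-of-the-envelope sketch below compares the bytes a sender injects into the fabric when broadcasting model parameters to N workers via repeated unicast versus a single multicast delivery. The model size and worker count are assumed values; this is not Orca itself.

```python
# Back-of-the-envelope comparison (not Orca itself): bytes injected into the
# fabric when a server broadcasts model parameters to N workers with repeated
# unicast versus one multicast delivery. Values are illustrative.

def broadcast_bytes(model_gb, num_workers, multicast=False):
    """Total data (GB) the sender injects to reach all workers once."""
    return model_gb if multicast else model_gb * num_workers

if __name__ == "__main__":
    model_gb, workers = 2.0, 1024           # e.g., 2 GB of parameters, 1024 workers
    uni = broadcast_bytes(model_gb, workers, multicast=False)
    multi = broadcast_bytes(model_gb, workers, multicast=True)
    print(f"unicast  : {uni:,.0f} GB injected per broadcast")
    print(f"multicast: {multi:,.0f} GB injected per broadcast ({uni / multi:.0f}x less)")
```

The sketch ignores in-network duplication costs and switch state, which is precisely the overhead that a scalable datacenter multicast design such as the one presented here must keep small.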

Session WNOLAM-S2

Workshop on Network Optimization for Large AI Models — Session 2

Conference time: 4:00 PM — 6:00 PM PDT
Local time: May 20 Mon, 7:00 PM — 9:00 PM EDT
Location: Georgia B

Computer to Data Centers: Data and Computation Placement for Learning-Centric Applications

Jianping Pan (University of Victoria)

Learning-centric applications (LCAs), fueled by data mining, machine learning, and artificial intelligence (AI) and their applications in various domains including large language models, have imposed new challenges on the data centers where massive training and large-scale inference are performed nowadays. With the emerging AI clusters, the traditional network protocols adopted in most current data centers cannot fully leverage the greatly improved computation, storage, and acceleration capability and capacity to fulfill the needs of many demanding LCAs. In this talk, we will examine the problem and offer some insights from the viewpoint of network topology and data and computation placement, and we hope to attract attention from, and foster collaboration among, researchers and practitioners in computer architecture, operating systems, computer networks, high-performance computing, and LCA operation.
Speaker: Jianping Pan (University of Victoria)
Professor, Computer Science, University of Victoria
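To make the placement viewpoint in the abstract above concrete, the small sketch below greedily co-locates each training job with the rack that already holds most of its input data, so the bulk of reads stays rack-local. The jobs, racks, data sizes, and per-rack job capacity are all hypothetical; this illustrates the flavor of the problem, not the approach taken in the talk.

```python
# Toy placement sketch (not from the talk): greedily co-locate each training
# job with the rack that already holds most of its input data. Jobs, racks,
# and sizes are made up; rack_capacity limits the number of jobs per rack.

def place_jobs(jobs, rack_capacity):
    """jobs: {job: {rack: GB of its data on that rack}}. Returns job -> rack."""
    load = {r: 0 for racks in jobs.values() for r in racks}
    placement = {}
    # Place data-heavy jobs first so they are most likely to get their preferred rack.
    for job, data in sorted(jobs.items(), key=lambda kv: -sum(kv[1].values())):
        # Prefer racks with the most local data; break ties by current load.
        for rack in sorted(data, key=lambda r: (-data[r], load[r])):
            if load[rack] < rack_capacity:
                placement[job] = rack
                load[rack] += 1
                break
    return placement

if __name__ == "__main__":
    jobs = {
        "llm-pretrain": {"rack1": 800, "rack2": 100},
        "finetune-a":   {"rack1": 50,  "rack2": 300},
        "inference-b":  {"rack1": 20,  "rack2": 30},
    }
    print(place_jobs(jobs, rack_capacity=2))
```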

Panel: Network Optimization for Large AI Models

Moderator: Baochun Li (University of Toronto)

1. “Network for AI” and “AI for network” are two recent hotspots in both industry and academia. How do you envision their near-future development and potential interplay?

2. How has the rapid growth in AI model sizes impacted the design of large-scale distributed clusters? What specific challenges arise when scaling up clusters?

3. Can you share insights on the interplay between network optimization and other optimization techniques such as model parallelism, data parallelism, and pipelining in distributed AI training?

4. Looking ahead, what are the most pressing research directions for innovation in network optimization for large-scale AI clusters?
Moderator: Baochun Li (University of Toronto)
Panelists: Walter Willinger (NIKSUN, Inc.), Jorg Liebeherr (University of Toronto), Ben Liang (University of Toronto), Mohamed Hefeeda (Simon Fraser University), Jianping Pan (University of Victoria)
