NSDI 2019

SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks

Abstract

Network measurement and monitoring have been key to understanding the inner workings of computer networks and debugging the performance problems of distributed applications. Despite many products and much research on these topics, performing accurate measurement at scale in near real-time has remained elusive in data centers. On the one hand, switch-based telemetry can give accurate per-packet views, but these must be assembled across the network and across packets to yield network- and application-level insight, which does not scale. On the other hand, purely end-host-based measurement is naturally scalable but so far has provided only partial views of in-network operation. In this paper, we set out to push the boundary of edge-based measurement by scalably and accurately reconstructing the full queueing dynamics in the network with data gathered entirely at the transmit and receive network interface cards (NICs). We begin with a signal processing framework for quantifying a key trade-off: reconstruction accuracy versus the amount of data gathered. Based on this, we propose SIMON, an accurate and scalable measurement system for data centers that reconstructs key network state variables such as packet queueing times at switches, link utilizations, and queue and link compositions at the flow level. We then demonstrate that the function approximation capability of multi-layered neural networks can speed up SIMON by a factor of 5,000 to 10,000, enabling it to run in near real-time. We deployed SIMON in three testbeds with different link speeds, layers of switching, and numbers of servers; evaluations with NetFPGAs and a cross-validation technique show that SIMON reconstructs queue lengths to within 3-5 KB and link utilizations to within 1% of their actual values. The accuracy and speed of SIMON enable sensitive A/B tests, which greatly aid the real-time development of algorithms, protocols, network software, and applications.
