La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers
Date
2018
Author
Guo, Hanqi
Di, Sheng
Gupta, Rinku
Peterka, Tom
Cappello, Franck
Abstract
We design and implement La VALSE, a scalable visualization tool to explore tens of millions of records of reliability, availability, and serviceability (RAS) logs, for IBM Blue Gene/Q systems. Our tool is designed to meet various analysis requirements, including tracing causes of failure events and investigating correlations from the redundant and noisy RAS messages. La VALSE consists of multiple linked views to visualize RAS logs; each log message has a time stamp, physical location, network address, and multiple categorical dimensions such as severity and category. The timeline view features the scalable ThemeRiver and arc diagrams that enable interactive exploration of tens of millions of log messages. The spatial view visualizes the occurrences of RAS messages on hundreds of thousands of elements of Mira (compute cards, node boards, midplanes, and racks) with view-dependent level-of-detail rendering. The multidimensional view enables interactive filtering of different categorical dimensions of RAS messages. To achieve interactivity, we develop an efficient and scalable online data cube engine that can query 55 million RAS logs in less than one second. We present several case studies on Mira, a top supercomputer at Argonne National Laboratory. The case studies demonstrate that La VALSE can help users quickly identify the sources of failure events and analyze spatiotemporal correlations of RAS messages at different scales.
BibTeX
@inproceedings {10.2312:pgv.20181099,
booktitle = {Eurographics Symposium on Parallel Graphics and Visualization},
editor = {Hank Childs and Fernando Cucchietti},
title = {{La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers}},
author = {Guo, Hanqi and Di, Sheng and Gupta, Rinku and Peterka, Tom and Cappello, Franck},
year = {2018},
publisher = {The Eurographics Association},
ISSN = {1727-348X},
ISBN = {978-3-03868-054-3},
DOI = {10.2312/pgv.20181099}
}