Interpretable Analysis of Production GPU Clusters Monitoring Data via Association Rule Mining

Jan 1, 2024

Baolin Li

Siddharth Samsi

Vijay Gadepally

Devesh Tiwari

Abstract

Modern high-performance computing (HPC) and cloud computing systems are integrating powerful GPUs to accelerate increasingly demanding deep learning workloads. To improve cluster efficiency and better understand user behavior and job characteristics, system operators collect operational data for trace analysis. However, previous analyses of these system logs have lacked interpretability, and there is no systematic approach that can be applied across different datacenter traces and return interpretable results. In this work, we propose a workflow to discover hidden associations between the collected features of system jobs. Our analysis yields association rules that can be directly interpreted as operational insights. Using this approach, we conducted case studies on the traces of three large-scale multi-tenant GPU clusters running production machine learning workloads. We focused on GPU underutilization and job failures, revealing possible reasons for these job behaviors and suggesting ways to mitigate them. Our case studies demonstrate the feasibility of our interpretable analysis workflow, which can be adopted by other HPC and cloud computing system operators.
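To make the idea of association rules between job features concrete, the following is a minimal sketch of the standard support, confidence, and lift measures computed over a toy set of job records. It is not the paper's workflow or code: the job records, the feature names (`gpu_util=low`, `status=failed`, and so on), the thresholds, and the helper `mine_pair_rules` are all hypothetical, and the sketch only considers rules with a single item on each side.

```python
from itertools import permutations

# Hypothetical toy data: each job is a set of discretized feature items.
# These records are illustrative only and do not come from any real trace.
jobs = [
    {"gpu_util=low", "status=failed", "runtime=short"},
    {"gpu_util=low", "status=failed", "runtime=short"},
    {"gpu_util=high", "status=completed", "runtime=long"},
    {"gpu_util=low", "status=completed", "runtime=short"},
    {"gpu_util=high", "status=completed", "runtime=long"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_pair_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Enumerate single-item rules A -> C that pass both thresholds.

    Hypothetical helper: a brute-force pass over item pairs, which is
    fine for a toy example but not how large traces would be mined.
    """
    items = sorted(set().union(*transactions))
    rules = []
    for antecedent, consequent in permutations(items, 2):
        s_both = support({antecedent, consequent}, transactions)
        if s_both < min_support:
            continue
        confidence = s_both / support({antecedent}, transactions)
        if confidence < min_confidence:
            continue
        # Lift > 1 means the consequent is more likely given the antecedent.
        lift = confidence / support({consequent}, transactions)
        rules.append({
            "antecedent": antecedent,
            "consequent": consequent,
            "support": s_both,
            "confidence": confidence,
            "lift": lift,
        })
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


rules = mine_pair_rules(jobs)
for r in rules:
    print(f'{r["antecedent"]} -> {r["consequent"]}: '
          f'support={r["support"]:.2f}, '
          f'confidence={r["confidence"]:.2f}, lift={r["lift"]:.2f}')
```

In this toy data, a rule such as `status=failed -> gpu_util=low` reads directly as a statement an operator can act on: every failed job in the sample also had low GPU utilization. That readability is the sense in which association rules are interpretable.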

Type

Conference paper

Publication

In Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Last updated on Jun 17, 2024
