Interpretable Analysis of Production GPU Clusters Monitoring Data via Association Rule Mining
Abstract
Modern high-performance computing (HPC) and cloud computing systems are integrating powerful GPUs to accelerate increasingly demanding deep learning workloads. To improve cluster efficiency and better understand user behavior and job characteristics, system operators collect operational data for trace analysis. However, previous analyses of these system logs have lacked interpretability, and no systematic approach exists that can be applied across different datacenter traces and still return interpretable results. In this work, we propose a workflow to discover hidden associations between the collected features of system jobs. Our analysis yields association rules that translate directly into operational insights. Using this approach, we conducted case studies on the traces of three large-scale multi-tenant GPU clusters running production machine learning workloads. We focused on GPU underutilization and job failures, revealing possible reasons for these job behaviors and suggesting ways to mitigate them. Our case studies demonstrate the feasibility of our interpretable analysis workflow, which other HPC and cloud computing system operators can readily adopt.
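To make the idea of an association rule between job features concrete, the following is a minimal sketch in Python. It is not the paper's workflow: the abstract does not name a mining algorithm, so this uses brute-force enumeration of item pairs as a stand-in. The job records, the feature names (`gpu_util`, `status`, `runtime`), and the thresholds are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical, hand-made job records (NOT taken from any real cluster trace).
# Each job is a set of discretized "feature=value" items.
jobs = [
    {"gpu_util=low", "status=failed", "runtime=short"},
    {"gpu_util=low", "status=failed", "runtime=short"},
    {"gpu_util=low", "status=completed", "runtime=short"},
    {"gpu_util=high", "status=completed", "runtime=long"},
    {"gpu_util=high", "status=completed", "runtime=long"},
    {"gpu_util=high", "status=failed", "runtime=long"},
    {"gpu_util=low", "status=failed", "runtime=long"},
    {"gpu_util=high", "status=completed", "runtime=short"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support=0.25, min_confidence=0.7):
    """Enumerate single-item rules A -> B and score them.

    support(A, B) = P(A and B)
    confidence    = P(B | A) = support(A, B) / support(A)
    lift          = confidence / support(B); a value above 1 means A and B
                    co-occur more often than if they were independent.
    """
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # prune infrequent pairs before scoring rules
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = pair_support / support({antecedent}, transactions)
            lift = confidence / support({consequent}, transactions)
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, pair_support, confidence, lift))
    # Highest-lift rules first
    return sorted(rules, key=lambda r: r[4], reverse=True)


rules = mine_rules(jobs)
for antecedent, consequent, s, c, l in rules:
    print(f"{antecedent} -> {consequent}  support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

On this toy data, a rule such as `gpu_util=low -> status=failed` reads as "jobs with low GPU utilization tend to fail", which is the kind of statement an operator can act on directly. A real trace would need feature discretization first and a scalable frequent-itemset algorithm in place of the pair enumeration shown here.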
Publication
In Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS)