Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources

Apr 1, 2023

Baolin Li

Siddharth Samsi

Vijay Gadepally

Devesh Tiwari

Abstract

Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demands. Despite their revenue-generation capability, these services must operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces Kairos, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. Kairos designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead and to distribute inference queries optimally at runtime. Our evaluation using industry-grade machine learning (ML) models shows that Kairos yields up to 2× the throughput of an optimal homogeneous solution and outperforms state-of-the-art schemes by up to 70%, even when the competing schemes are implemented advantageously so that their exploration overhead is ignored.

Type

Conference paper

Publication

In Proceedings of the 2023 ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC)

Last updated on Jun 17, 2024

Authors

Baolin Li

Ph.D.
