Serving Machine Learning Inference Using Heterogeneous Hardware

Abstract

The growing popularity of machine learning algorithms and the wide availability of hardware accelerators have introduced new challenges in inference serving. This paper explores the opportunity to serve inference queries with a heterogeneous system. The system has a central optimizer that allocates heterogeneous hardware resources to serve queries cooperatively. The optimizer supports both energy minimization and throughput maximization while satisfying a latency target. The optimized heterogeneous serving system is evaluated against a homogeneous system on two representative real-world applications, radar nowcasting and object detection. Our evaluation results show that the power-optimized heterogeneous system achieves up to 36% power savings, and the throughput-optimized heterogeneous system increases query throughput by up to 53%.
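
To make the optimizer's role concrete, the sketch below frames the power-minimization case as a constrained allocation problem: pick instance counts per device type so that aggregate throughput meets demand, every serving device meets the latency target, and total power is minimized. This is only an illustrative brute-force search under hypothetical device profiles (`cpu`, `gpu`, `fpga` and their numbers are made up), not the paper's actual optimizer or measured data.

```python
from itertools import product
from dataclasses import dataclass

# Hypothetical device profiles (illustrative numbers, not from the paper):
# per-instance power draw (W), sustained throughput (queries/s),
# and tail latency (ms) at that load.
@dataclass
class Device:
    name: str
    power_w: float
    tput_qps: float
    latency_ms: float

DEVICES = [
    Device("cpu", 60.0, 40.0, 45.0),
    Device("gpu", 250.0, 400.0, 20.0),
    Device("fpga", 75.0, 120.0, 30.0),
]

def min_power_allocation(demand_qps, latency_slo_ms, max_per_type=4):
    """Exhaustively search instance counts per device type for the
    lowest-power allocation that meets demand within the latency SLO."""
    best = None
    for counts in product(range(max_per_type + 1), repeat=len(DEVICES)):
        # Only device types that individually meet the SLO may serve queries.
        usable = [(d, n) for d, n in zip(DEVICES, counts)
                  if n > 0 and d.latency_ms <= latency_slo_ms]
        tput = sum(d.tput_qps * n for d, n in usable)
        if tput < demand_qps:
            continue  # allocation cannot sustain the offered load
        power = sum(d.power_w * n for d, n in usable)
        if best is None or power < best[0]:
            best = (power, {d.name: n for d, n in usable})
    return best

# Example: serve 300 qps under a 35 ms latency target. With the profiles
# above, three FPGAs (225 W) beat a single GPU (250 W).
print(min_power_allocation(demand_qps=300.0, latency_slo_ms=35.0))
```

Swapping the objective from summed power to negated summed throughput (with a power budget as the constraint) gives the throughput-maximization counterpart; a real system would replace the exhaustive search with a proper solver.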

Publication
In Proceedings of 2021 IEEE High Performance Extreme Computing Conference (HPEC)
Baolin Li
Ph.D.

My research interests include high performance computing, cloud computing, and machine learning.