PLDI 2024
Mon 24 - Fri 28 June 2024 Copenhagen, Denmark

Mixed-precision quantization can reduce the computational requirements of Deep Neural Network (DNN) models with minimal loss of accuracy. As executing mixed-precision DNN models on Neural Processing Units (NPUs) incurs significant under-utilization of computational resources, Precision-Scalable NPUs (PSNPUs) which can process multiple low-precision layers simultaneously have been proposed. However, the under-utilization still remains significant due to the lack of adequate scheduling algorithms to support multiple mixed-precision models on PSNPUs. Therefore, in this paper, we propose a dynamic programming-based scheduling algorithm for the operations of multiple mixed-precision models. Our scheduling algorithm finds the optimal execution plan that exploits the precision-scalable MACs to improve the end-to-end inference latency of mixed-precision models. We evaluate the performance of this algorithm in terms of hardware utilization, inference latency, and schedule search time compared to baseline scheduling algorithms. The experimental results show 1.23\texttimes{} inference latency improvements over the baseline algorithms within the allowed minutes.