MixPert: Optimizing Mixed-Precision Floating-Point Emulation on GPU Integer Tensor Cores
Accelerators featuring mixed-precision tensor operations deliver substantial performance gains for many error-tolerant computing tasks, but their applicability is limited in scenarios that demand high precision. Emulating higher-precision data types with lower-precision ones can bridge this gap, yet existing techniques either fail to achieve sufficient accuracy or incur so much overhead that the performance gains are negated. To address this issue, we propose MixPert, a novel system that balances performance and accuracy via single-precision emulation on GPU Integer Tensor Cores. MixPert devises an efficient data layout and optimizes the computation pipeline on Tensor Cores. Through an in-depth analysis of the performance-precision trade-off, MixPert offers users multiple configurations tailored to their accuracy requirements. Furthermore, MixPert integrates seamlessly with compilers, enabling automatic adaptation and tuning of mixed-precision parameters. Evaluations on real-world scientific computing and deep learning applications show that MixPert achieves an average speedup of 1.72× over cuBLAS on general-purpose cores. While maintaining higher accuracy, MixPert also outperforms the state-of-the-art approaches APE and CUTLASS by 1.22× and 1.21×, respectively.
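To make the idea of single-precision emulation on integer units concrete, the following is a minimal NumPy sketch of one common scheme: each FP32 operand matrix is decomposed into a few small integer "slices" of mantissa bits, slice pairs are multiplied with integer-only GEMMs (standing in for int8 Tensor Core MMAs with int32 accumulation), and the partial products are recombined in floating point. The slice count, bit width, per-matrix scaling, and helper names here are illustrative assumptions, not MixPert's actual data layout or pipeline.

```python
import numpy as np

def split_into_slices(x, num_slices=3, bits=7):
    """Decompose an FP32 matrix into signed-integer slices.

    Slice i carries `bits` mantissa bits with weight 2**(-bits * (i + 1))
    relative to a shared per-matrix scale, so that approximately
    x == scale * sum_i slices[i] * 2**(-bits * (i + 1)).
    (A single per-matrix scale is a simplification; finer-grained
    scaling improves accuracy on real workloads.)
    """
    scale = np.max(np.abs(x)) + np.finfo(np.float32).tiny  # avoid divide-by-zero
    r = (x / scale).astype(np.float64)                     # normalize into [-1, 1]
    slices = []
    for _ in range(num_slices):
        s = np.rint(r * 2.0 ** bits)       # peel off the next `bits` bits;
        slices.append(s.astype(np.int32))  # each value fits in roughly a signed byte
        r = r * 2.0 ** bits - s            # residual feeds the next slice
    return slices, scale

def emulated_gemm(a, b, num_slices=3, bits=7):
    """Approximate an FP32 GEMM using only integer matrix multiplies.

    Each a_slices[i] @ b_slices[j] product mimics an integer Tensor Core
    MMA; cross terms whose combined weight falls below the target
    precision are dropped, trading a little accuracy for fewer multiplies.
    """
    a_slices, sa = split_into_slices(a, num_slices, bits)
    b_slices, sb = split_into_slices(b, num_slices, bits)
    c = np.zeros((a.shape[0], b.shape[1]), dtype=np.float64)
    for i in range(num_slices):
        for j in range(num_slices):
            if i + j >= num_slices:  # below target precision: skip
                continue
            c += (a_slices[i] @ b_slices[j]) * 2.0 ** (-bits * (i + j))
    # Fold in the two deferred 2**(-bits) slice weights and both scales.
    return (c * sa * sb * 2.0 ** (-2 * bits)).astype(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((64, 64)).astype(np.float32)
    b = rng.standard_normal((64, 64)).astype(np.float32)
    err = np.max(np.abs(emulated_gemm(a, b) - a @ b))
    print(f"max abs error vs. native FP32 GEMM: {err:.3e}")
```

A host-side sketch like this only illustrates the numerics; the paper's contribution lies in how such slices are laid out, pipelined, and tuned on GPU Integer Tensor Cores.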