I’m currently researching GEMV kernels (for half precision, FP16) that achieve high performance through device-specific tuning. I’ve found the following libraries:
- CUTLASS: cutlass/test/unit/gemm/device/gemv.cu (NVIDIA/cutlass on GitHub, main branch)
- FastGEMV: wangsiping97/FastGEMV on GitHub ("High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.")
- Custom HGEMV: Bruce-Lee-LY/cuda_hgemv on GitHub ("Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.")
However, I haven’t been able to find a GEMV library specifically tuned for the A100 GPU.
So my questions are:
- Are there any other GEMV libraries I might have missed?
- If not, is profiling various configurations with Nsight the only way to tune GEMV performance on my device?