GEMV library for NVIDIA GPU

I’m currently researching GEMV kernels (for half precision, FP16) that achieve high performance through device-specific tuning. I’ve found the following libraries:

However, I haven’t been able to find a GEMV library specifically tuned for the A100 GPU.
So my questions are:

  1. Are there any other GEMV libraries I might have missed?
  2. If not, is using Nsight Profiler with various configurations the only way to tune GEMV performance on my device?

The Ampere architecture introduced hardware support for 2:4 structured sparsity in its Tensor Cores, as well as asynchronous copies from global to shared memory. Both can be useful for matrix multiplications, so profiling with Nsight and tuning launch configurations alone would not cover all A100-specific optimizations.

You could test those kernels against cuBLAS on your device.
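
As a starting point for that comparison, here is a minimal sketch of a cuBLAS baseline. It expresses the FP16 GEMV as a GEMM with a single output column via `cublasGemmEx`, times it with CUDA events, and reports the effective memory bandwidth, which is usually the more telling number for a memory-bound operation like GEMV. The matrix size, iteration count, and zero-filled inputs are assumptions chosen only for illustration; swap in your own shapes and data, and time your candidate kernel with the same events.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at line %d\n",             \
                    cudaGetErrorString(e), __LINE__);                 \
            exit(1);                                                  \
        }                                                             \
    } while (0)

#define CHECK_CUBLAS(call)                                            \
    do {                                                              \
        cublasStatus_t s = (call);                                    \
        if (s != CUBLAS_STATUS_SUCCESS) {                             \
            fprintf(stderr, "cuBLAS error %d at line %d\n",           \
                    (int)s, __LINE__);                                \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {
    // Assumed problem size: y (m) = A (m x n, column-major) * x (n).
    const int m = 8192, n = 8192;
    const int iters = 100;  // assumed number of timed repetitions

    __half *A, *x, *y;
    CHECK_CUDA(cudaMalloc(&A, sizeof(__half) * (size_t)m * n));
    CHECK_CUDA(cudaMalloc(&x, sizeof(__half) * n));
    CHECK_CUDA(cudaMalloc(&y, sizeof(__half) * m));
    // Zero-filled inputs are enough for timing; use real data to check results.
    CHECK_CUDA(cudaMemset(A, 0, sizeof(__half) * (size_t)m * n));
    CHECK_CUDA(cudaMemset(x, 0, sizeof(__half) * n));
    CHECK_CUDA(cudaMemset(y, 0, sizeof(__half) * m));

    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // With CUBLAS_COMPUTE_32F the scaling factors are passed as float.
    const float alpha = 1.0f, beta = 0.0f;

    // GEMV written as a GEMM with one output column: C(m x 1) = A(m x n) * B(n x 1).
    auto gemv = [&]() {
        return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                            m, 1, n,
                            &alpha,
                            A, CUDA_R_16F, m,
                            x, CUDA_R_16F, n,
                            &beta,
                            y, CUDA_R_16F, m,
                            CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    };

    CHECK_CUBLAS(gemv());  // warm-up call, not timed
    CHECK_CUDA(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));

    CHECK_CUDA(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i) CHECK_CUBLAS(gemv());
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));

    float total_ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&total_ms, start, stop));
    const double ms = total_ms / iters;

    // Bytes moved per call: read A and x, write y, 2 bytes per FP16 element.
    const double bytes = 2.0 * ((double)m * n + n + m);
    printf("avg time: %.4f ms, effective bandwidth: %.1f GB/s\n",
           ms, bytes / (ms * 1e6));

    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDA(cudaFree(A));
    CHECK_CUDA(cudaFree(x));
    CHECK_CUDA(cudaFree(y));
    return 0;
}
```

This should build with something like `nvcc -O2 -arch=sm_80 bench.cu -lcublas` (the file name is arbitrary). Comparing the reported bandwidth with the peak memory bandwidth of your A100 variant gives a rough idea of how much headroom is left for a hand-tuned kernel.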