I saw a while back that CUDA 10 introduced cuBLASLt, a much more lightweight BLAS library, which suited me quite nicely, so I rewrote my code to use it instead of regular cuBLAS.
These are the sizes of the cuBLAS libraries shipped with CUDA 10.2 and 11.1 for Linux:
$ ls -lh /usr/local/cuda-*/**/libcublas*
...
-rwxr-xr-x 1 root root  29M  6. Nov 17:48 cuda-10.2/lib64/libcublasLt.so.10.2.3.254
-rwxr-xr-x 1 root root  65M  6. Nov 17:48 cuda-10.2/lib64/libcublas.so.10.2.3.254
-rwxr-xr-x 1 root root 215M 14. Oct 21:34 cuda-11.1/lib64/libcublasLt.so.126.96.36.199
-rwxr-xr-x 1 root root 131M 14. Oct 21:34 cuda-11.1/lib64/libcublas.so.188.8.131.52
...
“Lightweight” cuBLASLt was indeed quite a bit more compact before, but it has grown to almost 8× its former size and is now considerably larger than cuBLAS itself. To say the least, I am confused. Right now all I use it for is GEMM, so I find it hard to justify shipping 215 MB for a single function. Is there any chance this will be fixed in a future release, and if not, are there any (open-source) recommendations for truly lightweight alternatives?
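For context, this is roughly the extent of my cuBLASLt usage — a sketch of a plain FP32 GEMM using the CUDA 11 descriptor API (error handling trimmed, matrix dimensions and the lack of algorithm selection/workspace are my simplifications, not a recommendation):

```cuda
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Bare-bones status check; a real application would do better.
static void check(cublasStatus_t s) {
    if (s != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cuBLASLt error %d\n", (int)s);
        exit(1);
    }
}

int main() {
    const int m = 4, n = 4, k = 4;  // illustrative sizes
    float *A, *B, *D;
    cudaMalloc(&A, m * k * sizeof(float));
    cudaMalloc(&B, k * n * sizeof(float));
    cudaMalloc(&D, m * n * sizeof(float));

    cublasLtHandle_t lt;
    check(cublasLtCreate(&lt));

    // Operation descriptor: FP32 compute, FP32 scalars.
    // (CUDA 10.x had a two-argument cublasLtMatmulDescCreate instead.)
    cublasLtMatmulDesc_t op;
    check(cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F));

    // Column-major layouts, the cuBLAS default.
    cublasLtMatrixLayout_t la, lb, ld;
    check(cublasLtMatrixLayoutCreate(&la, CUDA_R_32F, m, k, m));
    check(cublasLtMatrixLayoutCreate(&lb, CUDA_R_32F, k, n, k));
    check(cublasLtMatrixLayoutCreate(&ld, CUDA_R_32F, m, n, m));

    // D = 1.0 * A * B + 0.0 * D; null algo and zero workspace let
    // the library pick a default heuristic.
    const float alpha = 1.0f, beta = 0.0f;
    check(cublasLtMatmul(lt, op, &alpha, A, la, B, lb, &beta,
                         D, ld, D, ld, nullptr, nullptr, 0, 0));

    cublasLtMatrixLayoutDestroy(la);
    cublasLtMatrixLayoutDestroy(lb);
    cublasLtMatrixLayoutDestroy(ld);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(lt);
    cudaFree(A); cudaFree(B); cudaFree(D);
    return 0;
}
```

That is the whole dependency surface on my side — one matmul entry point plus the descriptor setup around it — which is why the 215 MB shared library feels so disproportionate.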