Dramatic size increase of "lightweight" cuBLASLt library from CUDA 10 to 11

I had read that CUDA 10 introduced a much more lightweight BLAS library, cuBLASLt, which suited me quite nicely, so I rewrote my code to use it instead of regular cuBLAS.
These are the sizes of the BLAS libraries shipped with CUDA 10.2 and 11.1 for Linux:

$ ls -lh /usr/local/cuda-*/**/libcublas*
...
-rwxr-xr-x 1 root root  29M  6. Nov 17:48 cuda-10.2/lib64/libcublasLt.so.10.2.3.254
-rwxr-xr-x 1 root root  65M  6. Nov 17:48 cuda-10.2/lib64/libcublas.so.10.2.3.254
-rwxr-xr-x 1 root root 215M 14. Okt 21:34 cuda-11.1/lib64/libcublasLt.so.11.3.0.106
-rwxr-xr-x 1 root root 131M 14. Okt 21:34 cuda-11.1/lib64/libcublas.so.11.3.0.106
...

“Lightweight” cuBLASLt was indeed quite a bit more compact before, but it has grown to almost 8x its previous size (29 MB to 215 MB) and is now considerably larger than cuBLAS itself. To say the least, I am very confused. All I currently use it for is GEMM, so it is hard to justify shipping 215 MB for a single function. Is there any chance this will be fixed in a future release, and if not, are there any (preferably open-source) recommendations for truly lightweight alternatives?

Update: I replaced cuBLASLt with a simple CUTLASS device-level GEMM. Surprisingly, its API turned out to be even simpler than either the cuBLAS or the cuBLASLt API. The increase in binary size is negligible, and performance wasn't critical for my use case anyway.
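For reference, a device-level GEMM with CUTLASS looks roughly like the sketch below. This assumes single-precision, column-major matrices and CUTLASS's default kernel configuration; the function name `run_gemm` and the exact template parameters are illustrative, not my actual code:

```cuda
#include <cutlass/gemm/device/gemm.h>

// Single-precision GEMM: C = alpha * A * B + beta * C,
// with all matrices column-major in device memory.
// (Assumption: default CUTLASS kernel configuration is sufficient.)
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C

cudaError_t run_gemm(int M, int N, int K,
                     float alpha, const float* A, int lda,
                     const float* B, int ldb,
                     float beta, float* C, int ldc) {
  Gemm gemm_op;
  // C is passed twice: once as the source accumulator, once as the output.
  Gemm::Arguments args({M, N, K},
                       {A, lda}, {B, ldb},
                       {C, ldc}, {C, ldc},
                       {alpha, beta});
  cutlass::Status status = gemm_op(args);
  return status == cutlass::Status::kSuccess ? cudaSuccess
                                             : cudaErrorUnknown;
}
```

Since CUTLASS is a header-only template library, only the kernel instantiations you actually use are compiled into your binary, which is why the size increase was negligible compared to linking a prebuilt 215 MB shared library.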