Where does the small extra shared memory usage come from?

I have a kernel that is supposed to use 32 KB of shared memory, but Nsight reports slightly more than that (about 32.7 KB). Why is that? Because of the extra usage, I cannot fit 2 thread blocks per SM. It would be very hard for me to trim the kernel to slightly under 32 KB to compensate.

(My GPU is a T4.)


What is the number of bytes you expect, and how many bytes does Nsight report? For reference, 32 KB = 32768 bytes, and 32.7 KB is about 33484 bytes.
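
One way to get exact byte counts rather than rounded KB values is to ask the CUDA runtime directly. The sketch below is only an illustration: `myKernel`, its 8192-float tile, and the block size of 256 are hypothetical stand-ins for the real kernel and launch configuration, which were not posted. It prints the static shared memory the compiler assigned to the kernel, the per-block and per-SM shared memory limits of device 0, and the number of resident blocks per SM that the occupancy API predicts.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the real kernel (an assumption, not the
// poster's code): 8192 floats * 4 bytes = 32768 bytes of static shared memory.
__global__ void myKernel(float *out)
{
    __shared__ float tile[8192];
    // Each thread touches several tile entries so the array is really used.
    for (int i = threadIdx.x; i < 8192; i += blockDim.x)
        tile[i] = static_cast<float>(i);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int    blockSize   = 256;  // assumed launch configuration
    const size_t dynamicSmem = 0;    // assumed: no dynamic shared memory

    // What the compiler recorded for this kernel.
    cudaFuncAttributes attr;
    CHECK(cudaFuncGetAttributes(&attr, myKernel));

    // What the device offers.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));

    // How many blocks the runtime expects to fit on one SM.
    int blocksPerSM = 0;
    CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, blockSize, dynamicSmem));

    printf("static shared memory per block : %zu bytes\n", attr.sharedSizeBytes);
    printf("shared memory limit per block  : %zu bytes\n", prop.sharedMemPerBlock);
    printf("shared memory per SM           : %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("predicted resident blocks / SM : %d\n", blocksPerSM);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` also makes the compiler print the static shared memory (`smem`) of each kernel in bytes. If the number reported there and by `sharedSizeBytes` is exactly 32768, the extra usage Nsight shows is being added somewhere other than the kernel's own declarations; if it is already larger, the kernel itself declares more than expected.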