I have a kernel that is supposed to use 32 KB of shared memory, but NVIDIA Nsight reports that it uses slightly more than that (about 32.7 KB). Why is that the case? Because of the extra usage, I cannot fit 2 thread blocks per SM. It is very hard for me to cut my own allocation to slightly below 32 KB to compensate for this.
(My GPU is a T4.)
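To make the occupancy problem concrete, here is a rough sketch of the arithmetic as I understand it. It assumes the T4 (compute capability 7.5) exposes at most 64 KB of shared memory per SM, and it ignores every other limit (registers, threads, allocation granularity). The function name `blocks_per_sm` and the constants are my own illustration, not anything taken from a profiler or the CUDA API, and the 32.7 KB figure is only the approximate value I read off the tool.

```python
KIB = 1024

def blocks_per_sm(smem_per_sm_bytes: int, smem_per_block_bytes: int) -> int:
    """Upper bound on resident blocks per SM, considering shared memory only."""
    if smem_per_block_bytes <= 0:
        raise ValueError("shared memory per block must be positive")
    return smem_per_sm_bytes // smem_per_block_bytes

# Assumption: 64 KB of shared memory available per SM on a T4.
T4_SMEM_PER_SM = 64 * KIB

requested = 32 * KIB        # what the kernel is supposed to use
reported = int(32.7 * KIB)  # approximate figure shown by the profiler

print(blocks_per_sm(T4_SMEM_PER_SM, requested))  # 2
print(blocks_per_sm(T4_SMEM_PER_SM, reported))   # 1
```

Under these assumptions, even a few hundred extra bytes per block are enough to drop the shared-memory limit from 2 blocks per SM to 1.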