After reading my cubin file, I noticed that my kernel uses 28 registers per thread and no local or shared memory (except for the kernel parameters). So I tried to move some variables into shared memory and local memory to increase the occupancy of my SMs… without success.
Even after replacing the 10 floats held in registers
float m11, m12, m13, m14, m22, m23, m24, m33, m34, m44;
with a shared array
__shared__ float mat[BLOCKDIM_X*BLOCKDIM_Y*10];
(this is not efficient, of course, but it was only a test), the number of registers in the new cubin file stays the same (although the SMEM value increases a lot, of course)!
The same happens when I use global memory instead of registers…
I am trying to compare the PTX files of the different tests, but it is not so easy at first…
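The shared-memory test described above can be sketched roughly as follows. This is only a minimal illustration, not the real kernel: the names `sharedScratchKernel` and `runSharedScratch`, the 8x8 block size, the strided indexing, and the arithmetic are all hypothetical, and only the `mat` declaration is taken from the test itself.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed block shape for this sketch; the real values are not given in the post.
#define BLOCKDIM_X 8
#define BLOCKDIM_Y 8
#define NVALS 10   // the 10 floats that used to live in registers

// Hypothetical kernel: each thread keeps its 10 values in shared memory
// instead of in 10 separate float variables.
__global__ void sharedScratchKernel(const float* in, float* out, int n)
{
    __shared__ float mat[BLOCKDIM_X * BLOCKDIM_Y * NVALS];

    const int threadsPerBlock = BLOCKDIM_X * BLOCKDIM_Y;
    const int tid = threadIdx.y * BLOCKDIM_X + threadIdx.x;
    const int gid = blockIdx.x * threadsPerBlock + tid;
    if (gid >= n) return;   // safe here: the kernel has no __syncthreads()

    // Value k of this thread is stored at mat[k * threadsPerBlock + tid],
    // so neighbouring threads access neighbouring shared-memory words.
    const float x = in[gid];
    for (int k = 0; k < NVALS; ++k)
        mat[k * threadsPerBlock + tid] = x * (k + 1);

    // Placeholder arithmetic: sum the 10 stored values (gives 55 * x).
    float sum = 0.0f;
    for (int k = 0; k < NVALS; ++k)
        sum += mat[k * threadsPerBlock + tid];
    out[gid] = sum;
}

// Hypothetical host helper: runs the kernel on n inputs equal to 1.0f and
// returns the last output element, or -1.0f if a CUDA call fails.
float runSharedScratch(int n)
{
    std::vector<float> hIn(n, 1.0f), hOut(n, 0.0f);
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1.0f;
    }
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threadsPerBlock = BLOCKDIM_X * BLOCKDIM_Y;
    dim3 block(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid((n + threadsPerBlock - 1) / threadsPerBlock);
    sharedScratchKernel<<<grid, block>>>(dIn, dOut, n);

    // The blocking copy back also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? hOut[n - 1] : -1.0f;
}
```

Calling `runSharedScratch(1000)` from a `main` function exercises the kernel. Building with `nvcc --ptxas-options=-v` should print the register and shared-memory usage of each kernel, which may be easier to compare between variants than reading the PTX by hand.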