Each of my threads has to do some computations on one 4x4 float matrix (one matrix per thread). So I need 64 bytes per thread, i.e. sixteen 32-bit registers, just for the matrix.
There are 8192 32-bit registers per multiprocessor, so the matrices alone (16 registers per thread) limit me to 512 threads per MP! To improve occupancy, I am supposed to split them across several blocks, at least 2. In the end, my number of threads per block looks ridiculously small, since I expect to need a few other variables in my kernel as well. Am I making a mistake somewhere?
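To make my arithmetic explicit, here is a small host-side C++ sketch (not a kernel). The function names are hypothetical, and it assumes the whole matrix stays in registers, one float per 32-bit register, which the compiler does not guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Registers needed to hold one rows x cols matrix of 32-bit floats,
// ASSUMING every element lives in its own 32-bit register.
inline std::size_t registers_for_matrix(std::size_t rows, std::size_t cols) {
    return rows * cols;
}

// Upper bound on resident threads per multiprocessor imposed by the
// register file alone (other limits, such as shared memory, are ignored).
inline std::size_t max_threads_per_mp(std::size_t regs_per_mp,
                                      std::size_t regs_per_thread) {
    assert(regs_per_thread > 0);          // avoid division by zero
    return regs_per_mp / regs_per_thread; // integer division rounds down
}
```

With my numbers, `max_threads_per_mp(8192, registers_for_matrix(4, 4))` gives 512. If the kernel needed, say, 4 more registers per thread (a made-up figure), the bound would drop to 409.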