There is a small problem in one of my code bases that has bothered me for about a year and a half. The code works fine, but I don't fully understand why, so I thought I would finally try to sort it out.
For a while now I have had a CUDA kernel that computes magnetic and electric fields. The ptxas output for it is as follows:
ptxas info : Compiling entry function '_Z17integrateParticleP6float4S0_fj'
ptxas info : Used 47 registers, 56+28 bytes lmem, 32+28 bytes smem, 24 bytes cmem, 160 bytes cmem
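With 8192 registers per multiprocessor, floor(8192/47) = 174 threads, and rounding down to a whole number of warps (multiples of 32) gives 160. So I expected to be able to launch this kernel with a block size of 160 threads, but 128 is the largest block size that actually runs.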
Could someone please explain this? Shared memory usage is not the limiting factor either, so I'm really at a loss as to the reason.
Thanks for the help,