Occupancy calculation checks out, but still getting 'out of resource' error.

There is a small problem that has been troubling one of my code bases. The code functions fine, but my understanding is lacking, so I thought I would finally try to sort this 1.5-year-old problem out.

I have had some code for computing magnetic and electric fields in CUDA for a while.

The ptxas output is as follows:

[codebox]ptxas info : Compiling entry function '_Z17integrateParticleP6float4S0_fj'
ptxas info : Used 47 registers, 56+28 bytes lmem, 32+28 bytes smem, 24 bytes cmem[0], 160 bytes cmem[1][/codebox]

So taking floor(8192/47) = 174, and taking warps into account, I should be able to run this code with a block size of 160 threads; however, 128 is the maximum size I can actually run.

Could someone please explain this to me? My shared memory usage is also okay, so I'm really lost on the justification.

Thanks for the help,

I happened to download the occupancy calculator again and noticed that with a block size of 160 threads, my register count is 9216. Could someone explain why it's not simply 160 * 47 = 7520?


I believe it has to do with the instruction unit. Try to keep your thread count a multiple of 2, but most of all a multiple of 64 (registers are allocated in blocks of 64 threads at a time).

For what you're asking, I think you'd do best by finding 5 registers you aren't using much, moving them into local memory, and running 192 threads with 42 registers: 192 * 42 = 8064, which will work with 192 threads.

Even with a wait of 400-600 cycles every time you access local memory, with 192 threads, as long as you're not using __syncthreads() too much, it will be fine and you'll see almost no performance loss.

Your only other alternative is 128 threads.

Edit: actually, you might see a bit of speed loss with only 192 threads. Is it possible to use shared memory for those 5 registers?
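If it is, the usual trick is to give each thread its own slot in a __shared__ array sized to the block, so the value lives in shared memory instead of a register. A minimal sketch, with invented names and a single spilled value (your real kernel is integrateParticle and likely differs):

```cuda
// Hypothetical sketch: spill one per-thread scratch value into shared
// memory instead of holding it in a register. Each value moved this
// way costs blockDim.x * sizeof(float) bytes of smem per block.
__global__ void integrateParticleSketch(float4 *pos, float4 *vel,
                                        float dt, unsigned int n)
{
    extern __shared__ float scratch[];   // one slot per thread

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Instead of 'float tmp = vel[i].x * dt;' (a register):
    scratch[threadIdx.x] = vel[i].x * dt;
    pos[i].x += scratch[threadIdx.x];
}
```

You'd launch it with the dynamic shared size, e.g. `integrateParticleSketch<<<blocks, 192, 192 * sizeof(float)>>>(...)`. Do re-check the ptxas register count afterwards, since the compiler doesn't always drop a register just because the value moved.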

Where does it state this in the CUDA manual? I could see full-warp logic with 160 threads, because of the 32 threads in a warp, but I don't see anywhere that 64 would be an issue. Could you explain this more? Thanks!

PS: I could for sure use a little shared memory, thanks for the input.

NVIDIA_CUDA_ProgrammingGuide_2.3.pdf, section: Registers

[codebox]Registers

Generally, accessing a register is zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts.

The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them.

The compiler and thread scheduler schedule the instructions as optimally as possible to avoid register memory bank conflicts. They achieve best results when the number of threads per block is a multiple of 64. Other than following this rule, an application has no direct control over these bank conflicts. In particular, there is no need to pack data into float4 or int4 types.[/codebox]

You're welcome :)