Occupancy calculation checks out, but still getting an 'out of resource' error.

There is a small problem that has been troubling one of my code bases. The code functions fine, but my understanding is lacking, so I thought I would finally try to sort this 1.5-year-old problem out.

I have had some CUDA code for computing magnetic and electric fields for a while.

The ptxas output is as follows:

ptxas info : Compiling entry function ‘_Z17integrateParticleP6float4S0_fj’
ptxas info : Used 47 registers, 56+28 bytes lmem, 32+28 bytes smem, 24 bytes cmem[0], 160 bytes cmem[1]

So taking floor(8192/47) = 174, and rounding down to a whole number of 32-thread warps, I should be able to run this code with a block size of 160 threads; however, 128 is the maximum size I can actually launch.
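To make the numbers concrete, here is the back-of-the-envelope arithmetic I am doing, as a small C sketch. The 8192 registers per multiprocessor is the figure the occupancy calculator lists for my compute 1.x card, so treat it as an assumption:

[codebox]#include <stdio.h>

int main(void)
{
    /* Assumptions: 8192 registers per multiprocessor (compute 1.0/1.1 card),
       47 registers per thread as reported by ptxas, 32 threads per warp. */
    const int regs_per_sm     = 8192;
    const int regs_per_thread = 47;
    const int warp_size       = 32;

    /* Naive limit: how many threads' worth of registers fit on one SM. */
    int max_threads = regs_per_sm / regs_per_thread;          /* 174 */

    /* Round down to a whole number of warps. */
    int max_block = (max_threads / warp_size) * warp_size;    /* 160 */

    printf("naive max threads = %d, rounded down to warps = %d\n",
           max_threads, max_block);
    return 0;
}[/codebox]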

Could someone please explain this to me? My shared memory usage is also fine, so I'm really lost on the 'justification'.

Thanks for the help,

I happened to download the occupancy calculator again and noticed that with a block size of 160 threads, my register count is 9216. Could someone explain why it's not simply 160*47 = 7520?

Thanks,

I believe it has to do with the instruction unit. Try to keep your thread count a multiple of 2, but most of all a multiple of 64 (registers get allocated in blocks of 64 threads at a time).

For what you're asking, I think you'd do best to find 5 registers you aren't using much, move them into local memory, and run 192 threads with 42 registers: 192*42 = 8064, so this will work with 192 threads.
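To show what I mean by blocks of 64 threads, here is a small sketch of how I believe the occupancy calculator arrives at its per-block register count on compute 1.0/1.1 cards. The 64-thread warp allocation granularity and the 256-register allocation unit are values I'm reading off the calculator spreadsheet, so treat them as assumptions rather than something from the manual:

[codebox]#include <stdio.h>

/* Round x up to the nearest multiple of m. */
static int round_up(int x, int m) { return ((x + m - 1) / m) * m; }

/* Registers reserved for one block, as I believe the occupancy calculator
   computes it for compute 1.0/1.1:
   - threads are allocated two warps (64 threads) at a time,
   - the product is then rounded up to a 256-register allocation unit. */
static int regs_per_block(int threads, int regs_per_thread)
{
    return round_up(round_up(threads, 64) * regs_per_thread, 256);
}

int main(void)
{
    const int regs_per_sm = 8192;   /* assumption: compute 1.0/1.1 register file */

    printf("160 threads x 47 regs -> %d (fits: %d)\n",
           regs_per_block(160, 47), regs_per_block(160, 47) <= regs_per_sm); /* 9216, no  */
    printf("128 threads x 47 regs -> %d (fits: %d)\n",
           regs_per_block(128, 47), regs_per_block(128, 47) <= regs_per_sm); /* 6144, yes */
    printf("192 threads x 42 regs -> %d (fits: %d)\n",
           regs_per_block(192, 42), regs_per_block(192, 42) <= regs_per_sm); /* 8192, yes */
    return 0;
}[/codebox]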

Even with a wait of 400-600 cycles every time you access local memory, with 192 threads, and as long as you're not using __syncthreads() too much, it will be fine and you'll see almost no performance loss.

Your only other alternative is 128 threads.

Edit: actually, you might see a bit of speed loss with only 192 threads. Is it possible to use shared memory for those 5 registers?
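Something like this is what I have in mind for moving a value out of a register and into shared memory. It's only a sketch: the kernel body, the scratch array, and the names are made up, not your actual integrateParticle code:

[codebox]// Sketch only: the kernel and names below are invented for illustration.
// The point is that a value which would otherwise occupy a register (or
// spill to local memory) can live in shared memory instead, indexed by
// the thread's own slot.
#define BLOCK_SIZE 192

__global__ void integrateSketch(float4 *pos, float4 *vel, float dt, unsigned int n)
{
    // One shared-memory slot per thread stands in for one register per thread.
    __shared__ float scratch[BLOCK_SIZE];

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Instead of: float tmp = vel[i].x * dt;   (one more register)
    scratch[threadIdx.x] = vel[i].x * dt;

    // Use it later exactly as the register would have been used.
    pos[i].x += scratch[threadIdx.x];
}[/codebox]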

Where does it state this in the CUDA manual? I can see the full-warp logic of 160 threads, because there are 32 threads in a warp, but I don't see anywhere that 64 would be an issue. Could you explain this more? Thanks!

PS: I could for sure use a little shared memory; thanks for the input.

NVIDIA_CUDA_ProgrammingGuide_2.3.pdf, section 5.1.2.6 Registers:

[codebox]5.1.2.6 Registers

Generally, accessing a register is zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts.

The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them.

The compiler and thread scheduler schedule the instructions as optimally as possible to avoid register memory bank conflicts. They achieve best results when the number of threads per block is a multiple of 64. Other than following this rule, an application has no direct control over these bank conflicts. In particular, there is no need to pack data into float4 or int4 types.[/codebox]

You're welcome :)