When I increased the block size from 128 to 256, I saw a decrease in performance (from 1.5 ms to 1.6 ms). However, according to the compiler output, no local variables spill to local memory (they are still stored in registers), and the occupancy is still 1.0. What could be responsible for the performance decrease? Thank you.
Try to get your application to run longer. A difference of 1.5 ms versus 1.6 ms is probably within the measurement noise of your system. A single OS context switch during your application would be enough to account for the difference. It may not have anything to do with the kernel actually taking longer to execute.
Try putting a loop around your kernel, launching it many times, and taking the average time.
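A minimal sketch of that averaging loop, written in plain C++ so it stays independent of any particular GPU API. The `average_ms` helper and its `warmup` parameter are hypothetical names introduced for illustration. One assumption to watch: CUDA kernel launches are asynchronous, so the callable passed in has to block until the kernel has finished (for example by synchronizing with the device after the launch), otherwise only the launch overhead gets timed.

```cpp
#include <chrono>

// Hypothetical helper: calls `launch` a few times untimed (warm-up),
// then `runs` more times, and returns the mean wall-clock time per
// call in milliseconds.
// Assumption: `launch` does not return until the work has completed.
template <typename F>
double average_ms(F&& launch, int runs, int warmup = 3) {
    // Warm-up calls absorb one-time costs such as context creation.
    for (int i = 0; i < warmup; ++i) {
        launch();
    }

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        launch();
    }
    auto stop = std::chrono::steady_clock::now();

    // Total elapsed time in milliseconds, divided by the number of timed calls.
    std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / runs;
}
```

With `runs` in the hundreds or thousands, a single OS context switch is spread over the whole loop instead of landing on one 1.5 ms measurement.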
By 1.0 occupancy you mean 100%, right? I guess you are not using shared memory, or only a very small amount of it, then? …
Also, are your reads and writes to global memory coalesced? I ask because I have seen that a larger number of independent thread blocks helps hide global memory latency better.
Yes, the rule is the more the merrier. As suggested, you hide the global memory latency better when you can fit more blocks on each SM.
I did launch the kernel many times, and the time I reported is the average.
You mean that more blocks are preferable to a bigger block size?
In a way… but you have to give priority to occupancy first, as that provides more bang.
Once you reach a good occupancy value, around 50% (for a single-precision kernel) on CC 1.3 devices, then I guess having more blocks has the bigger positive impact on performance; I say this from my experience.
Hence, if both your implementations have the same occupancy and register usage, then a larger number of blocks should help you hide the memory latency better. It will also help your application scale to future devices like Fermi, which have a lot more cores.
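To make the same-occupancy comparison concrete, here is a rough sketch of the residency arithmetic for a CC 1.3 multiprocessor (at most 1024 resident threads, 8 resident blocks, and 16384 registers per SM). It is a simplified model, not the official occupancy calculator: it ignores shared memory and register allocation granularity. The 16 registers per thread used in the note below is a made-up figure, since the real register count is not given here. `SmLimits`, `Residency`, and `estimate` are names invented for this example.

```cpp
#include <algorithm>

// Per-SM resource limits for compute capability 1.3 devices.
struct SmLimits {
    int max_threads = 1024;  // resident threads per SM
    int max_blocks  = 8;     // resident blocks per SM
    int registers   = 16384; // 32-bit registers per SM
    int warp_size   = 32;
};

struct Residency {
    int    blocks_per_sm; // blocks that fit on one SM at the same time
    double occupancy;     // resident warps / maximum resident warps
};

// Simplified estimate: shared memory and register allocation
// granularity are ignored. Assumes regs_per_thread > 0.
Residency estimate(int block_size, int regs_per_thread,
                   const SmLimits& sm = SmLimits{}) {
    // Blocks are scheduled in whole warps, so round the block size up.
    int warps_per_block = (block_size + sm.warp_size - 1) / sm.warp_size;
    int threads_rounded = warps_per_block * sm.warp_size;

    // Each resource imposes its own cap on resident blocks.
    int by_threads = sm.max_threads / threads_rounded;
    int by_regs    = sm.registers / (regs_per_thread * threads_rounded);
    int blocks     = std::min({by_threads, by_regs, sm.max_blocks});

    int max_warps = sm.max_threads / sm.warp_size;
    double occupancy =
        static_cast<double>(blocks * warps_per_block) / max_warps;
    return {blocks, occupancy};
}
```

Under these assumptions, `estimate(128, 16)` gives 8 blocks per SM and `estimate(256, 16)` gives 4, both at occupancy 1.0, so the two configurations differ only in how many independent blocks each SM holds. `estimate(64, 16)` hits the 8-block limit and drops to occupancy 0.5.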
But you should also keep in mind the register read-after-write (RAW) latencies, which depend on your kernel's register usage and how you actually coded up the algorithm. RAW latencies are hidden once there are more than 192 active threads per SM, but they have less impact on performance than memory latency, so they mainly affect algorithms that are compute bound. Looks like yours is memory bound.
Hope this helps
Larger blocks will tend to be slower if you’re using __syncthreads(). This may only be the case if you have branching that isn’t consistent across the block, though.