Performance question or bug in a program?


Can someone explain the peaks in the attached chart.png, which I get when I run the program with a thread count that is a multiple of 16?

As I understand it, adding one thread to an existing 64 threads should cause a big slowdown (as the chart shows, by about a factor of 2), but with 55 or 63 threads or so, the slowdown shouldn't be so obvious. Or is that an indication of some bug in my program :mellow: ?



This is most likely due to memory coalescing, which you get when you run a multiple of 16 threads (one half-warp). See the programming guide for the memory coalescing rules: coalesced accesses can be more than an order of magnitude faster than uncoalesced ones (roughly 70 GiB/s vs 3 GiB/s).
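To make the alignment argument concrete, here is a rough host-side sketch. It assumes (I haven't seen your kernel) that each thread reads the float at index blockIdx.x * blockDim.x + threadIdx.x and that the array base is properly aligned. The function name halfWarpsAligned is just made up for illustration. It checks whether every half-warp of every block starts on a 16-element boundary, which is one of the conditions the programming guide lists for coalescing.

```cpp
// Hypothetical helper, not part of any CUDA API.
// Assumption: thread t of block b accesses element b * blockSize + t
// of a float array whose base address is suitably aligned.
// Returns true if the first thread of every half-warp (16 threads)
// lands on an index that is a multiple of 16.
bool halfWarpsAligned(int blockSize, int numBlocks) {
    const int halfWarp = 16;
    for (int block = 0; block < numBlocks; ++block) {
        for (int first = 0; first < blockSize; first += halfWarp) {
            int globalIndex = block * blockSize + first;
            if (globalIndex % halfWarp != 0) {
                return false;  // this half-warp starts misaligned
            }
        }
    }
    return true;
}
```

With more than one block, this returns true for block sizes like 48 or 64 and false for 55 or 63, because every block after the first then starts at a misaligned offset. If your kernel indexes memory differently, the picture may change.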

Also note that the warp size is 32, so you should always run blocks sized with a multiple of 32 threads.

In fact, for future-proof code, you should use block sizes that are a multiple of the warp size you obtain from

 cudaGetDeviceProperties(&deviceProp, chosenDevice);

(the value ends up in deviceProp.warpSize) instead of hard-coding 32 :)

Please check the deviceQuery SDK example for details.
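For the rounding itself, something like the sketch below should do. The name roundUpToWarp is hypothetical, and the warp size is passed in as a plain parameter so the snippet compiles without the CUDA headers. In a real program you would pass deviceProp.warpSize after calling cudaGetDeviceProperties.

```cpp
// Hypothetical helper, not part of the CUDA runtime.
// Rounds a requested thread count up to the next multiple of warpSize.
// Assumes threads > 0 and warpSize > 0; in a CUDA program warpSize
// would be read from the cudaDeviceProp struct rather than hard-coded.
int roundUpToWarp(int threads, int warpSize) {
    return ((threads + warpSize - 1) / warpSize) * warpSize;
}
```

For example, with a warp size of 32, a request for 55 threads becomes 64 and a request for 65 becomes 96. The extra threads then need a bounds check in the kernel so they don't touch memory past the end of the data.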

Yes, that might be the cause.

Thanks to both for help. :smile: