I measured the performance of the atomicAdd function on a C2050.
We measured execution time for the case where there are no memory conflicts and global memory accesses can be coalesced.
We fixed the number of blocks and increased the number of threads per block.
According to our results,
the performance of atomicAdd improved sharply whenever the number of threads per block was a multiple of the half-warp size (16).
It improved even more when the thread count was a multiple of the warp size (32).
I wonder what the reason is.
I have attached a graph of our results.
Number of blocks: 140
X-axis: threads per block
Y-axis: time (ms)
Red line: atomicAdd, no memory conflicts, coalesced access
Blue line: += operation, no memory conflicts, coalesced access
Attachment: atomicVSnonatomic.pdf (169 KB)
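The test code itself was not posted. As a rough sketch of the kind of comparison described above, the following assumes `int` elements, one element per thread, and a repeat count `iters`; the names `atomic_kernel`, `plain_kernel`, and `run_test` are hypothetical and not taken from the original code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed setup: each thread atomically increments its own element, so there
// are no conflicts and consecutive threads touch consecutive addresses.
__global__ void atomic_kernel(int *data, int iters) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        atomicAdd(&data[idx], 1);
}

// Same access pattern with a plain read-modify-write. The volatile pointer
// keeps the compiler from folding the loop into a single add.
__global__ void plain_kernel(int *data, int iters) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    volatile int *p = data + idx;
    for (int i = 0; i < iters; ++i)
        *p = *p + 1;
}

// Launches one kernel, stores the elapsed time in *ms, and returns true
// if every element ended up equal to iters.
bool run_test(bool use_atomic, int blocks, int threads, int iters, float *ms) {
    const int n = blocks * threads;
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_data, 0, n * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    if (use_atomic) atomic_kernel<<<blocks, threads>>>(d_data, iters);
    else            plain_kernel<<<blocks, threads>>>(d_data, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    std::vector<int> host(n);
    cudaMemcpy(host.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int v : host) ok = ok && (v == iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ok;
}
```

Calling `run_test(true, 140, t, iters, &ms)` and `run_test(false, 140, t, iters, &ms)` for increasing `t` would give two timing curves of the kind plotted in the attachment. The first launch in a process includes one-time startup cost, so a warm-up call before timing is advisable.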
Could you describe how you use the atomic operation so that each thread accesses a different memory location?
What do you mean by “coalesced access” for these atomic operations?
Can you add random conflicts to the test case?
(This would ensure that no specific code optimization is made based on coalesced access and fully loaded warps.)
This code is too simple. With full warps, the device or the compiler (when generating device code) may be smart enough to replace atomic_add with a plain add instruction.
It would be much more interesting if a small number of increment conflicts appeared.
And the dependence of the performance degradation on the conflict percentage is probably much more interesting than the raw performance itself.
That is, a programmer writes atomic_add when he can't guarantee the absence of conflicts, even though conflicts may be very rare.
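One way such a conflict test could look is sketched below. It assumes that a chosen percentage of threads is redirected to a single shared slot while the rest keep their own element; the selection uses a cheap integer hash of the thread index, which is deterministic rather than truly random. The names `conflict_kernel` and `run_conflict_test` and the parameter `conflict_pct` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Roughly conflict_pct percent of the threads add to the shared slot data[n];
// all other threads add to their own element data[idx].
__global__ void conflict_kernel(int *data, int n, int conflict_pct, int iters) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    unsigned h = (unsigned)idx * 2654435761u;  // multiplicative hash, not true randomness
    h ^= h >> 16;
    int target = ((int)(h % 100u) < conflict_pct) ? n : idx;
    for (int i = 0; i < iters; ++i)
        atomicAdd(&data[target], 1);
}

// Runs the kernel once, stores the elapsed time in *ms, and returns the sum
// over all slots. With correct atomics the sum is blocks * threads * iters,
// whatever the conflict percentage. Returns -1 on a CUDA error.
long long run_conflict_test(int blocks, int threads, int conflict_pct,
                            int iters, float *ms) {
    const int n = blocks * threads;
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, (n + 1) * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_data, 0, (n + 1) * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    conflict_kernel<<<blocks, threads>>>(d_data, n, conflict_pct, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    std::vector<int> host(n + 1);
    cudaMemcpy(host.data(), d_data, (n + 1) * sizeof(int), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess);

    long long total = 0;
    for (int v : host) total += v;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ok ? total : -1;
}
```

Sweeping `conflict_pct` from 0 to 100 at a fixed launch configuration and recording `ms` would show how the run time degrades with the conflict rate. A single hot slot is the worst case; spreading the conflicts over several slots would be a milder variant.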
For the compiler to replace it with a plain add, the compiler would have to know that all threads in the warp are adding the “same number”. Unless you are using a “constant” for it, that will be very difficult for the compiler to determine.
For the device to do it, it would need comparators. On the face of it, that looks like complex hardware that would only come in handy in some corner cases anyway…