Performance of Histogram256 demo Atomic writes are slow when conflict

Hi y’all,

I was wondering how conflicts were handled when multiple threads try to do atomic writes to the same adress. I modified the Histogram256 demo in the CUDA SDK so that all pixels in the input image have the same value:


   // (...)

   printf("...generating input data\n");


   for(i = 0; i < DATA_N; i++)


      //h_Data[i] = rand() % 256;

      h_Data[i] = 255;


   // (...)

When the input values are random, atomic writes are being done to different adresses and there are relatively few conflits, here are the results I get with a 8800 GT:

CPU: 350 ms

SM10: 17 ms

SM11: 19 ms

Now when the input values are all 255, atomic writes are always in conflict:

CPU: 341 ms

SM10: 303 ms

SM11: 335 ms

For a real-time system the GPU becomes less attractive because in the worst-case scenario the processing time is similar to that of the CPU. Anyone knows of a trick to minimize write conflits?

Lots of solutions, mostly involving doing local accumulation and then one final global atomic add.

You could use shared atomics and let each block accumulate in shared memory, and only the block-wide answer sent via atomics to global.

You could try using threads, each writing into their own private buffer, and they accumulate their answers per block first and then the block accumulates globally via atomics. Probably you’d use 2-byte accumulators, and multiple passes over the image if you need to process more than 65535 input samples.

You could have each block accumulate and write to global memory, then run a second kernel to do a reduction on those values… (or do it on the CPU).

You could do a quick random sampling and try to detect cases where the histogram is degenerate with only 1 or 2 constant values, and switch to a simpler mode. You’d take one pass accumulating just those 1 or 2 special values locally in a register, and write those accumulations to global memory, then take a “Real” pass and accumulating all the remaining “rare” values with a traditional write.

Thanks for the hints!

Time for a useless smiley: :play_ball:

EDIT: removed my suggestion, as it probably wouldn’t have worked.

I effectively suggested the same technique as in the #ifndef ATOMICS case…oops

Just out of curiosity: Have you benchmarked the same scenario without using the atomics? Is it any faster?

The histogram256 demo tests all compute capability levels available (i.e. 1.0 and 1.1 on my GT 8800).

For an input vector with random values, computation with 1.1 (global memory atomics) is marginally faster than with 1.0 (no atomics), i.e. 17.0 for 1.1 and 19.8 for 1.0.

When the input vector contains constant value, computation with or without global atomics take about the same time, 278.8 ms.

I couldn’t test with computation capability 1.2 which has shared memory atomics, I guess that would likely be the best combination. With very low latency, the overhead of atomics has to be much lower than the 400+ cycles for external memory.

So then the issue we see may be shared memory bank conflicts mostly, when the per-warp or per-block histogram is formed.