Performance of Histogram256 demo: atomic writes are slow when they conflict

Hi y’all,

I was wondering how conflicts are handled when multiple threads do atomic writes to the same address. I modified the Histogram256 demo in the CUDA SDK so that all pixels in the input image have the same value:

   // (...)
   printf("...generating input data\n");
   srand(2007);
   for(i = 0; i < DATA_N; i++)
   {
      //h_Data[i] = rand() % 256;
      h_Data[i] = 255;
   }
   // (...)

When the input values are random, the atomic writes go to different addresses and there are relatively few conflicts. Here are the results I get with an 8800 GT:

CPU: 350 ms

SM10: 17 ms

SM11: 19 ms

Now when the input values are all 255, atomic writes are always in conflict:

CPU: 341 ms

SM10: 303 ms

SM11: 335 ms

For a real-time system the GPU becomes less attractive, because in the worst-case scenario the processing time is similar to that of the CPU. Does anyone know of a trick to minimize write conflicts?

There are lots of solutions, mostly involving local accumulation followed by one final global atomic add.

You could use shared atomics and let each block accumulate in shared memory, with only the block-wide result sent to global memory via atomics.
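For reference, here’s a minimal sketch of that approach, assuming a 256-bin histogram, byte-sized input and a grid-stride loop; the kernel name is mine, and shared memory atomicAdd needs compute capability 1.2+, so this wouldn’t run on an 8800 GT:

   #define BIN_COUNT 256

   __global__ void histSharedAtomics(unsigned int *d_Result,
                                     const unsigned char *d_Data, int dataN)
   {
      __shared__ unsigned int s_Hist[BIN_COUNT];

      // zero the block-local histogram
      for (int i = threadIdx.x; i < BIN_COUNT; i += blockDim.x)
         s_Hist[i] = 0;
      __syncthreads();

      // accumulate in fast shared memory; conflicts stay on-chip
      for (int pos = blockIdx.x * blockDim.x + threadIdx.x; pos < dataN;
           pos += blockDim.x * gridDim.x)
         atomicAdd(&s_Hist[d_Data[pos]], 1);
      __syncthreads();

      // one global atomic per bin per block instead of one per input element
      for (int i = threadIdx.x; i < BIN_COUNT; i += blockDim.x)
         atomicAdd(&d_Result[i], s_Hist[i]);
   }

d_Result is assumed to be zeroed beforehand (e.g. with cudaMemset).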

You could give each thread its own private buffer: the threads accumulate their answers per block first, and then the block accumulates globally via atomics. You’d probably use 2-byte accumulators, and multiple passes over the image if you need to process more than 65535 input samples.
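A rough sketch of a per-warp variant of this idea, reconstructed from the software-tagging trick the SDK sample uses on pre-atomics hardware rather than the 2-byte-accumulator scheme described above (so counters here are 27-bit, with the warp lane id stored in the top 5 bits; all names are mine):

   #define BIN_COUNT  256
   #define WARP_COUNT 6                 // launch with WARP_COUNT*32 threads
   #define TAG_MASK   0x07FFFFFFU       // low 27 bits hold the count

   // software "atomic" increment: races only happen inside a warp, so a
   // 5-bit lane tag in the top bits lets the losers detect and retry
   __device__ void addByte(volatile unsigned int *s_WarpHist,
                           unsigned int data, unsigned int threadTag)
   {
      unsigned int count;
      do {
         count = s_WarpHist[data] & TAG_MASK;
         count = threadTag | (count + 1);
         s_WarpHist[data] = count;
      } while (s_WarpHist[data] != count);
   }

   __global__ void histPerWarp(unsigned int *d_Result,
                               const unsigned char *d_Data, int dataN)
   {
      __shared__ unsigned int s_Hist[WARP_COUNT * BIN_COUNT];
      const int warpBase = (threadIdx.x >> 5) * BIN_COUNT;
      const unsigned int threadTag = threadIdx.x << 27; // keeps lane id bits

      for (int i = threadIdx.x; i < WARP_COUNT * BIN_COUNT; i += blockDim.x)
         s_Hist[i] = 0;
      __syncthreads();

      for (int pos = blockIdx.x * blockDim.x + threadIdx.x; pos < dataN;
           pos += blockDim.x * gridDim.x)
         addByte(s_Hist + warpBase, d_Data[pos], threadTag);
      __syncthreads();

      // merge the per-warp histograms, one global atomic per bin per block
      for (int bin = threadIdx.x; bin < BIN_COUNT; bin += blockDim.x) {
         unsigned int sum = 0;
         for (int w = 0; w < WARP_COUNT; w++)
            sum += s_Hist[w * BIN_COUNT + bin] & TAG_MASK;
         atomicAdd(d_Result + bin, sum);
      }
   }

The retry loop relies on the warp executing in lockstep: when several lanes hit the same bin, only one tagged write survives, and the other lanes loop until theirs does.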

You could have each block accumulate and write to global memory, then run a second kernel to do a reduction on those values… (or do it on the CPU).
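For the second kernel, a sketch of what the reduction might look like, assuming pass 1 stored each block’s 256-bin histogram contiguously in d_Partial (names and layout are mine); this version needs no atomics at all:

   #define BIN_COUNT 256

   // sum the per-block partial histograms; launch one thread per bin,
   // e.g. histReduce<<<1, BIN_COUNT>>>(d_Result, d_Partial, blockN)
   __global__ void histReduce(unsigned int *d_Result,
                              const unsigned int *d_Partial, int blockN)
   {
      int bin = blockIdx.x * blockDim.x + threadIdx.x;
      if (bin < BIN_COUNT) {
         unsigned int sum = 0;
         for (int b = 0; b < blockN; b++)
            sum += d_Partial[b * BIN_COUNT + bin];
         d_Result[bin] = sum;
      }
   }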

You could do a quick random sampling to detect cases where the histogram is degenerate, with only 1 or 2 constant values, and switch to a simpler mode: take one pass accumulating just those 1 or 2 special values locally in a register and write those accumulations to global memory, then take a “real” pass accumulating all the remaining “rare” values with a traditional write.
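A hypothetical host-side version of that check, sampling a handful of inputs and counting distinct values before picking a kernel (function name, sample count and threshold are all mine):

   #include <stdlib.h>

   // returns nonzero if a random sample suggests a (near-)degenerate histogram
   static int looksDegenerate(const unsigned char *h_Data, int dataN, int samples)
   {
      unsigned char seen[256] = { 0 };
      int distinct = 0;
      for (int i = 0; i < samples; i++) {
         unsigned char v = h_Data[rand() % dataN];
         if (!seen[v] && ++distinct > 2)
            return 0;          // too many distinct values: normal kernel
         seen[v] = 1;
      }
      return 1;                // 1 or 2 dominant values: special-case mode
   }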

Thanks for the hints!

Time for a useless smiley: :play_ball:

EDIT: removed my suggestion, as it probably wouldn’t have worked.

I effectively suggested the same technique as in the #ifndef ATOMICS case…oops

Just out of curiosity: Have you benchmarked the same scenario without using the atomics? Is it any faster?

The histogram256 demo tests all compute capability levels available (i.e. 1.0 and 1.1 on my 8800 GT).

For an input vector with random values, computation with 1.1 (global memory atomics) is marginally faster than with 1.0 (no atomics): 17.0 ms for 1.1 vs. 19.8 ms for 1.0.

When the input vector contains a constant value, computation with or without global atomics takes about the same time, 278.8 ms.

I couldn’t test with compute capability 1.2, which has shared memory atomics; I guess that would be the best combination. With shared memory’s very low latency, the overhead of atomics should be much lower than the 400+ cycles for external memory.

So the issue we see may be mostly shared memory bank conflicts, arising when the per-warp or per-block histogram is formed.