It is satisfying to design code that flies, but sometimes that is complicated to do and the application won’t be used often enough to warrant a lot of effort.
To help me decide when a simple approach (say, using uncoalesced reads and atomic updates) might do the job, I needed some performance data.
My GeForce GTX 285 gave the following:
Random reads directly from global memory: ~250 million per second. #1
Atomic functions: ~2.5 million per second on average. #2
Sorry if this is already somewhere in the documentation. If it is and you know where, please let me know.
#1 was pretty constant across a range of conditions, e.g. I varied the number of threads doing the reads from 1 thread/block to 64/block and the throughput only varied by a few percent.
The code I used reads a large (I tried both 400MB and 1600MB) preconfigured array of long. ‘limit’ and the number of threads were set large enough that the GPU took about 4.7 seconds:
for (int ii = 0; ii < limit; ii++)
    cell = gData[cell];  // each read depends on the result of the previous one
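The post does not say how the array was preconfigured, so here is a minimal host-side sketch of one way it could be done, assuming the goal is that each element holds the index of the next element to read. The names makeChaseTable and chase are hypothetical, and using Sattolo's algorithm to arrange the entries as a single random cycle is my assumption, not necessarily what was used for the numbers above.

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// ASSUMED setup (not from the post): build a table in which every entry
// holds the index of the next entry to read, arranged as one random cycle
// using Sattolo's algorithm. A single cycle means a chase starting anywhere
// visits every element before it repeats.
std::vector<long> makeChaseTable(std::size_t n, unsigned seed)
{
    std::vector<long> table(n);
    std::iota(table.begin(), table.end(), 0L);  // start from the identity mapping
    if (n < 2)
        return table;

    std::mt19937 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Sattolo: swap with a strictly earlier position (never with itself).
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(table[i], table[pick(rng)]);
    }
    return table;
}

// Host-side mirror of the loop above: follow the chain for `limit` steps.
long chase(const std::vector<long>& gData, long cell, long limit)
{
    for (long ii = 0; ii < limit; ii++)
        cell = gData[cell];
    return cell;
}
```

A table built this way would be copied to the device as gData, with each thread starting from its own cell. Because every read depends on the previous one, the hardware cannot overlap the reads within a thread, which is what makes this a test of random rather than coalesced access.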
#2 This is calculated from the time taken by the SDK simpleAtomicIntrinsics project, which runs 11 different atomic functions. It is probably a bit of an underestimate, as the run time was only 74ms, so kernel setup time would have had a large impact.
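The arithmetic behind a figure like #2 can be sketched as below. The 11 atomic functions and the 74ms run time come from the post. The launch size is not stated there, so the 64 blocks of 256 threads in the usage note are an assumption for illustration only and should be replaced with whatever the sample actually launches. The function name throughput is hypothetical.

```cpp
// Operations per second, given how many threads ran, how many operations
// each thread performed, and the elapsed wall-clock time in seconds.
double throughput(double threads, double opsPerThread, double seconds)
{
    return threads * opsPerThread / seconds;
}

// Example inputs:
//   opsPerThread = 11     (from the post: 11 atomic functions)
//   seconds      = 0.074  (from the post: 74 ms run time)
//   threads      = 64 * 256 = 16384  (ASSUMED launch size, not from the post)
inline double exampleAtomicRate()
{
    return throughput(64.0 * 256.0, 11.0, 0.074);
}
```

With that assumed launch size the result is roughly 2.4 million atomic operations per second, in line with the ~2.5 million quoted above. The same function applies to #1 by passing the number of reading threads, ‘limit’ as opsPerThread, and the 4.7 second run time.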