Since the reads from a always access indices greater than or equal to i, no iteration of the loop depends on the previous iteration. Therefore, you should be able to do this trivially with double buffering. Create an array b to hold the results, and have each thread run:
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < n - 3) {
    b[i] = a[i] + a[i+1] + a[i+2] + a[i+3];
}
No more race condition. To make this code fast on non-Fermi GPUs, you'll want to use a 1D texture instead of accessing a directly, or load the relevant section of a into shared memory so that all the threads in a block can read from it.
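(A minimal sketch of that shared-memory staging idea, assuming float data and a launch with blockDim.x == BLOCK_SIZE; the kernel name, BLOCK_SIZE, and the exact load pattern are illustrative, not code from this thread:)

#define BLOCK_SIZE 256

__global__ void sum4_shared(const float *a, float *b, int n)
{
    // Stage BLOCK_SIZE + 3 input elements per block so the last threads
    // can read 3 elements past the block boundary.
    __shared__ float s[BLOCK_SIZE + 3];

    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Each thread loads its own element; the first three threads also
    // load the overlap region at the end of the tile.
    if (i < n)
        s[threadIdx.x] = a[i];
    if (threadIdx.x < 3 && i + BLOCK_SIZE < n)
        s[threadIdx.x + BLOCK_SIZE] = a[i + BLOCK_SIZE];
    __syncthreads();

    if (i < n - 3)
        b[i] = s[threadIdx.x]     + s[threadIdx.x + 1]
             + s[threadIdx.x + 2] + s[threadIdx.x + 3];
}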
It does seem to win, but the cache on Fermi definitely closes the gap. I implemented the above in two ways: a simple-minded kernel that does 4 global memory reads and 1 write per element, and a kernel that uses shared memory as a staging area with 1+epsilon global memory reads and 1 write per element. (The epsilon is because you have to read 3 elements beyond the size of the block to compute the output sum.) Note that the MB/sec listed below is "MB of data in the input array / runtime of kernel" and not a measure of the actual total memory transfer rate.
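(For reference, a host-side timing sketch showing how such an "input MB / kernel runtime" figure can be computed with CUDA events; the helper name, block size, and the assumption that the arrays are already allocated on the device are mine, not part of the original benchmark:)

#include <cstdio>

// Illustrative timing helper (not the original benchmark code): reports
// "MB of data in the input array / runtime of kernel", as defined above.
void report_mb_per_sec(const float *d_a, float *d_b, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int block = 256;
    int grid  = (n + block - 1) / block;

    cudaEventRecord(start, 0);
    sum4_shared<<<grid, block>>>(d_a, d_b, n);   // or the simple kernel
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // runtime in milliseconds

    double input_mb = (double)n * sizeof(float) / (1024.0 * 1024.0);
    printf("%.1f MB/sec\n", input_mb / (ms * 1e-3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}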
The shared memory implementation still wins, but the simple version is only 30% slower rather than 2.5x slower. The cache seems to be doing a pretty good job here.
I’m not quite sure what is driving the 30% gap, since I expect we are hitting the L1 cache almost exclusively. It should be just as fast as shared memory, unless the staggered reads are causing extra memory transactions to be generated for each warp.
Anyway, nice to see that the new cache is reducing (but not yet eliminating) the need for shared memory fanciness to get reasonable performance.
Seibert - Thanks for this drop-in mini-benchmark test. It would be great if the community could collect a number of generic benchmarks for various memory usage patterns besides the somewhat specialized but excellent SDK examples. Especially for those of us working our way up the learning curve. V.
Meaning that if there is enough locality for accesses to hit L1 most of the time, but the access pattern would cause many smem bank conflicts, then the cache would actually win. Interesting. Although it means some smem-based solutions won't scale to Fermi as well as you would like them to.
Actually it’s quite obvious if you think about it: if you can control access patterns then software-managed cache (smem) wins. Otherwise hardware-managed cache wins.