Race Conditions

Hello Everyone,

I have to write a kernel to parallelize the following code, and each time I try there seems to be a race condition.

Can you please suggest an optimal way to write the kernel for this code?

for (int i = 0; i < n - 3; i++)
{
    a[i] = a[i] + a[i+1] + a[i+2] + a[i+3];
}

Thanks and Regards

Since the reads from a always access indices greater than or equal to i, no iteration of the loop depends on the result of a previous iteration. Therefore, you should be able to do this trivially with double buffering. Create an array b to hold the results, and have each thread run:

int i = blockDim.x * blockIdx.x + threadIdx.x;

if (i < n - 3) {
    b[i] = a[i] + a[i+1] + a[i+2] + a[i+3];
}
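For context, a complete kernel built around that body might look like the following sketch. The kernel name sum4, the float element type, and the 256-thread launch configuration are assumptions for illustration, not from the post:

```cuda
#include <cuda_runtime.h>

// Hypothetical complete kernel around the body above; name, element
// type, and launch configuration are illustrative assumptions.
__global__ void sum4(const float *a, float *b, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n - 3)
        b[i] = a[i] + a[i + 1] + a[i + 2] + a[i + 3];
}

// Host-side launch covering the n - 3 outputs:
//   int threads = 256;
//   int blocks  = (n - 3 + threads - 1) / threads;
//   sum4<<<blocks, threads>>>(d_a, d_b, n);
```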

No more race condition. To make this code fast on non-Fermi GPUs, you’ll want to use a 1D texture instead of directly accessing a, or load the relevant section of a into shared memory so that all the threads in a block can read it from there.
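A minimal sketch of that shared-memory staging idea, assuming a fixed block size, float elements, and a hypothetical kernel name (none of these details are from the thread): each block stages its slice of a, plus the 3-element overlap past the end of the block, into shared memory before computing the sums.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size for this sketch

__global__ void shared_sum(const float *a, float *b, int n)
{
    // Stage BLOCK_SIZE + 3 elements: each output needs to read
    // 3 elements past the end of the block's own range.
    __shared__ float s[BLOCK_SIZE + 3];

    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < n)
        s[threadIdx.x] = a[i];
    // The first three threads also fetch the 3-element overlap.
    if (threadIdx.x < 3 && i + BLOCK_SIZE < n)
        s[threadIdx.x + BLOCK_SIZE] = a[i + BLOCK_SIZE];

    __syncthreads();

    if (i < n - 3)
        b[i] = s[threadIdx.x] + s[threadIdx.x + 1]
             + s[threadIdx.x + 2] + s[threadIdx.x + 3];
}
```

This is where the "1+epsilon" global reads per element comes from: each element of a is loaded once, plus the small overlap region per block.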

IMO smem wins, Fermi or not: no bank conflicts, and only one gmem read and one write per element (and they will be coalesced).

It does seem to win, but the cache on Fermi definitely closes the gap. I implemented the above two ways: a simple-minded kernel that does 4 global memory reads and 1 write per element, and a kernel that uses shared memory as a staging area with 1+epsilon global memory read and 1 write per element. (The epsilon is because you have to read 3 elements beyond the size of the block to compute the output sum.) Note that the MB/sec listed below is “MB of data in the input array / runtime of kernel” and not a measure of the actual total memory transfer rate.

Device name: GeForce GTX 275

BogoGFLOPS: 699.8

Size of array: 16776960 elements

Simple sum: 4.221 ms, 3790.7 MB/sec

Shared sum: 1.808 ms, 8851.1 MB/sec

We would expect the “shared sum” to be a little less than 2.5x faster (5 global memory transactions per element versus slightly more than 2), and we get 2.33x.

Device name: GeForce GTX 470

BogoGFLOPS: 1088.6

Size of array: 16776960 elements

Simple sum: 2.080 ms, 7692.1 MB/sec

Shared sum: 1.616 ms, 9898.9 MB/sec

The shared memory implementation still wins, but the simple version is only 30% slower rather than 2.5x slower. The cache seems to be doing a pretty good job here.

I’m not quite sure what is driving the 30% gap, since I expect we are hitting the L1 cache almost exclusively. It should be just as fast as shared memory, unless the staggered reads are causing extra memory transactions to be generated for each warp.

Anyway, nice to see that the new cache is reducing (but not yet eliminating) the need for shared memory fanciness to get reasonable performance.

(Code for this test is at http://bitbucket.org/seibert/fermi_test/src/tip/cache.cu )

O M G

does this mean OMG the L1 cache is useful or does it mean OMG I still have to deal with shared memory to get best performance?

This “O M G” was for “Satyam Shivam”, the starter of this topic. I know him, and there is a reason for that OMG that he will understand…

And yeah, truly appreciate Seibert’s and others’ time on it.

Seibert, thanks for this drop-in mini-benchmark test. It would be great if the community could collect a number of generic benchmarks for various memory-usage patterns besides the somewhat specialized but excellent SDK examples, especially for those of us climbing the learning curve. V.

Meaning that if there is enough locality for accesses to hit L1 most of the time, but the access patterns would cause many smem bank conflicts, then the cache would actually win. Interesting. Although it means some smem-based solutions won’t scale to Fermi as well as you would like them to.

Actually it’s quite obvious if you think about it: if you can control the access patterns, then the software-managed cache (smem) wins; otherwise the hardware-managed cache wins.

Thanks a lot, everyone! I got it working, and did some comparisons with a Tesla C1060 too…

Thanks seibert!

How can one deal with a situation where the iterations of the loop depend on the previous iterations?

Can you please give me an example?

You can’t parallelize such a scenario :(