So I was running some of the SDK samples through Ocelot’s race detector and noticed the following problem in transposeNew.cu:
normal@atom:~/checkout/gpuocelot/tests/cuda2.2$ ./TransposeNew
CUDA device has 1 Multi-Processors
Matrix size: 128x128, tile size: 32x32, block size: 32x8
Kernel                  Loop over kernel    Loop within kernel
------                  ----------------    ------------------
simple copy                 0.00 GB/s           0.00 GB/s
shared memory copy          0.00 GB/s           0.00 GB/s
naive transpose             0.00 GB/s           0.00 GB/s
coalesced transpose         0.00 GB/s           0.00 GB/s
==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z24transposeNoBankConflictsPfS_iii" with exception:
==Ocelot== [PC 64] [thread 1] [cta 0] st.shared.f32 [%r33 + 0], %r61 - Shared memory race condition, address 0x4 was previously read by thread 32 without a memory barrier in between.
==Ocelot== Near /home/normal/checkout/gpuocelot/tests/cuda2.2/tests/transposeNew/transposeNew.cu:180:0
==Ocelot==
The offending code is here:
for (int r=0; r < nreps; r++) {
    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
    }

    __syncthreads();

    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
        odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
    }
}
tile is a block of shared memory. The race condition is that thread 32 reads from tile[1] at the end of one iteration of the outer loop, and thread 1 writes to tile[1] at the start of the next. Depending on how the warps are scheduled, these accesses may occur in the wrong order, so a stale value gets written out to odata. I fixed my local copy by adding a __syncthreads() at the end of the outer loop.
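For anyone following along, here is a sketch of the loop with that extra barrier added (same variable names as the SDK sample; this is my local fix, not what ships in the SDK):

```cuda
for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
    }

    __syncthreads();  // existing barrier: all writes to tile complete
                      // before any thread reads it back

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
    }

    __syncthreads();  // added barrier: all reads of tile complete
                      // before the next iteration overwrites it
}
```

Without the second barrier, nothing stops a fast warp from starting iteration r+1 and overwriting tile while a slow warp is still reading it in iteration r.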
Nice to see that NVIDIA devs also make this error. :) I have been burned by forgetting the extra __syncthreads() at the end of the loop multiple times…
Hi,
Out of curiosity - can you please explain (as slowly as possible, so I can try to understand :) ) how in God’s name Ocelot finds such things??? :)
Also - does the Windows version you posted in another thread work?
Ocelot has the ability to monitor every PTX instruction and memory transaction as it emulates them. So the race detector just attaches metadata to each shared memory address recording which thread last accessed it and how, and raises an error if one thread writes a location another thread has read (or reads one another thread has written) without an intervening __syncthreads().
Hi. I’ve studied the transpose code and documentation before, but even without doing so, it’s not hard to see that this race condition will not affect correctness. Why?
Because the same data gets loaded into tile on every iteration of the outer loop. I believe they wanted to amortize some overheads when timing, so they run the transpose several times (nreps).
I wouldn’t say that it is fully functional on Windows (you don’t get the LLVM JIT or the NVIDIA and AMD GPU backends), but you do get the emulator and everything that comes with it, including the race detector. I haven’t rigorously tested it, but I was able to build it in VS2008 and run a few example CUDA programs through it.
Maybe you should make this an option that can be enabled/disabled as desired (with reporting of same-data races disabled by default)? I can see cases where you write, say, a flag variable and there really is a race condition, but you just get lucky in that specific case because you happen to be overwriting with the same data.