Race Conditions in TransposeNew SDK Sample

So I was running some of the SDK samples through Ocelot’s race detector and noticed the following problem in transposeNew.cu:

normal@atom:~/checkout/gpuocelot/tests/cuda2.2$ ./TransposeNew 

CUDA device has 1 Multi-Processors

Matrix size: 128x128, tile size: 32x32, block size: 32x8

Kernel			Loop over kernel	Loop within kernel

------			----------------	------------------

simple copy		   	 0.00 GB/s		 0.00 GB/s

shared memory copy		 0.00 GB/s		 0.00 GB/s

naive transpose	   	 0.00 GB/s		 0.00 GB/s

coalesced transpose   	 0.00 GB/s		 0.00 GB/s

==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z24transposeNoBankConflictsPfS_iii" with exception: 

==Ocelot== [PC 64] [thread 1] [cta 0] st.shared.f32 [%r33 + 0], %r61 - Shared memory race condition, address 0x4 was previously read by thread 32 without a memory barrier in between.

==Ocelot== Near /home/normal/checkout/gpuocelot/tests/cuda2.2/tests/transposeNew/transposeNew.cu:180:0

==Ocelot==

The offending code is here:

  for (int r=0; r < nreps; r++) {
    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
      tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
    }

    __syncthreads();

    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
      odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
    }
  }

tile is a block of shared memory. The race is between iterations of the outer loop: thread 32 reads an element of tile at the end of one iteration, and thread 1 writes that same element at the start of the next iteration. Depending on how the warps are scheduled, these accesses can happen in the wrong order and cause the wrong value to be written out to odata. I fixed my local copy by adding a __syncthreads() at the end of the outer loop.
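
Concretely, the fix just appends a barrier to the body of the outer loop. This is a sketch based on the snippet above (same variable names, comments added), not the exact SDK source:

  for (int r=0; r < nreps; r++) {
    // load the tile from global memory into shared memory
    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
      tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
    }

    __syncthreads();   // all writes to tile are visible before any thread reads it

    // write the transposed tile back out to global memory
    for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
      odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
    }

    __syncthreads();   // added: make sure every thread is done reading tile
                       // before the next iteration starts overwriting it
  }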

Nice to see that NVIDIA devs also make this error. :) I have been burned by forgetting the extra __syncthreads() at the end of the loop multiple times…

But why would the code PASS instead of FAIL?

Hi,
Out of curiosity - can you please explain (as slowly as possible, so I can try to understand :) ) how in God’s name Ocelot finds such things??? :)
Also - does the Windows version you’ve posted in another thread work?

thanks
eyal

Ocelot has the ability to monitor every PTX instruction and memory transaction as it emulates them. It just attaches metadata to each shared memory address recording when it was last accessed and by which thread, and it flags an error if one thread reads an address that another thread has written (or writes an address another thread has read) without an intervening __syncthreads().

See here and here.
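
To illustrate the idea, here is a rough sketch of that bookkeeping. It is not Ocelot’s actual implementation; RaceTracker, onRead, onWrite, and onBarrier are made-up names, and a real detector tracks much more context (PC, CTA, the value stored, and so on):

  #include <cstdio>
  #include <unordered_map>

  // Per-word bookkeeping for one shared-memory address.
  struct SharedWordState {
      int lastWriter = -1;   // thread id of the last write since the last barrier
      int lastReader = -1;   // thread id of the last read since the last barrier
  };

  // Hypothetical tracker the emulator would consult on every shared ld/st.
  class RaceTracker {
      std::unordered_map<size_t, SharedWordState> state;   // keyed by shared address
  public:
      void onRead(size_t addr, int tid) {
          SharedWordState &w = state[addr];
          if (w.lastWriter != -1 && w.lastWriter != tid)
              std::printf("race: thread %d reads 0x%zx written by thread %d "
                          "without a barrier\n", tid, addr, w.lastWriter);
          w.lastReader = tid;
      }
      void onWrite(size_t addr, int tid) {
          SharedWordState &w = state[addr];
          if ((w.lastWriter != -1 && w.lastWriter != tid) ||
              (w.lastReader != -1 && w.lastReader != tid))
              std::printf("race: thread %d writes 0x%zx previously accessed by "
                          "another thread without a barrier\n", tid, addr);
          w.lastWriter = tid;
      }
      // __syncthreads() clears the bookkeeping: accesses before the barrier
      // can no longer race with accesses after it.
      void onBarrier() { state.clear(); }
  };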

Hi. I’ve studied the transpose code + documentation before, but even without doing so, it’s not hard to see that this race condition will not affect correctness. Why?

Because the same data gets loaded into tile on every iteration of the outer loop. I believe they wanted to amortize some overheads when timing, so they do the transpose several times (nreps).

Yep, you are right about this one, Uncle Joe. Thanks for keeping me honest.

I went back and updated the race detector to ignore writes that have the same data, so it shouldn’t report false positives like this anymore.
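
In terms of the sketch earlier in the thread, the change amounts to something like the following. Again, this is illustrative only: the onWrite(addr, tid, bits) interface and WordState are invented for the example, and reads are omitted for brevity:

  #include <cstdio>
  #include <unordered_map>

  // Track the last value stored at each shared word so that a cross-thread
  // write of the *same* value (as in transposeNew) is treated as benign.
  struct WordState {
      int lastAccessor = -1;   // last thread to read or write this word since a barrier
      bool hasValue = false;
      unsigned value = 0;      // raw bits last stored at this word
  };

  class RaceTracker {
      std::unordered_map<size_t, WordState> state;
  public:
      void onWrite(size_t addr, int tid, unsigned bits) {
          WordState &w = state[addr];
          bool conflict = (w.lastAccessor != -1 && w.lastAccessor != tid);
          bool benign   = (w.hasValue && w.value == bits);  // same data rewritten
          if (conflict && !benign)
              std::printf("race: thread %d writes 0x%zx\n", tid, addr);
          w.lastAccessor = tid;
          w.hasValue = true;
          w.value = bits;
      }
      // A barrier resets the accessor tracking but keeps the stored values,
      // so benign same-value rewrites are still recognized afterwards.
      void onBarrier() {
          for (auto &kv : state) kv.second.lastAccessor = -1;
      }
  };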

I wouldn’t say that it is fully functional on Windows (you don’t get the LLVM JIT or the NVIDIA and AMD GPU backends), but you do get the emulator and everything that comes with it (including the race detector). I haven’t rigorously tested it, but I was able to build it in VS2008 and run through a few example CUDA programs.

Maybe you should make this an option that can be enabled/disabled as desired (reporting same-data writes could be disabled by default)? I can see cases where you are writing, say, a flag variable, where there may be a race condition but you are just getting lucky that you are overwriting it with the same data in that specific case.

Thanks for the suggestion. Just added it to the trunk.