Invalid data returned during indirect access (happens with cmem/lmem/smem)

  • Operating System: Ubuntu 7.10, 32-bit

  • CUDA toolkit release version: CUDA Toolkit for Ubuntu 7.10

  • SDK release version: 2.0 Beta2

  • Driver version: 177.13

  • Compiler for CPU host code: g++ 4

  • System description: 2.8 GHz Pentium D, 512 MB RAM, GeForce 8800 GT (512 MB, if I recall correctly)

Note: everything described below works when compiled in device-emulation mode! The described behavior is specific to running on the GPU.

Summary:

When using indirect memory access, as in:

i = a[i]

the returned data is either 0x00 (usually) or garbage (rarely). Sometimes all of the program’s output data gets corrupted. Access using an immediate index (based on the known value of i) works.

Detailed description:

I have an array. Currently it’s stored in smem, but I’ve also tried cmem and lmem.

I have code that does

byte i;

...

i = a[i];

return i;

Since a is stored in smem, there is a __syncthreads() immediately after the load of data into a completes. No writes to shared memory occur after that point (verified).

Now, when performing the above, the value returned to the host code is 0x00.
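Roughly, the failing shared-memory setup looks like the sketch below. This is not my exact code: the array size, the block size, the index derivation and the names (lookup_smem, src, dst) are placeholders, and byte is assumed to be a typedef for unsigned char.

typedef unsigned char byte;

// Sketch of the failing pattern: cooperative load into smem, one
// __syncthreads(), then a single indirect read. Launched with 256 threads.
__global__ void lookup_smem(const byte *src, byte *dst)
{
    __shared__ byte a[256];

    a[threadIdx.x] = src[threadIdx.x];   // each of the 256 threads loads one element
    __syncthreads();                     // no shared-memory writes after this point

    byte i = (byte)threadIdx.x;          // known, in-range starting index
    i = a[i];                            // indirect access; this read comes back as 0x00
    dst[threadIdx.x] = i;                // copied back to the host for inspection
}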

Based on the known value of i at the beginning of the call (verified by stubbing out the call and making it a passthrough), I tried accessing the memory with an immediate index:

...

i = a[0x12];

return i;

The value returned is then correct.

I’ve tried moving the array to cmem (no synchronization issues there). Again, when accessed with an immediate index, the data is correct. When accessed indirectly, NOT ONLY is the data incorrect, but even if I return

0xFF | i

the value I get back on the host is 0x00 (even though the OR should force the result to at least 0xFF), and the rest of the data returned to the host is 0x00 as well.
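The cmem variant I used was along these lines (again only a sketch; the symbol is filled from the host with cudaMemcpyToSymbol, the names are placeholders, and byte is a typedef for unsigned char):

typedef unsigned char byte;

__constant__ byte a[256];                 // filled from the host via cudaMemcpyToSymbol()

__global__ void lookup_cmem(byte *dst)
{
    byte i = (byte)threadIdx.x;
    i = a[i];                             // indirect access into constant memory
    dst[threadIdx.x] = (byte)(0xFF | i);  // should be 0xFF regardless of a[i],
                                          // yet the host still receives 0x00
}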

I’ve also tried defining a local array (verified as stored in lmem):

byte a[16] = { 0xF,0x10,1,2,3,4,5,6,7,8,9,0xA,0xB,0xC,0xD,0xE };

i = a[i];

return i;

Again, access with an immediate index works. With indirect access, 0x00 is returned, a value that is not even present in the array!
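Spelled out as a complete kernel, that local-memory test was roughly the following (a sketch; the index derivation and the names are placeholders, byte is a typedef for unsigned char):

typedef unsigned char byte;

__global__ void lookup_lmem(const byte *src, byte *dst)
{
    // Per-thread table; the generated .ptx confirms it is placed in local memory.
    byte a[16] = { 0xF, 0x10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xA, 0xB, 0xC, 0xD, 0xE };

    byte i = (byte)(src[threadIdx.x] & 0x0F);  // known, in-range index
    i = a[i];                                  // indirect access; returns 0x00,
                                               // a value not even present in the table
    dst[threadIdx.x] = i;
}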

Then I tried the following equivalent expression (equivalent up to byte overflow, which does not occur in the test case):

i = (a[i] + a[i])/2;

This WORKS.
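In the shared-memory sketch from above, the working variant changes only the read (again a sketch with the same placeholder names):

typedef unsigned char byte;

// Same shape as the shared-memory sketch above; only the read is rewritten.
__global__ void lookup_smem_workaround(const byte *src, byte *dst)
{
    __shared__ byte a[256];

    a[threadIdx.x] = src[threadIdx.x];
    __syncthreads();

    byte i = (byte)threadIdx.x;
    // For byte values this is identical to a[i], but it compiles to the single
    // load followed by an add and a shift, and the correct value comes back.
    i = (byte)((a[i] + a[i]) / 2);
    dst[threadIdx.x] = i;
}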

I’ve looked at the generated .ptx source; the code for the first case is similar to:

ld.shared %r124, [a + %r123]

The code for the second case is:

ld.shared %r124, [a + %r123]

add.u8 %r125, %r124, %r124

shr.u8 %r126, %r125, 1

The fetch is still done only once, yet the received value is correct. However, this workaround is not 100% foolproof and sometimes fails as well.

Also, if smem is used and the code is augmented like this:

byte i;

// no shared memory reads/writes done here

...

__syncthreads();

i = a[i];

return i;

the code works (though, again, not in all cases). Moving the __syncthreads() to the beginning of the function, as in:

__syncthreads();

byte i;

...

i = a[i];

brings the failure back.

I’ve looked at the generated PTX, and the intermediate code seems fine to me. Based on the described behavior, it looks like either a backend bug or some sort of fetch race condition (even in constant/local memory) in the generated code.

Unfortunately, I haven’t been able to coerce ptxas into producing human-readable GPU assembly to verify this assumption, so my question is:

Has anyone encountered this behavior before? Could anything else have caused it? (As noted above, the only place where shared memory is written is during the initial load of the “a” array, and __syncthreads() is called right afterwards.)

This has been a really annoying issue to debug (I’ve lost a full day narrowing down the cause). Any ideas?

Thanks,

Alex