Using surfaces in a stack implementation


I wrote a kernel that uses surfaces to implement a stack; the kernel stores and retrieves its data in push/pop fashion. Every thread accesses only its own unique memory location in the surface, so no thread reads a location that another thread writes to. However, according to the programming guide, this should result in undefined behavior.

I refer to these two chapters:

Programming Guide Read/Write Coherency
The texture and surface memory is cached (see Device Memory Accesses) and within
the same kernel call, the cache is not kept coherent with respect to global memory
writes and surface memory writes, so any texture fetch or surface read to an address
that has been written to via a global write or a surface write in the same kernel call
returns undefined data. In other words, a thread can safely read some texture or surface
memory location only if this memory location has been updated by a previous kernel
call or memory copy, but not if it has been previously updated by the same thread or
another thread from the same kernel call.

Best Practices Guide
9.2. Device Memory Spaces
In the case of texture access, if a texture reference is bound to a linear array in global
memory, then the device code can write to the underlying array. Texture references that
are bound to CUDA arrays can be written to via surface-write operations by binding
a surface to the same underlying CUDA array storage). Reading from a texture while
writing to its underlying global memory array in the same kernel launch should be
avoided because the texture caches are read-only and are not invalidated when the
associated global memory is modified.

My kernel works as expected and I do not get any errors. I would like to know whether it works by accident, or whether there is actually no problem as long as each thread only reads and writes its own unique memory location. Does a write to a surface invalidate that memory location in the cache? And what undefined value could a surface read even return if the write is not asynchronous and no other thread writes to that memory location?

I wrote a small test kernel, and I do indeed get undefined results if one thread writes to all memory locations of the surface and all threads read some part of it. For the case of a unique memory location per thread, however, I could not reproduce this undefined behavior, which seems to contradict the programming guide.
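For reference, the per-thread case described above could be tested roughly as follows. This is only a minimal sketch, not the original kernel: it assumes the surface object API (compute capability 3.0 or later), and the names `stackKernel`, `kThreads` and `kDepth` as well as the sizes are made up for illustration. Each thread pushes values into its own column of a 2D surface and pops them back, counting any value that does not match what it wrote.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // one stack column per thread (assumed size)
constexpr int kDepth   = 16;   // stack slots per thread (assumed size)

// Hypothetical test kernel: every thread pushes kDepth values into its own
// column of the surface and pops them again, counting mismatches.
__global__ void stackKernel(cudaSurfaceObject_t surf, int *mismatches)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= kThreads) return;

    int xBytes = tid * (int)sizeof(int);  // surface x coordinates are in bytes
    int top = 0;

    for (int i = 0; i < kDepth; ++i)      // push
        surf2Dwrite(tid * 1000 + i, surf, xBytes, top++);

    for (int i = kDepth - 1; i >= 0; --i) {  // pop
        int v;
        surf2Dread(&v, surf, xBytes, --top);
        if (v != tid * 1000 + i) atomicAdd(mismatches, 1);
    }
}

int main()
{
    // CUDA array that may be bound to a surface for load/store access.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<int>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, kThreads, kDepth, cudaArraySurfaceLoadStore);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &res);

    int *dMismatches;
    cudaMalloc(&dMismatches, sizeof(int));
    cudaMemset(dMismatches, 0, sizeof(int));

    stackKernel<<<(kThreads + 127) / 128, 128>>>(surf, dMismatches);

    int hMismatches = 0;
    cudaMemcpy(&hMismatches, dMismatches, sizeof(int), cudaMemcpyDeviceToHost);
    printf("mismatches: %d (%s)\n", hMismatches,
           cudaGetErrorString(cudaGetLastError()));

    cudaFree(dMismatches);
    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
    return 0;
}
```

A mismatch count of zero on one device only shows that the problem was not observed there; it does not show that the pattern is defined behavior.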

Best regards,

I seem to remember being able to implement a matrix transpose using surface reads and writes. I was not sure whether I was dabbling in undefined behaviour or not, but it did work, perhaps by coincidence.

Is a surface a better option than using local memory (and relying on the cache)?
Also, you can put the stack in shared memory.
shared memory => GTX 8800

Local memory => (pre-CUDA)
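The shared-memory suggestion above might look roughly like the sketch below. It is an illustration under assumptions, not code from this thread: `sharedStackKernel`, `kBlock` and `kDepth` are hypothetical names, and the stack depth is fixed at compile time. The stack is laid out slot-major, so with a block size that is a multiple of 32 each thread of a warp always hits its own shared-memory bank.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlock  = 128;  // threads per block, multiple of 32 (assumed)
constexpr int kDepth  = 16;   // stack slots per thread (assumed)
constexpr int kBlocks = 2;    // number of blocks in this demo (assumed)

// Hypothetical kernel: each thread owns one column of a shared-memory stack.
__global__ void sharedStackKernel(int *out)
{
    // Slot-major layout: word index = slot * kBlock + lane, so the bank
    // depends only on the lane and accesses within a warp do not conflict.
    __shared__ int stack[kDepth][kBlock];

    int lane = threadIdx.x;
    int top = 0;

    for (int i = 0; i < kDepth; ++i)   // push 1, 2, ..., kDepth
        stack[top++][lane] = i + 1;

    int sum = 0;
    while (top > 0)                    // pop everything and accumulate
        sum += stack[--top][lane];

    out[blockIdx.x * blockDim.x + lane] = sum;
}

int main()
{
    const int n = kBlocks * kBlock;
    int *dOut;
    cudaMalloc(&dOut, n * sizeof(int));

    sharedStackKernel<<<kBlocks, kBlock>>>(dOut);

    static int hOut[kBlocks * kBlock];
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Every thread should have produced 1 + 2 + ... + kDepth.
    const int expected = kDepth * (kDepth + 1) / 2;
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != expected) ++bad;
    printf("threads with wrong sum: %d\n", bad);

    cudaFree(dOut);
    return 0;
}
```

Because every thread only ever touches its own column, no `__syncthreads()` is needed here. The trade-off is that the stack depth times the block size has to fit into the shared memory available per block.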

OK, thank you. I have now tried it with local, shared and global memory. Global memory performed similarly to surfaces, so I will use global memory now…