I have a kernel that performs an atomicCAS on pointers to dynamic memory in global memory (allocated using malloc in kernel).
It seems that atomicCAS writes are not immediately visible to other threads, even though I call __threadfence() right after the atomicCAS.
I’m using Fermi architecture and CUDA 3.2.
I read in this forum that atomic operations force writes out to global memory and bypass caching. Is that true? How reliable is this?
What about the read? For example, in atomicCAS, is the read used to compare the first and second arguments done directly from global memory?
When I launch only one thread per thread block, I see very few failures of the kind that may be caused by this visibility issue.
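For reference, the pattern described above might be sketched roughly as follows. This is only a minimal sketch, not the actual kernel: `Node`, `g_slot`, `publish_once` and `run_publish_demo` are made-up names, it assumes a 64-bit build (so a pointer fits in the `unsigned long long` overload of atomicCAS), and it uses `cudaDeviceSynchronize`, which assumes a toolkit newer than CUDA 3.2.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

struct Node { int value; };  // hypothetical payload type

// Hypothetical shared slot in global memory; NULL means "nothing published yet".
__device__ Node* g_slot = NULL;

__global__ void publish_once(int* winners)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // In-kernel malloc allocates from the device heap (compute capability 2.0+).
    Node* mine = static_cast<Node*>(malloc(sizeof(Node)));
    if (mine == NULL) return;
    mine->value = tid;

    // Order the payload write before the pointer is published.
    __threadfence();

    // Install the pointer only if the slot is still empty.
    // Assumption: 64-bit pointers, so the slot can be treated as unsigned long long.
    unsigned long long old = atomicCAS(
        reinterpret_cast<unsigned long long*>(&g_slot),
        0ULL,
        reinterpret_cast<unsigned long long>(mine));

    if (old == 0ULL) {
        atomicAdd(winners, 1);  // this thread's node is now the published one
    } else {
        free(mine);             // another thread won; release our node
    }
}

// Returns 0 if exactly one thread managed to publish its pointer.
int run_publish_demo()
{
    Node* h_null = NULL;
    int*  d_winners = NULL;
    int   h_winners = -1;

    // Reset the slot so the demo can be run more than once per process.
    if (cudaMemcpyToSymbol(g_slot, &h_null, sizeof(h_null)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_winners, sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d_winners, 0, sizeof(int));

    publish_once<<<4, 64>>>(d_winners);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    cudaMemcpy(&h_winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_winners);
    return h_winners == 1 ? 0 : 1;  // the winning node is deliberately left allocated
}

int main()
{
    int rc = run_publish_demo();
    printf("publish demo %s\n", rc == 0 ? "ok" : "failed");
    return rc;
}
```

Note that the sketch only covers the writer side. How the other threads read `g_slot` afterwards matters just as much for what they end up seeing.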
Another option that brings strong guarantees with it would be to also do the read with an atomic instruction (e.g. atomicCAS(…, 0, 0)). However, that would be complete overkill, as the expensive write part of the atomic operation isn't needed.
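To make the two read variants concrete, here is a small sketch under some assumptions. `load_atomic`, `load_volatile`, `read_both` and `run_read_demo` are hypothetical names, and the value 42 is arbitrary. The volatile version relies on my understanding that a volatile access always compiles to a real memory load rather than reusing a previously loaded value; whether that is sufficient for a particular kernel is something to verify rather than take from this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read through the atomic unit: compare with 0 and swap in 0.
// The stored value never changes, and the old value is returned.
__device__ unsigned long long load_atomic(unsigned long long* addr)
{
    return atomicCAS(addr, 0ULL, 0ULL);
}

// Read through a volatile pointer so the compiler must issue an actual load
// instead of reusing a value it already holds in a register.
__device__ unsigned long long load_volatile(const unsigned long long* addr)
{
    return *reinterpret_cast<const volatile unsigned long long*>(addr);
}

// Single-thread kernel: write a value atomically, then read it back both ways.
__global__ void read_both(unsigned long long* slot, unsigned long long* out)
{
    atomicExch(slot, 42ULL);
    out[0] = load_atomic(slot);
    out[1] = load_volatile(slot);
}

// Returns 0 if both read variants observed the value written by atomicExch.
int run_read_demo()
{
    unsigned long long* d_slot = NULL;
    unsigned long long* d_out  = NULL;
    unsigned long long  h_out[2] = {0ULL, 0ULL};

    if (cudaMalloc(&d_slot, sizeof(unsigned long long)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, 2 * sizeof(unsigned long long)) != cudaSuccess) {
        cudaFree(d_slot);
        return 1;
    }
    cudaMemset(d_slot, 0, sizeof(unsigned long long));
    cudaMemset(d_out, 0, 2 * sizeof(unsigned long long));

    read_both<<<1, 1>>>(d_slot, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_slot);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;
    return (h_out[0] == 42ULL && h_out[1] == 42ULL) ? 0 : 1;
}

int main()
{
    int rc = run_read_demo();
    printf("read demo %s\n", rc == 0 ? "ok" : "failed");
    return rc;
}
```

This single-thread demo only shows that both variants return the stored value; it does not demonstrate cross-block visibility. The design trade-off is the one named above: the atomic read goes through the same path as the atomic write, while the volatile read avoids the cost of an atomic operation.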