Are global memory access requests ordered?


I seem to constantly forget that GPU cache is not CPU cache.

My kernel executes a number of functions sequentially; the functions share some data via shared memory, and data that cannot fit in shared memory via global memory.

I believe subsequent functions manage to read the global-memory data before preceding functions have finished updating it: the first half of the array read by the subsequent function is updated, the latter half is not.

Is this because the kernel uses functions? I presume one cannot expect global memory accesses to be ordered?
Changing the functions to separate kernels, or to child kernels, should solve the matter, should it not?
Would inlining the functions help?
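
For reference, here is a minimal sketch of the pattern I am describing (function and variable names are my own placeholders, not my actual code). Within a single thread the produce/consume order is guaranteed, but because each thread reads elements that other threads wrote, a block-wide barrier is needed between the two phases:

```cuda
__device__ void produce(float *buf, int n)
{
    // Each thread writes a strided share of the global array.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf[i] = i * 2.0f;
}

__device__ void consume(const float *buf, float *out, int n)
{
    // Each thread reads elements that *other* threads wrote above.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = buf[(i + 1) % n];   // reads a neighbour's element
}

__global__ void pipeline(float *buf, float *out, int n)
{
    produce(buf, n);
    __syncthreads();   // orders produce() before consume() for all
                       // threads of the block; without it, consume()
                       // may see stale data
    consume(buf, out, n);
}
```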

Within a CUDA thread, you should see a consistent view of memory, regardless of caching. This makes me suspect a race condition between threads accessing the same memory locations. Is one thread reading a memory location that is written to by another thread?

A block of threads in the preceding function processes data and writes it to an array in global memory.
The threads of the subsequent function then need to read that array to do their part.
I think I may have taken the snapshot of the array prior to the __syncthreads() barrier whilst debugging.
At least I now know that it must be a bug rather than a fundamental design flaw.
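
One caveat worth noting here: __syncthreads() only synchronises threads within one block. If the readers were ever in a different block than the writers, the usual fix is to split the phases into two kernels, since launches on the same stream execute in order and global-memory writes of one kernel are visible to the next. A hedged host-side sketch (kernel names hypothetical):

```cuda
// Two launches on the same (default) stream: consume_kernel will not
// start until produce_kernel has completed, so its global-memory
// writes are visible grid-wide.
produce_kernel<<<numBlocks, threadsPerBlock>>>(d_buf, n);
consume_kernel<<<numBlocks, threadsPerBlock>>>(d_buf, d_out, n);
```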

Thanks, seibert