I seem to constantly forget that GPU cache is not CPU cache.
My kernel executes a number of functions sequentially. The functions share some data via shared memory and, for the data that cannot fit in shared memory, via global memory.
I believe subsequent functions manage to read the global memory data before preceding functions have finished updating it: the first half of the data array read by the subsequent function is updated, the second half is not.
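To make the pattern concrete, here is a minimal sketch, not my actual code. The names `produce`, `consume`, `pipeline` and `run_pipeline` are hypothetical, and the sketch assumes a single thread block. As far as I understand, the `__syncthreads()` between the two calls is what makes one thread's global memory writes visible to the other threads of the same block; without it I would expect exactly the partially updated reads described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1024;       // elements in the global-memory array
constexpr int THREADS = 256;  // single block: an assumption of this sketch

// Phase 1: each thread updates a strided slice of the global array.
__device__ void produce(int *data, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] = 2 * i;
}

// Phase 2: each thread reads elements written by a different thread.
__device__ void consume(const int *data, int *out, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = data[n - 1 - i];
}

__global__ void pipeline(int *data, int *out, int n) {
    produce(data, n);
    // Block-wide barrier: every thread of the block finishes produce(),
    // and its writes become visible to the block, before consume() starts.
    // It does NOT synchronize threads in other blocks.
    __syncthreads();
    consume(data, out, n);
}

// Runs the kernel and returns the number of wrong elements (-1 on CUDA error).
int run_pipeline() {
    static int host[N];
    int *data = nullptr, *out = nullptr;
    if (cudaMalloc(&data, N * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, N * sizeof(int)) != cudaSuccess) {
        cudaFree(data);
        return -1;
    }
    pipeline<<<1, THREADS>>>(data, out, N);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host, out, N * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(out);
    if (err != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (host[i] != 2 * (N - 1 - i)) ++mismatches;
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", run_pipeline());
    return 0;
}
```

With more than one block this barrier is not enough, since `__syncthreads()` only covers the threads of a single block.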
Is this because the kernel uses functions? I presume one cannot expect global memory accesses to be ordered?
Changing the functions to kernels or child kernels should solve the matter, should it not?
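The kernel-split variant I have in mind would look roughly like the sketch below. The names `produce_kernel`, `consume_kernel` and `run_split` are hypothetical. It relies on the fact that kernels launched into the same stream (here the default stream) run in launch order, so the second kernel starts only after the first has completed and its global memory writes are visible, even across blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1 << 16;    // elements in the global-memory array
constexpr int THREADS = 256;  // threads per block

// Phase 1 as its own kernel: one element per thread, many blocks.
__global__ void produce_kernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2 * i;
}

// Phase 2 as its own kernel: reads elements written by other blocks.
__global__ void consume_kernel(const int *data, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[n - 1 - i];
}

// Runs both kernels and returns the number of wrong elements (-1 on CUDA error).
int run_split() {
    static int host[N];
    int *data = nullptr, *out = nullptr;
    if (cudaMalloc(&data, N * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, N * sizeof(int)) != cudaSuccess) {
        cudaFree(data);
        return -1;
    }
    int blocks = (N + THREADS - 1) / THREADS;
    // Same stream: consume_kernel does not start until produce_kernel is done.
    produce_kernel<<<blocks, THREADS>>>(data, N);
    consume_kernel<<<blocks, THREADS>>>(data, out, N);
    cudaError_t err = cudaMemcpy(host, out, N * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(out);
    if (err != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (host[i] != 2 * (N - 1 - i)) ++mismatches;
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", run_split());
    return 0;
}
```

The cost is that data held in shared memory does not survive between the two launches, so anything the functions currently share that way would have to go through global memory as well.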
Would inlining the functions help?