I’m currently working on kernels with a complex global memory load and store structure. These kernels also contain several code blocks separated by `__syncthreads()` barriers, which use shared memory to exchange interim results within each thread block. However, the compiler apparently does not move global memory accesses across `__syncthreads()` barriers. This enforces an ordering of those global memory accesses that my kernels do not actually require. Unfortunately, optimizing performance by manually moving the global memory accesses across the barriers is a cumbersome task, so I’d really like to hand it over to the compiler. Is there any solution to my problem in CUDA?
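For context, the structure looks roughly like this (a simplified, made-up sketch, not my actual kernel; all names are hypothetical):

```cuda
// Hypothetical kernel illustrating the pattern: a global load sits
// after a __syncthreads() barrier even though it does not depend on
// the shared-memory data exchanged at that barrier.
__global__ void kernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];   // stage data in shared memory
    __syncthreads();             // barrier 1

    float interim = tile[(threadIdx.x + 1) % blockDim.x];

    // This load is independent of the shared-memory exchange above,
    // so in principle it could be hoisted before barrier 1 to overlap
    // its latency with the barrier wait -- but the compiler leaves it
    // here, and I currently have to move it by hand.
    float extra = in[i + n];

    __syncthreads();             // barrier 2
    out[i] = interim + extra;
}
```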