Avoiding global memory ordering by __syncthreads

Hi folks!

I’m currently working on kernels with a complex pattern of global memory loads and stores. These kernels also use several __syncthreads() barriers together with shared memory to share interim results within each thread block. However, the compiler apparently does not move global memory accesses across a __syncthreads() barrier, i.e. from one barrier-delimited section of the kernel to another. This enforces an ordering of those global memory accesses that my kernels do not require. Optimizing the performance by hand, by moving the global memory accesses between the barrier-delimited sections myself, is cumbersome, so I’d really like to leave this task to the compiler. Is there a way to do this in CUDA?

Regards Nai

You can try using bar.sync from inline assembly without declaring the memory clobber. The difficult part will then be to teach the compiler not to reorder the shared memory accesses around those bits of inline assembly. Declaring the shared memory variables as volatile should achieve that, but may negate any performance gained from the global memory reordering.
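
A minimal sketch of that suggestion, under some assumptions: the names barrier_no_clobber, rotate_add, run_demo and kBlock are made up for illustration, and the kernel is a toy rather than anything from the original question. The barrier is emitted as PTX bar.sync through inline assembly without a "memory" clobber, and the shared array is declared volatile so the compiler should keep the shared accesses on the correct side of the barrier. Whether the global load of b actually gets moved across the barrier is up to the compiler, and ptxas may still order memory accesses around bar.sync when it generates the final machine code, so it is worth checking the generated code and profiling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlock = 128;  // hypothetical block size for this sketch

// Block-wide barrier via inline PTX. "volatile" keeps the asm statement from
// being deleted, but there is deliberately NO "memory" clobber, so the
// compiler is not told that the barrier touches memory.
__device__ __forceinline__ void barrier_no_clobber() {
    asm volatile("bar.sync 0;");
}

// Toy kernel: each thread publishes a[g] in shared memory, reads its
// neighbour's value after the barrier, and adds an unrelated global value b[g].
__global__ void rotate_add(const float* __restrict__ a,
                           const float* __restrict__ b,
                           float* __restrict__ out) {
    // volatile: intended to stop the compiler from reordering these shared
    // accesses around the clobber-free barrier (assumption, per the reply).
    __shared__ volatile float tile[kBlock];

    const int t = threadIdx.x;
    const int g = blockIdx.x * kBlock + t;

    tile[t] = a[g];
    barrier_no_clobber();
    const float neighbour = tile[(t + 1) % kBlock];

    // This global load does not depend on the barrier; without the memory
    // clobber the compiler is allowed to schedule it earlier.
    out[g] = neighbour + b[g];
}

// Host-side check: runs the kernel on two blocks and verifies the result.
bool run_demo() {
    const int n = 2 * kBlock;
    float ha[n], hb[n], hout[n];
    for (int i = 0; i < n; ++i) { ha[i] = float(i); hb[i] = float(2 * i); }

    float *da = nullptr, *db = nullptr, *dout = nullptr;
    if (cudaMalloc(&da, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&db, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dout, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    rotate_add<<<n / kBlock, kBlock>>>(da, db, dout);
    bool ok = (cudaGetLastError() == cudaSuccess);
    // cudaMemcpy waits for the kernel to finish before copying.
    ok = ok && (cudaMemcpy(hout, dout, n * sizeof(float),
                           cudaMemcpyDeviceToHost) == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) {
        const int block = i / kBlock, t = i % kBlock;
        const float expected = ha[block * kBlock + (t + 1) % kBlock] + hb[i];
        if (hout[i] != expected) ok = false;
    }
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return ok;
}
```

Calling run_demo() from main and comparing the timing (and the generated code) against the same kernel with a plain __syncthreads() would show whether the trick buys anything for a given access pattern; the volatile shared accesses may well eat the gain, as noted above.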