Thanks for the reply. I think I found the answer to this question in the thread L1 Cache, L2 Cache and Shared memory in Fermi - #3 by hyqneuron, in the reply from seibert. To quote:
- You can force the L1 cache to flush back up the memory hierarchy using the appropriate __threadfence*() function. __threadfence_block() requires that all previous writes have been flushed to shared memory and/or the L1. __threadfence() additionally forces global memory writes to be visible to all blocks, and so must flush writes up to the L2. Finally, __threadfence_system() flushes up to the host level for mapped memory.
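For anyone who wants to see this in practice, here is a minimal sketch of the usual pattern where __threadfence() matters: each block writes a partial result to global memory, issues the fence, and then bumps a counter, so that the last block to finish can safely read everyone's partial results. The names sum_partials, blocks_done and run_sum are my own for illustration, not from the quoted post, and the sketch assumes a single CUDA-capable device and skips error checking. It relies only on the ordering and visibility behavior of the fence; how that maps onto actual cache flushes may vary by architecture.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical counter: how many blocks have published their partial sum.
__device__ unsigned int blocks_done = 0;

// Each block sums its slice of `in`, publishes the result in `partial`,
// and the last block to finish adds up all partial sums into `out`.
__global__ void sum_partials(const int *in, volatile int *partial, int *out, int n)
{
    __shared__ int block_sum;
    __shared__ bool is_last;

    if (threadIdx.x == 0) block_sum = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&block_sum, in[i]);
    __syncthreads();

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = block_sum;
        // Order the write above before the counter update below, as seen
        // by every other block on the device.
        __threadfence();
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        is_last = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (is_last && threadIdx.x == 0) {
        int total = 0;
        // `partial` is volatile, so these reads are not served from a stale cache line.
        for (unsigned int b = 0; b < gridDim.x; ++b) total += partial[b];
        *out = total;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Host helper (hypothetical): sums n ones on the GPU and returns the result.
int run_sum(int n)
{
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    std::vector<int> h_in(n, 1);
    int *d_in, *d_partial, *d_out, result = 0;

    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    sum_partials<<<blocks, threads>>>(d_in, d_partial, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

Without the __threadfence() call, nothing guarantees that another block sees the write to partial before it sees the counter increment, so the last block could read a stale partial sum. __threadfence_block() would not be enough here, since the reader is in a different block.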
Hope this helps anyone else who is interested in this question.