Why do CUDA memory fences behave differently at different scopes?

The CUDA spec appears to define different behavior for the memory fence at each scope.

Is that correct? And why?

Let’s define some cases.

  • (WW case) Write0 → Fence → Write1
  • (RW case) Read0 → Fence → Write1
  • (RR case) Read0 → Fence → Read1
  • (WR case) Write0 → Fence → Read1

__threadfence_block forbiddens

  • WW Case: Write0 is reordered after Write1
  • RR Case: Read0 is reordered after Read1

void __threadfence(); forbiddens

  • WW Case: Write1 is reordered before Write0

void __threadfence_system(); forbiddens

  • WW Case: Write1 is reordered before Write0

From the CUDA spec:

void __threadfence_block();

  • All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to __threadfence_block();
  • All reads from all memory made by the calling thread before the call to __threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to __threadfence_block().

void __threadfence();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_device) and ensures that no writes to all memory made by the calling thread after the call to __threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to __threadfence().

void __threadfence_system();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_system) and ensures that all writes to all memory made by the calling thread before the call to __threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to __threadfence_system().

I assume by “forbiddens” you meant “prevents”.

Yes, the difference is correct. The descriptions of these APIs have differed slightly for quite some time: the __threadfence_block() description includes an additional statement about the ordering of reads that is not given for the other scopes. I don’t know why.
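For what it's worth, here is a minimal sketch of the WW case at device scope, in case a concrete example helps. Each block writes a payload (Write0), calls __threadfence(), and then signals with an atomic increment (Write1). The last block to signal reads every payload. The names publish_then_sum, run_publish_then_sum, done_count and kBlocks are made up for this illustration and do not come from the spec. The extra reader-side fence is my own conservative addition, so treat it as an assumption rather than a requirement.

```cuda
#include <cuda_runtime.h>

constexpr int kBlocks = 4;  // hypothetical grid size for this sketch

// Number of blocks that have published their payload so far.
__device__ unsigned int done_count = 0;

// Each block publishes one value; the last block to finish sums them all.
__global__ void publish_then_sum(volatile int* partial, int* total) {
    if (threadIdx.x != 0) return;  // one thread per block does the work

    // Write0: the payload.
    partial[blockIdx.x] = static_cast<int>(blockIdx.x) + 1;

    // Device-scope fence: no other thread in the device may observe
    // Write1 (below) as happening before Write0 (above).
    __threadfence();

    // Write1: the signal, an atomic increment that returns the old count.
    unsigned int ticket = atomicAdd(&done_count, 1u);

    // The block holding the last ticket knows that every other block has
    // already passed its fence and signalled.
    if (ticket == gridDim.x - 1) {
        // Reader-side fence (assumption: added conservatively so the
        // payload reads are ordered after the ticket read).
        __threadfence();

        int sum = 0;
        for (unsigned int i = 0; i < gridDim.x; ++i) sum += partial[i];
        *total = sum;
        done_count = 0;  // reset so the kernel can be launched again
    }
}

// Hypothetical host helper: returns the sum, or -1 on a CUDA error.
int run_publish_then_sum() {
    int* partial = nullptr;
    int* total = nullptr;
    int result = -1;

    if (cudaMalloc(&partial, kBlocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&total, sizeof(int)) != cudaSuccess) {
        cudaFree(partial);
        return -1;
    }

    publish_then_sum<<<kBlocks, 32>>>(partial, total);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (cudaMemcpy(&result, total, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        result = -1;
    }

    cudaFree(partial);
    cudaFree(total);
    return result;
}
```

With kBlocks set to 4 the payloads are 1, 2, 3 and 4, so run_publish_then_sum() should return 10 on a working device. Without the first __threadfence(), the last block could in principle see the counter update before one of the payload writes, which is exactly the WW reordering the fence prevents. Swapping in __threadfence_block() would not be enough here, because the readers are in other blocks.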