__threadfence_block() vs __threadfence()?

There are two different memory fence functions, __threadfence_block() and __threadfence(). I am confused about the difference between them when they fence global memory operations.
IMHO, when reading from or writing to global memory, __threadfence_block() and __threadfence() just guarantee consistency for the current thread (the thread that calls these fence functions). They only constrain the order in which the current thread’s memory operations become observable to other threads (not just those within the same block), and they never impact the execution of other threads. Is that right? If so, the functionality should be the same for __threadfence_block() and __threadfence() when the memory operations all target global memory.
If not, is there an example that shows the difference, i.e. some scenario where __threadfence_block() fails but __threadfence() works? Thanks

__threadfence() acts as __threadfence_block() for all threads in the block of the calling thread (CUDA C++ Programming Guide, page 134)

Also curious about this question; could you explain a little bit more? I don’t understand what “visible to other threads” means. From the CUDA programming guide, it seems the explanation and example only illustrate that these functions ensure the ordering of memory instructions.

I don’t understand what “visible to other threads” means.

That seems to be the first usage of those words in this thread.

From here, the guarantees provided are as follows (for __threadfence_block()):

  • All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to __threadfence_block();
  • All reads from all memory made by the calling thread before the call to __threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to __threadfence_block().
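
For example, within one block that guarantee is enough for one thread to publish data to another through a flag. Here is a minimal sketch of that pattern, assuming one producer and one consumer thread in the same block (the kernel and variable names are my own illustration, not from the guide):

__global__ void intra_block_publish(int *out)
{
  __shared__ volatile int data;
  __shared__ volatile int flag;

  if (threadIdx.x == 0) { data = 0; flag = 0; }
  __syncthreads();

  if (threadIdx.x == 0) {
    data = 42;              // payload write
    __threadfence_block();  // threads in this block that observe the flag
                            // write below must also observe the payload write
    flag = 1;               // publish
  }
  if (threadIdx.x == 1) {
    while (flag == 0) { }   // spin until the flag write becomes visible
    out[0] = data;          // by the ordering guarantee, this reads 42
  }
}

(volatile keeps the compiler from caching or reordering these accesses; the fence is what provides the hardware-level ordering.)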

Note later in that section it explicitly states:

Memory fence functions only affect the ordering of memory operations by a thread; they do not, by themselves, ensure that these memory operations are visible to other threads

Thanks for the reply. Sorry, I referenced the wrong words; I meant that I don’t understand “are observed by all threads in the block” from the CUDA programming guide. But from your answer, it seems that’s just a figure of speech for the behavior and there is no real “observing” process in the hardware, right?

So the only difference with __threadfence_block() is that it also guarantees the ordering of reads? But what about __threadfence_system()? It seems that, if a memory fence only affects the ordering of memory operations, then __threadfence_system() and __threadfence() should functionally be the same thing?

Suppose that I have a thread in a threadblock, and there is a global memory location d identified to the kernel as follows. Let’s also assume that before the kernel launch, the global memory location was initialized to zero:

__global__ void k(int *d){
  if ((blockIdx.x == 0) && (threadIdx.x == 0)) {
    *d = 2;
    __threadfence();
    *d = 3;
    __threadfence();
    *d = 4;
  }
  ...

For simplicity, let’s also assume that the updates to location d indicated above are the only instances of kernel code for kernel k writing to location d. Let’s also assume that no other code anywhere in the system is writing to location d while kernel k is running.

First of all, the statement in the programming guide quoted above (“Memory fence functions only affect the ordering of memory operations by a thread; they do not, by themselves, ensure that these memory operations are visible to other threads”) means that just by studying this code, we cannot determine when the updates of 2, 3, or 4 will be visible to other threads.

But what the “guarantee” does communicate is that if I have another thread (other than the one indicated by block 0, thread 0), and if that other thread reads a value of 3 from that location, there is no possibility that it will, at a later time, read the value of 2 from that location.

That is an example of the ordering guarantee provided by __threadfence().
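
To make that concrete, here is a hypothetical observer loop (my own sketch, not part of the original example) that could run in another thread and poll location d; with the fences in place, the values it reads can only move forward through 0, 2, 3, 4:

int prev = *(volatile int *)d;   // volatile cast forces a real read
for (int i = 0; i < 100; i++) {
  int cur = *(volatile int *)d;
  // cur >= prev always holds: once 3 is read, 2 can never be read later
  prev = cur;
}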

When we have __threadfence() by itself, the scope of that guarantee is device wide, covering all CUDA device threads. When we have __threadfence_block(), the guarantee extends only to threads in the same block as the writing thread (block 0 in this example). When we have __threadfence_system(), the scope is different from either of the above two scopes, as discussed in the programming guide.
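
That scope difference is exactly the scenario asked about at the top of the thread: if the observing thread is in a different block, __threadfence_block() gives no guarantee and __threadfence() is needed. A minimal sketch with illustrative names, assuming both blocks are resident on the device at the same time (otherwise the spin loop could hang), and relying on __device__ globals being zero-initialized:

__device__ volatile int payload;
__device__ volatile int ready;

__global__ void cross_block(int *out)
{
  if ((blockIdx.x == 0) && (threadIdx.x == 0)) {
    payload = 42;
    __threadfence();        // device-wide ordering; __threadfence_block()
                            // here would only order these writes for
                            // observers inside block 0
    ready = 1;
  }
  if ((blockIdx.x == 1) && (threadIdx.x == 0)) {
    while (ready == 0) { }  // a thread in another block polls the flag
    out[0] = payload;       // ordered to read 42 once ready == 1
  }
}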

I’m not going to respond to further requests to suggest equality between these. They are not the same guarantee, and I’ve indicated as clearly as I know how what the nature of the guarantee is.

If you want to continue to state that __threadfence_system() and __threadfence() should functionally be the same thing, you can do that, but I don’t agree with that statement.

Good luck!


Thanks for the reply. Anyway, I think I found the answer to this question in L1 Cache, L2 Cache and Shared memory in Fermi - #3 by hyqneuron, under the reply from seibert. To quote:

  • You can force the L1 cache to flush back up the memory hierarchy using the appropriate __threadfence*() function. __threadfence_block() requires that all previous writes have been flushed to shared memory and/or the L1. __threadfence() additionally forces global memory writes to be visible to all blocks, and so must flush writes up to the L2. Finally, __threadfence_system() flushes up to the host level for mapped memory.
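
For what it’s worth, here is the same publish pattern at system scope, as a sketch of my own (not from that thread): it uses mapped/pinned host memory from cudaHostAlloc, and assumes unified addressing so the host pointers can be passed straight to the kernel:

// Device side: write the data, then publish a flag the host can poll.
__global__ void notify_host(volatile int *data, volatile int *flag)
{
  *data = 42;
  __threadfence_system();  // orders the data write before the flag write
                           // as seen by the host (and peer devices)
  *flag = 1;
}

// Host side (sketch):
//   int *h_data, *h_flag;
//   cudaHostAlloc((void **)&h_data, sizeof(int), cudaHostAllocMapped);
//   cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
//   *h_data = 0; *h_flag = 0;
//   notify_host<<<1, 1>>>(h_data, h_flag);
//   while (*(volatile int *)h_flag == 0) { }  // once the flag flips,
//   assert(*h_data == 42);                    // the data is visible too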

Hope it helps other people who are interested in this question.