Is __threadfence() useful at all?

Even if we make the __threadfence() call, aren't we just asking for the values in shared memory to be made available globally?
Does that mean that if a thread calls __threadfence() after some operations, its execution will halt until the data is written back to global memory?
If so, isn't that providing control synchronization?
Also, could anyone point me to a reference on how threadfence is implemented in assembly?
Thank you.

threadfence primarily affects the ordering of visibility. It does not directly affect whether a change is visible or not. Therefore, no, it does not halt execution until something becomes visible.

Ordering means that, if a thread:

  • makes a change A to a global or shared item
  • executes a threadfence
  • makes another change B to that global or shared item

then there is no possibility that another thread will be able to read that item and observe B, and then later read that item and observe A.
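That single-item guarantee can be sketched as follows. This is a minimal illustration under assumed names (`x`, `writer`, `reader` are made up for this sketch), not code from any sample:

```cuda
__device__ int x = 0;

__global__ void writer()
{
    x = 1;            // change A
    __threadfence();  // order A before B for all observers
    x = 2;            // change B
}

__global__ void reader(int *out)
{
    // A thread that reads x twice may observe (1 then 1), (1 then 2),
    // (2 then 2), and so on, but never 2 followed by 1.
    // (Whether either value is visible at all is the separate
    // "visibility" topic; these reads might need volatile or atomics
    // in practice to avoid stale caching.)
    out[0] = x;
    out[1] = x;
}
```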

I can assure you that guarantee is useful in some cases. A typical example is the threadfenceReduction CUDA sample code.

I’ve explicitly tried to steer clear of visibility itself. That is a more complicated topic.

Note that this isn’t a thorough or exhaustive description of threadfence. For that I refer you to the programming guide already linked.

The verbiage might be a little different depending on which case you have in view. For example, the threadfenceReduction sample code uses threadfence to order access/visibility to separate locations in global memory. That guarantee would be worded somewhat differently.
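A simplified sketch of that separate-locations pattern, in the spirit of the threadfenceReduction sample (this is not the actual sample code; `retirementCount`, `partial`, and the kernel name are illustrative):

```cuda
__device__ unsigned int retirementCount = 0;

__global__ void sumKernel(const float *in, float *partial, float *out, int n)
{
    extern __shared__ float sdata[];
    __shared__ bool amLast;

    // Each block reduces its slice of the input into sdata[0].
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];
    sdata[threadIdx.x] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = sdata[0];   // change A: publish partial sum
        __threadfence();                  // order A before the increment below
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        amLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) {
        // The fence guarantees that any block whose increment was
        // observed also has its partial[] write observable, so the
        // final sum over partial[] is complete.
        float total = 0.0f;
        for (int b = 0; b < gridDim.x; ++b)
            total += partial[b];
        *out = total;
        retirementCount = 0;              // reset for the next launch
    }
}
```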


But then why would I bother using a threadfence? Why not just use syncthreads(), or simply declare the variable as volatile?
Is there any example of this being put to a non-trivial use?
Thank you.

I suggest reading the programming guide sections on volatile, threadfence, and syncthreads, before going any further.

Certainly, if you need to use a syncthreads for some other reason, there is no need to use threadfence immediately prior to it. syncthreads already includes the semantics of threadfence, which you can discover by reading the programming guide.

However, if you read the programming guide, you will discover that syncthreads is both a memory barrier (the ordering behavior we have been discussing so far) and an execution barrier. Should you arbitrarily stick an execution barrier in your code if you don’t need it? I wouldn’t.

volatile has the effect of causing any global read or write through that pointer to bypass the L1 cache. Generally speaking this is a global effect: volatile would typically be applied to a pointer passed to the kernel, and it affects every access through that pointer in the kernel. This is a heavy hammer, because it potentially slows down every access to that location. If you only need a barrier (i.e. an ordering guarantee) at a specific point in your code, if it were me, I would not want to penalize performance throughout a kernel to get an ordering guarantee at one specific point. I’m also not sure about the semantics of volatile with respect to memory ordering in the weak memory model. I’m not sure it offers the same guarantee as threadfence, unless perhaps you use it “everywhere”. Again, not what I would do. Do as you wish, of course.

Could you come up with a scheme where you declare a local pointer as volatile just at the point you need it? Perhaps; I don’t know. I’ve never been desperate enough to try to use volatile instead of threadfence. Why not use whichever one expresses the semantics you think you need?
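For what it’s worth, such a scheme might look like the following sketch. Whether the volatile accesses carry the same ordering guarantee as __threadfence() under the weak memory model is exactly the uncertainty mentioned above, so treat this as a hypothetical, not a recommendation:

```cuda
// Hypothetical: apply volatile only at the point of use by creating a
// volatile view of the pointer, instead of declaring the kernel
// parameter volatile everywhere.
__global__ void sketch(int *p)
{
    p[0] = 1;                 // ordinary store, may be cached in L1

    volatile int *vp = p;     // volatile view, scoped to this point
    vp[1] = 2;                // store that bypasses L1
    int observed = vp[1];     // read that bypasses L1
    (void)observed;
}
```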

If the threadfence reduction sample code is trivial, I don’t have any further suggestions. I personally think it elegantly shows a sensible use, and also shows why threadfence is preferred there over other methods.

The semantics of threadfence, syncthreads, and volatile are not all identical or interchangeable. I don’t see a particular need to defend threadfence and am unlikely to mount any further defense. Use whichever seems best to you.

I apologize if my follow-up question seemed rude. That was not my intention. I am new to the forum and appreciate your answers. I am also new to CUDA and parallel programming as a whole and therefore did not understand some parts.