Is this the correct use of __threadfence_block()?

I’ve come across an article that includes pseudocode like this:

for (/* sorted body indexes assigned to me */) {
    depth = 0;
    while (depth >= 0) {
        while (/* there are more nodes to visit */) {
            if (/* I'm the first thread in the warp */) {
                // move on to next node
                // read node data and put in shared memory
            }
            __threadfence_block();
            if (/* node is not null */) {
                // get node data from shared memory
                // use node data
            }
        }
    }
}

I have shortened the code but the main part is there. What puzzles me is the use of __threadfence_block(), which the authors explain as “a block-local memory fence (__threadfence_block()) to prevent data reordering so that the threads in a warp can safely retrieve the data from shared memory.” Let us assume that the kernel is executed by a SINGLE WARP, as indicated. Is __threadfence_block() needed at all (even for theoretical correctness)?

What if the code were executed by several warps from a SINGLE BLOCK? Does __threadfence_block() guarantee correct operation, or should there be a __syncthreads() instead? My worry is what happens when some warps have already advanced past the data-reading stage - or is some warp ordering imposed, so that warp 0 (the one containing thread 0) always stays “in the lead”?

There is another article in which some other authors use __threadfence_block() because, they say, it is required for “intra-warp visibility of any pending writes”. In their code each thread in a warp copies several pieces of data from global to shared memory, calls __threadfence_block(), and then all threads in that same warp go on to operate on the data in shared memory. Is that __threadfence_block() at all necessary?
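In code, the pattern looks roughly like this (the kernel and array names are mine, it is simplified to one element per thread, and I am assuming one warp per block):

__global__ void stage_and_use(const float* in, float* out)
{
    __shared__ float tile[32];
    int lane = threadIdx.x;                      // one warp per block, so this is the lane id

    tile[lane] = in[blockIdx.x * 32 + lane];     // each thread copies global -> shared

    __threadfence_block();                       // the fence the authors insert here

    // every lane then reads an element written by a *different* lane;
    // note that nothing above makes a lane wait for its neighbour's store to finish
    out[blockIdx.x * 32 + lane] = tile[(lane + 1) % 32];
}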

Without seeing the indexing pattern, I’m not sure whether the usage is correct. Since the fence functions don’t provide any synchronization, they generally don’t work in the situations where people try to use them. If you are sending data from one thread to another thread within a block via shared memory, you basically have to use __syncthreads(): the receiving thread has to actually wait at some point until the sending thread has finished its write to shared memory.
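A minimal sketch of what I mean (the kernel name is mine): thread 0 produces a value in shared memory and every other thread in the block consumes it; the barrier is what actually makes the consumers wait for the write.

__global__ void broadcast(const int* in, int* out)
{
    __shared__ int value;

    if (threadIdx.x == 0)
        value = in[blockIdx.x];                  // sending thread writes shared memory

    __syncthreads();                             // every thread waits here until the
                                                 // write has happened and is visible

    out[blockIdx.x * blockDim.x + threadIdx.x] = value;   // receiving threads read it
}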

I can imagine some algorithms using polling that might be satisfied with just a fence to guarantee progress, and maybe your pseudocode falls into that category.
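Roughly the kind of thing I have in mind (a toy example of my own, not from either article, launched with at least 64 threads per block): one thread publishes a payload and raises a flag, and a thread in another warp of the same block spins on the flag. The spinning is what provides the waiting; the fences only order the payload relative to the flag.

__global__ void handoff(const float* in, float* out)
{
    __shared__ float payload;
    __shared__ volatile int ready;

    if (threadIdx.x == 0) ready = 0;
    __syncthreads();

    if (threadIdx.x == 0) {                      // a thread in warp 0 produces
        payload = in[blockIdx.x];                // write the payload ...
        __threadfence_block();                   // ... and order it before the flag
        ready = 1;                               // publish
    } else if (threadIdx.x == 32) {              // a thread in warp 1 consumes
        while (ready == 0) { }                   // poll: this is what makes it wait
        __threadfence_block();                   // order the flag read before the payload read
        out[blockIdx.x] = payload;
    }
}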

According to the Programming Guide, the threadfence functions are used to enforce the visibility of memory writes that could otherwise be observed out of order. Coming back to the last paragraph of my first post: those authors use __threadfence_block() to enforce “intra-warp visibility of pending writes”, which seems completely useless to me (the emphasis being on intra-warp), even if we assume that the threads in a warp are not executed in exact lock-step.

Regarding the given pseudocode the same reasoning applies - in my opinion the fence functions have no effect on the intra-warp visibility of writes, while for inter-warp use (i.e. intra-block or inter-block) they are too weak on their own. I am not sure how a different indexing pattern could make any difference. The example in the Programming Guide is very specific: it uses atomics on a shared flag variable and guarantees that all the data produced by one thread are properly stored before the flag is tested by other threads (in other blocks). I have not yet seen a single other sensible use of the threadfence functions, and the Programming Guide should definitely elaborate on this. In the meantime I suspect there is a huge number of kernels out there that use the threadfence functions improperly, either in cases where they are not needed at all or where __syncthreads() should be used instead.
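From memory, the guide's example boils down to a single-pass reduction roughly like the one below (simplified: the serial loops stand in for the real per-block reduction, and the size of partial[] is made up and must be at least gridDim.x):

__device__ unsigned int count = 0;
__device__ volatile float partial[1024];         // one slot per block; volatile so that
                                                 // the last block reads the stored values

__global__ void sum(const float* in, int n, float* result)
{
    __shared__ bool isLastBlockDone;

    if (threadIdx.x == 0) {
        // each block computes its partial sum (done serially here just to keep it short)
        float s = 0.0f;
        for (int i = blockIdx.x; i < n; i += gridDim.x)
            s += in[i];
        partial[blockIdx.x] = s;                 // store the partial sum ...

        __threadfence();                         // ... and make sure it is stored before
                                                 // the counter below is incremented

        unsigned int ticket = atomicInc(&count, gridDim.x);
        isLastBlockDone = (ticket == gridDim.x - 1);   // true only in the last block to finish
    }
    __syncthreads();                             // so every thread sees isLastBlockDone

    if (isLastBlockDone && threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < gridDim.x; ++i)      // every other block has already fenced
            total += partial[i];                 // and stored its partial sum
        *result = total;
        count = 0;                               // reset for the next launch
    }
}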

Anyway, thanks for your reply. Could you explain in a bit more detail what you mean by a fence “guaranteeing progress” in polling algorithms?

The reply/view ratio of this post suggests that indeed no one understands this fencing stuff (and/or uses it).

That is probably true. I have yet to need a fence instruction.