Is this the correct use of __threadfence_block()?

I’ve come across an article that includes pseudocode like this:

for (/* sorted body indexes assigned to me */) {
    depth = 0;
    while (depth >= 0) {
        while (/* there are more nodes to visit */) {
            if (/* I'm the first thread in the warp */) {
                // move on to next node
                // read node data and put in shared memory
            }
            __threadfence_block();
            if (/* node is not null */) {
                // get node data from shared memory
                // use node data
            }
        }
    }
}

I have shortened the code but the main part is there. What puzzles me is the use of __threadfence_block(), which the authors explain as “a block-local memory fence (__threadfence_block()) to prevent data reordering so that the threads in a warp can safely retrieve the data from shared memory.” Let us assume that the kernel is executed by a SINGLE WARP, as indicated. Is __threadfence_block() needed at all (even for theoretical correctness)?

What if the code were executed by several warps from a SINGLE BLOCK? Does __threadfence_block() guarantee correct operation, or should there be a __syncthreads() instead? My worry is what happens when some warps have already advanced past the data-reading stage - or is some warp ordering imposed, so that warp 0 (the one containing thread 0) always stays “in the lead”?

There is another article in which some other authors use __threadfence_block() because, they say, it is required for “intra-warp visibility of any pending writes”. In their code each thread in a warp copies several pieces of data from global to shared memory, calls __threadfence_block(), and then all threads in that same warp go on to operate on the data in shared memory. Is that __threadfence_block() at all necessary?
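In code, the pattern looks roughly like this (the kernel and array names are mine, it is simplified to one element per thread, and I am assuming one warp per block):

__global__ void stage_and_use(const float* in, float* out)
{
    __shared__ float tile[32];
    int lane = threadIdx.x;                      // one warp per block, so this is the lane id

    tile[lane] = in[blockIdx.x * 32 + lane];     // each thread copies global -> shared

    __threadfence_block();                       // the fence the authors insert here

    // every lane then reads an element written by a *different* lane;
    // note that nothing above makes a lane wait for its neighbour's store to finish
    out[blockIdx.x * 32 + lane] = tile[(lane + 1) % 32];
}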

Without seeing the indexing pattern, I’m not sure whether the usage is correct. Since the fence functions don’t provide any synchronization, they generally don’t work in the situations where people try to use them. If you are sending data from one thread to another thread within a block via shared memory, you basically have to use __syncthreads(): the receiving thread has to actually wait at some point until the sending thread has finished its write to shared memory.
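A minimal sketch of what I mean (the kernel name is mine): thread 0 produces a value in shared memory and every other thread in the block consumes it; the barrier is what actually makes the consumers wait for the write.

__global__ void broadcast(const int* in, int* out)
{
    __shared__ int value;

    if (threadIdx.x == 0)
        value = in[blockIdx.x];                  // sending thread writes shared memory

    __syncthreads();                             // every thread waits here until the
                                                 // write has happened and is visible

    out[blockIdx.x * blockDim.x + threadIdx.x] = value;   // receiving threads read it
}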

I can imagine some algorithms using polling that might be satisfied with just a fence to guarantee progress, and maybe your pseudocode falls into that category.
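Roughly the kind of thing I have in mind (a toy example of my own, not from either article, launched with at least 64 threads per block): one thread publishes a payload and raises a flag, and a thread in another warp of the same block spins on the flag. The spinning is what provides the waiting; the fences only order the payload relative to the flag.

__global__ void handoff(const float* in, float* out)
{
    __shared__ float payload;
    __shared__ volatile int ready;

    if (threadIdx.x == 0) ready = 0;
    __syncthreads();

    if (threadIdx.x == 0) {                      // a thread in warp 0 produces
        payload = in[blockIdx.x];                // write the payload ...
        __threadfence_block();                   // ... and order it before the flag
        ready = 1;                               // publish
    } else if (threadIdx.x == 32) {              // a thread in warp 1 consumes
        while (ready == 0) { }                   // poll: this is what makes it wait
        __threadfence_block();                   // order the flag read before the payload read
        out[blockIdx.x] = payload;
    }
}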

According to the Programming Guide, the threadfence functions are used to enforce the visibility of memory writes that could otherwise be observed out of order. Coming back to the last paragraph of my first post: those authors use __threadfence_block() to enforce “intra-warp visibility of pending writes”, which seems completely useless to me (the emphasis being on intra-warp), even if we assume that the threads in a warp are not executed in exact lock-step.

Regarding the given pseudocode the same reasoning applies - in my opinion the fence functions have no effect on the intra-warp visibility of writes, while for inter-warp use (i.e. intra-block or inter-block) they are too weak on their own. I am not sure how a different indexing pattern could make any difference. The example in the Programming Guide is very specific: it uses atomics on a shared flag variable and guarantees that all the data produced by one thread are properly stored before the flag is tested by other threads (in other blocks). I have not yet seen a single other sensible use of the threadfence functions, and the Programming Guide should definitely elaborate on this. In the meantime I suspect there is a huge number of kernels out there that use the threadfence functions improperly, either in cases where they are not needed at all or where __syncthreads() should be used instead.
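From memory, the guide's example boils down to a single-pass reduction roughly like the one below (simplified: the serial loops stand in for the real per-block reduction, and the size of partial[] is made up and must be at least gridDim.x):

__device__ unsigned int count = 0;
__device__ volatile float partial[1024];         // one slot per block; volatile so that
                                                 // the last block reads the stored values

__global__ void sum(const float* in, int n, float* result)
{
    __shared__ bool isLastBlockDone;

    if (threadIdx.x == 0) {
        // each block computes its partial sum (done serially here just to keep it short)
        float s = 0.0f;
        for (int i = blockIdx.x; i < n; i += gridDim.x)
            s += in[i];
        partial[blockIdx.x] = s;                 // store the partial sum ...

        __threadfence();                         // ... and make sure it is stored before
                                                 // the counter below is incremented

        unsigned int ticket = atomicInc(&count, gridDim.x);
        isLastBlockDone = (ticket == gridDim.x - 1);   // true only in the last block to finish
    }
    __syncthreads();                             // so every thread sees isLastBlockDone

    if (isLastBlockDone && threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < gridDim.x; ++i)      // every other block has already fenced
            total += partial[i];                 // and stored its partial sum
        *result = total;
        count = 0;                               // reset for the next launch
    }
}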

Anyway, thanks for your reply. Could you explain in a bit more detail what you mean by a fence “guaranteeing progress” in polling algorithms?

The reply/view ratio of this post suggests that indeed no one understands this fencing stuff (and/or uses it).

That is probably true. I have yet to need a fence instruction.