I have what looks like a false positive from racecheck, but I’d like to check my understanding.
I have a kernel that I have reduced to run with a single warp in each block (i.e. block dimensions 32x1), and I use shared-memory variables essentially as warp-level variables. If I fence reads and writes using __threadfence_block(), racecheck prints huge numbers of warnings about hazards, but simply replacing these calls with __syncthreads() makes the warnings go away. Other than the racecheck warnings, I don’t see any difference in behaviour between the two approaches.
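For concreteness, here is a minimal sketch of the pattern I mean. It is not my actual kernel: the names warpExchange and runWarpExchange, the USE_SYNCTHREADS macro and the neighbour-exchange logic are all made up for illustration, and I am not claiming the fence-only variant is correct (that is the question).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical reduction of the pattern; each block is a single 32x1 warp.
__global__ void warpExchange(const int *in, int *out)
{
    __shared__ int scratch[32];   // shared memory used as a warp-level variable

    const int lane = threadIdx.x;
    scratch[lane] = in[blockIdx.x * 32 + lane];

#ifdef USE_SYNCTHREADS
    __syncthreads();          // racecheck reports nothing with this
#else
    __threadfence_block();    // racecheck reports hazards with this
#endif

    // Read a value that was written by a different lane of the same warp.
    out[blockIdx.x * 32 + lane] = scratch[(lane + 1) % 32];
}

// Hypothetical host helper: launches the kernel and returns the final status.
cudaError_t runWarpExchange(int blocks)
{
    const int n = blocks * 32;
    int *in = nullptr, *out = nullptr;
    cudaError_t err = cudaMallocManaged(&in, n * sizeof(int));
    if (err != cudaSuccess) return err;
    err = cudaMallocManaged(&out, n * sizeof(int));
    if (err != cudaSuccess) { cudaFree(in); return err; }

    for (int i = 0; i < n; ++i) in[i] = i;

    warpExchange<<<blocks, dim3(32, 1)>>>(in, out);
    err = cudaDeviceSynchronize();   // also surfaces launch/execution errors

    cudaFree(in);
    cudaFree(out);
    return err;
}

int main()
{
    cudaError_t err = runWarpExchange(4);
    printf("status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Building this once as-is and once with -DUSE_SYNCTHREADS, then running each binary under racecheck, is the comparison I am describing.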
My understanding from the docs is that __threadfence_block() is sufficient to ensure that all threads have a consistent view of shared memory after it is called. Is this incorrect, or is this a false positive from racecheck?
(The normal version of this code needs to run with more than one warp per block, with each warp indexing into shared memory by its warp index, so I’d like to check that __threadfence_block() is still sufficient to fence reads and writes among the threads of a single warp. Sadly, the warps diverge, so I cannot use __syncthreads() in this case.)
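A rough sketch of that multi-warp layout, under the same caveats: WARPS_PER_BLOCK, multiWarpExchange and runMultiWarpExchange are hypothetical names, the divergent control flow of the real code is left out, and the fence is shown only because it is the call I am asking about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARPS_PER_BLOCK 4   // assumed value, for illustration only

// Each warp owns one row of shared memory, selected by its warp index.
__global__ void multiWarpExchange(const int *in, int *out)
{
    __shared__ int scratch[WARPS_PER_BLOCK][32];

    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;
    const int gid  = blockIdx.x * blockDim.x + threadIdx.x;

    scratch[warp][lane] = in[gid];

    __threadfence_block();   // the fence whose sufficiency is in question

    // Only lanes of the same warp ever touch this warp's row.
    out[gid] = scratch[warp][(lane + 1) % 32];
}

// Hypothetical host helper: launches the kernel and returns the final status.
cudaError_t runMultiWarpExchange(int blocks)
{
    const int threads = WARPS_PER_BLOCK * 32;
    const int n = blocks * threads;
    int *in = nullptr, *out = nullptr;
    cudaError_t err = cudaMallocManaged(&in, n * sizeof(int));
    if (err != cudaSuccess) return err;
    err = cudaMallocManaged(&out, n * sizeof(int));
    if (err != cudaSuccess) { cudaFree(in); return err; }

    for (int i = 0; i < n; ++i) in[i] = i;

    multiWarpExchange<<<blocks, threads>>>(in, out);
    err = cudaDeviceSynchronize();   // also surfaces launch/execution errors

    cudaFree(in);
    cudaFree(out);
    return err;
}

int main()
{
    cudaError_t err = runMultiWarpExchange(2);
    printf("status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```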