"Unnecessary" synchronization required, but only on some cards?

I have found a situation where a __threadfence_block() call is required, but only on some cards. More confusingly, it is my understanding that the __threadfence_block() call should not be required on any card.

The code listed below has been tested and always works properly on both a 780M and a 1080 (without the call to __threadfence_block()). However, it never works correctly without that call on a 970M (tested with CUDA 7.5 on Ubuntu 15.10).

The kernel performs a modified 2-Nearest-Neighbors query (under the Hamming norm) for 2 sets of bit vectors of length 512 (this is useful in computer vision). The error occurs at the top of a partially unrolled loop. In the loop, a const uint64_t value is read from global memory and XOR’d against several const register uint64_t values, and each result is passed to __popcll(). The results are then packed and reduced with __shfl_xor().

On the 780M (tested on Windows) and 1080 (tested on Linux) this kernel always finds all matches correctly, and never finds any matches on my 970M… unless a __threadfence_block() is added.

This __threadfence_block() may be placed anywhere before the final if statement, even before the read itself. This seems very strange to me. Adding calls to __threadfence_block() before and after the loop does not change this behavior.

What is stranger still is that I added a call to CUDA’s printf() in the kernel, wrapped by an if statement that selected only thread 0 in warp 0 in block 0. This also fixed the problem, for every block, without a __threadfence_block().

The __threadfence_block() call only lowers performance by ~8%, but this is performance critical code in an O(n^2) task meant to be released as a library.

Why is this happening? Is this a bug?

Full project (just call “make run”): https://drive.google.com/file/d/0B40zIKz22S79M1ljd21FZ2Y0eDA/view?usp=sharing (if you see a very small numbers of matches, lower the threshold defined at the start of main.cpp then “make clean; make run”)

__global__ void CUDAMATCH_kernel(const uint64_t* const __restrict__ g_query, const int num_q,
	const uint64_t* const __restrict__ g_training, const int num_t, int* const __restrict__ g_match, const int threshold) {

	register uint32_t offset = ((threadIdx.x & 24) << 3) + (threadIdx.x & 7) + (blockIdx.x << 11) + (threadIdx.y << 8);
	const register uint64_t qa = g_query[offset     ];
	const register uint64_t qb = g_query[offset +  8];
	const register uint64_t qc = g_query[offset + 16];
	const register uint64_t qd = g_query[offset + 24];
	const register uint64_t qe = g_query[offset + 32];
	const register uint64_t qf = g_query[offset + 40];
	const register uint64_t qg = g_query[offset + 48];
	const register uint64_t qh = g_query[offset + 56];

	register int best_i = -1;
	register int best_v   = 100000;
	register int second_v = 200000;

	#pragma unroll 7
	for (int t = 0; t < num_t; ++t) {
		// __threadfence_block(); // Adding this call before the read fixes things.
		const register uint64_t train = g_training[(t << 3) + (threadIdx.x & 7)];
		register int dist0 = __popcll(qa ^ train) | ((__popcll(qe ^ train)) << 16);
		register int dist1 = __popcll(qb ^ train) | ((__popcll(qf ^ train)) << 16);
		register int dist2 = __popcll(qc ^ train) | ((__popcll(qg ^ train)) << 16);
		register int dist3 = __popcll(qd ^ train) | ((__popcll(qh ^ train)) << 16);

		dist0 += __shfl_xor(dist0, 1);
		dist1 += __shfl_xor(dist1, 1);
		if (threadIdx.x & 1) dist0 = dist1;
		dist2 += __shfl_xor(dist2, 1);
		dist3 += __shfl_xor(dist3, 1);
		if (threadIdx.x & 1) dist2 = dist3;
		dist0 += __shfl_xor(dist0, 2);
		dist2 += __shfl_xor(dist2, 2);
		if (threadIdx.x & 2) dist0 = dist2;
		dist0 = ((dist0 + __shfl_xor(dist0, 4)) >> ((threadIdx.x & 4) << 2)) & (((4 ^ (threadIdx.x & 4)) << 9) - 1);

		if (dist0 < second_v) second_v = dist0;
		// __threadfence_block(); // Can also add it here, or anywhere in between.
		if (dist0 < best_v) {
			second_v = best_v;
			best_v = dist0;
			best_i = t;
		}
	}

	if (second_v - best_v <= threshold) best_i = -1;
	const register int idx = (blockIdx.x << 8) + (threadIdx.y << 5) + threadIdx.x;
	if (idx < num_q) g_match[idx] = best_i;
}

At first glance, this looks like a purely warp-centric kernel, and it looks OK to me.

I also wouldn’t expect you would need a __threadfence_block().

I assume your g_match array doesn’t alias g_query and g_training?

Another thought is that there might be a bug in the compiler. Maybe try a lower optimization level or the CUDA 8 compiler just to verify? Some of the 64-bit ops you’re using (shuffles, popcll, adds, etc.) are going to be multiple instructions so maybe there is more of an opportunity for a bug.

Or try removing the restrict attribs.

One thing that I’ve tried before is compiling to a virtual arch (instead of a specific arch) and see if the driver generates better SASS.

Side remark: Use of the ‘register’ keyword does nothing, it is routinely ignored by all modern compilers (OK, they may complain if you try to take the address of an object declared ‘register’).

As allanmac points out, the first thing to check is whether the programmer is keeping the promise made by use of restrict. Other than that, a typical error scenario occurring in warp synchronous code is to have an implicit inter-thread read-after-write dependency that isn’t visible to the compiler, which has a per-thread view of the world. Without a visible dependency, the compiler could then schedule a read from loop iteration N+1 ahead of the “corresponding” write in loop iteration N, except the compiler has no idea of the correspondence, as it is not expressed in the code (there is no notion of warp synchronous execution in the compiler’s static analysis, that’s purely a run-time execution thing).

In looking at the code, I can’t immediately spot an instance of such a problem, however. One experiment you could perform is to force a fully rolled loop, by use of ‘#pragma unroll 1’ and check whether that makes the problem go away as well. If it does, I would say that is a pretty good indication that there is a read-after-write dependency between loop iterations that is intended but not expressed in the code.

Another thing to try is to run the code under control of cuda-memcheck and turn on race checking. It is really good at finding race conditions that escape source code inspections, i.e. are non obvious.
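For reference, the invocation I have in mind (binary name assumed) would be:

```shell
# Run the app under cuda-memcheck with the racecheck tool enabled
cuda-memcheck --tool racecheck ./match
```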

Lastly, compiler bugs are always possible, but given the maturity of the CUDA toolchain, somewhat unlikely at this stage.

AFAIK the cuda-memcheck racecheck tool (currently) only detects races involving shared memory.


I would take a closer look, but the host code includes AVX intrinsics that are incompatible with some of the hardware I would need to use.

First I wrote a stupid comment recommending optimizing this:

dist0 += __shfl_xor(dist0, 1);
dist1 += __shfl_xor(dist1, 1);
if (threadIdx.x & 1) dist0 = dist1;

to that:

if (threadIdx.x & 1) dist0 = dist1;
dist0 += __shfl_xor(dist0, 1);

and the same for each of the next 3 lines.

Then I deleted it. Then I realized that the compiler is no smarter than me. It can do the same! Look at the SASS code, Luke :D

I am still thinking about optimizations for your kernel…

  1. Line 37 can be replaced with a single byte_permute operation that duplicates either the higher or lower half of the register, depending on x&4 - i.e.

permutation = x&4 ? (3*4+2)*0x11 : (1*4+0)*0x11

dist0 = byte_permute(dist0, dist0, permutation)

On the loop exit, you just right-shift the results by 16 bits.

  2. The entire shuffle machinery can be replaced with storing to shmem and loading the data back in transposed order; then you only need to sum up dist0…3 and select 16 bits depending on x&4. It may have larger delays, but fewer operations, so you should get a speedup if you run enough warps/SM.

Thank you for your prompt and thorough response! Addressing your suggestions: the arrays do not alias, promise. Compiling for a virtual architecture did not cause it to start working, and neither did disabling all optimizations or removing the restrict’s. Compiling with -G, however, did (and caused massive slow-down, of course). I am absolutely dependent upon my laptop for the next few weeks due to real-life concerns, and cannot justify updating my graphics drivers and CUDA install just yet. I will report back when I get a chance to.

If you would like, you can freely comment out all that stuff. It just provides a sanity check. If the number of matches returned by the GPU side doesn’t change it is reasonable to assume it is correct.

That is not an optimization. It is not equivalent. Neither is the other change from your deleted post. Try making the change and recompiling, you will notice they do not pass the check. You can compact the reduction more by conditionally swapping variables first, but it is no faster.

That said, if you manage to find a working implementation which beats the one I put up, do please let me know. The code will see broad use. Note that we have switched to loading query vectors in as a texture, which increased performance by ~6%. It did not fix the (compiler?) bug.

Counter-intuitively, using constants for the bit mask for the big nasty line (0xFF can replace the & portion) is actually slower by >10%. We’re under severe register pressure. However, your suggestion for byte perm did inspire this, which is (statistically-significant) 0.6% faster:

dist0 = __byte_perm(dist0 + __shfl_xor(dist0, 4), 0, threadIdx.x & 4 ? 0x5432 : 0x5410);

Disabling unroll did not work. I know register does nothing; it is just a personal coding style to indicate that if that variable hits local memory, something has gone awry.

Once you are done with the due diligence process, you would probably want to file a bug with NVIDIA. The reporting form is linked from the registered developer website.

In the mean time, as a potential practical workaround, try reducing the backend (PTXAS) optimization level, that will be a bit more fine-grained compared to the giant -G hammer. The compiler default is -Xptxas -O3, so try -O2, then -O1, finally -O0. The benefit of this approach is that all high-level optimizations performed by the NVVM frontend are still enabled, and performance drop should be less dramatic.
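Concretely, that would look something like this (source file and output names assumed):

```shell
# Keep NVVM's high-level optimizations; lower only the PTXAS backend level.
# The compiler default is -Xptxas -O3; step down until the failure disappears.
nvcc -arch=sm_52 -Xptxas -O2 -o match match.cu
nvcc -arch=sm_52 -Xptxas -O1 -o match match.cu
nvcc -arch=sm_52 -Xptxas -O0 -o match match.cu
```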

BTW, the fact that inserting a guarded printf() fixes the problem is not surprising. Function calls with potential side-effects typically prevent (or severely limit) code motion across the call. Since my working hypothesis is that the issue observed is related to undesirable code motion, inserting such a call could easily make the problem disappear, whether the issue is caused by a compiler bug or a violation of the programming model. Even inserting an if-statement per se might limit code motion by splitting a large basic block into two basic blocks.


About a year ago I tracked down a really nasty PTXAS issue that I eventually blamed on back-to-back “if” statements. It seems somewhat similar to your lines 39 and 41. PTXAS was incorrectly compiling an “if(p)” followed by an “if(!p)” into something incorrect. And it only happened on Kepler… so compiler bugs do exist.

The workaround was to change the expression.

I’m going to wildly guess that lines 39 and 41 are being optimized in some unexpected way to avoid the potential double assignment of “second_v”.

I don’t know if this fixes anything but the sm_52 cubin looks different:

#if 0 // original
    if (dist0 < second_v)
      second_v = dist0;

    if (dist0 < best_v) {
      second_v = best_v;
      best_v   = dist0;
      best_i   = t;
    }
#else
    bool const lt_second_v = dist0 < second_v;
    bool const lt_best_v   = dist0 < best_v;

    if (lt_second_v && !lt_best_v) {
        second_v = dist0;
    } else if (lt_best_v) {
        second_v = best_v;
        best_v   = dist0;
        best_i   = t;
    }
#endif

I commented out the include of immintrin.h from match.h
I also added a define:

//#include <immintrin.h>
#define _mm_popcnt_u64(x) __builtin_popcount(x)

I also removed all references to -mavx2 -mfma from the Makefile. That allowed me to build the code.

I also uncommented a return statement at the end of main.cpp:

if (matches.size() != m.matches.size()) {
                std::cout << "MISMATCH!" << std::endl;
                return EXIT_FAILURE;  // uncommented this
}

which got rid of a seg fault in the subsequent code.

With those changes, using the Makefile default sm_30 compilation target, and running the code on a cc3.5 device (GT640) I get:

CUDA: found 16 matches in 77.5035 ms
Throughput: 3.04412 billion comparisons/second.

Brute: found 0 matches in 6547.95 ms
Throughput: 0.036031 billion comparisons/second.


Modifying the Makefile to compile for sm_52 and running on a cc5.2 device (GTX960) I get:

CUDA: found 16 matches in 18.0233 ms
Throughput: 13.0902 billion comparisons/second.

Brute: found 0 matches in 10096.2 ms
Throughput: 0.0233682 billion comparisons/second.


This is all on Fedora 20, CUDA 7.5. In both cases I’ve run the codes several times, I always get 16 or 17 matches reported on the GPU, none on the CPU (probably my CPU code mod is invalid). I’ve also run both tests with cuda-memcheck; no errors are reported. So at the moment I’m unable to reproduce this observation:

“and never finds any matches on my 970M”

So there may be some other platform difference that is important, or else it may be specific to 970M.

What GPU driver are you using in that failing (970M) case?
Also, when you run the code in the failing case, have you used the Makefile you provided, which has the GPU arch target set to sm_30, or are you modifying the Makefile like I did for sm_52?

EDIT: Forgot to mention that in the Makefile I also changed -std=gnu++14 to -std=gnu++11.

Also, my code mod was boneheaded. I switched the define to:

#define _mm_popcnt_u64(x) __builtin_popcountll(x)

and now the cpu and gpu results match.

CUDA: found 16 matches in 77.5164 ms
Throughput: 3.04361 billion comparisons/second.

Brute: found 16 matches in 6486.51 ms
Throughput: 0.0363723 billion comparisons/second.


I took a look at the SASS generated for sm_30 and sm_50 and haven’t found anything yet that looks out of place regarding the if-statements highlighted by allanmac. It took me a few minutes to get my bearings because NVVM seems to translate

if (dist0 < second_v) second_v = dist0;

into

second_v = min(dist0, second_v);

Likewise, it translates

if (dist0 < best_v) {
    best_v = dist0;
}

into

best_v = min(dist0, best_v);

Ah, that’s even better.

So you could also write it this way:

    second_v = min(dist0, second_v);
    if (dist0 < best_v) {
      second_v = best_v;
      best_i   = t;
    }

    best_v = min(dist0, best_v);

Since the CUDA compiler already optimizes the code into essentially this form automagically, I would advocate writing the source code in the most natural manner.