If I change a logical AND to anything else (XOR, ADD) I get millions of times more performance... WH...

My CUDA program on a GTX 1060 is running far slower than my 10-year-old 4-core CPU, and it really shouldn’t even be close! I have discovered that my kernel does not seem to like a single logical AND in it! I cannot fathom why!

//A1 & FL are UL bitmasks
//const uint32_t r = (((uint32_t)t) & A1) ^ (uint32_t)(t >> FL);
const uint32_t r = (((uint32_t)t) ^ A1) ^ (uint32_t)(t >> FL);

The commented-out code is what I need to have happen. But it’s millions of times slower than anything else. I can even change out the XOR for an ADD, or an OR. But as soon as it’s any kind of AND, I get extremely slow (software-emulated?) performance. I even removed “& A1” and went with just “& 1” and it was still slow.

First off, we are talking about a release build here, correct? Are you building code for the correct compute capability? You would want to use the CUDA profiler to identify performance bottlenecks in the code. Also examine your launch configuration: you may not be exposing enough parallelism. A rough first target for a GTX 1060 would be around 10,000 threads executing in parallel.

The amount of code shown is insufficient to make a diagnosis about switching around logical operations. Like all modern compilers, the CUDA compiler aggressively optimizes logical expressions. It’s possible that by changing the AND to an XOR the compiler determines that the expression always evaluates to some fixed value, and then propagates that further through the code, eliminating vast portions of the code through dead code elimination. That would make the kernel execute very quickly.

You would want to look at the generated machine code (SASS) with cuobjdump --dump-sass to get a feel for what happens to the code based on your changes.

I doubt that FL is a bit mask as stated, as it is used as a shift factor in the code. Given that ‘r’ is a uint32_t, is it necessary for A1 and FL to be UL? Is that UL as in “unsigned long”? If so, don’t use that, as the bit width of that type differs across platforms. Use uint64_t or uint32_t as appropriate.

As opposed to most modern CPUs, which are 64-bit processors, GPUs are essentially 32-bit processors with 64-bit addressing capabilities. As a consequence, 64-bit integer operations are always emulated. These emulations are usually efficient: 64-bit logical operations are simply split into two 32-bit logical operations.


Hey, thanks for your reply. I’ve pasted the code here, it’s about a page long:


Do you know how to configure VS 2017 Community to output the assembly (is the “asm” called PTX, or is it called cubin?) to a folder? Then I could see whether having the logical AND is causing some sort of weird compilation issue.

If I run “cuobjdump --dump-sass” on my obj I get:

“cuobjdump error : ‘nvdisasm’ died with status 0xC0000005 (ACCESS_VIOLATION)”

Take your executable (not object file) and run it through cuobjdump --dump-sass. That will give you the disassembled GPU machine code. You can also dump the intermediate PTX code with cuobjdump (provided your build was set up to include PTX in the executable), but since PTX is compiled into SASS by an optimizing compiler, you really want to look at SASS to see what actually gets to run on the GPU.
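The invocations would look something like this (`my_app.exe` is a placeholder for your actual executable name):

```shell
# Disassemble the embedded GPU machine code (SASS)
cuobjdump --dump-sass my_app.exe

# Dump the intermediate PTX code, if the build embedded it
cuobjdump --dump-ptx my_app.exe
```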

I don’t use IDEs for building CUDA code, so I am afraid I can’t help you there.

It only outputs the 3x code before it gives the same access violation on the 5x code, and then it just aborts. Any idea why?

This doesn’t seem right. The GTX 1060 is a device with compute capability 6.1, so you would want to build your executable using -arch=sm_61, or equivalent -gencode settings.

What CUDA version are you using for your builds? I am using version 8, on Windows.

This is the shortened command line:

nvcc.exe -gencode=arch=compute_30,code="sm_30,compute_30" -gencode=arch=compute_35,code="sm_35,compute_35" -gencode=arch=compute_37,code="sm_37,compute_37" -gencode=arch=compute_50,code="sm_50,compute_50" -gencode=arch=compute_52,code="sm_52,compute_52" -gencode=arch=compute_60,code="sm_60,compute_60" -gencode=arch=compute_61,code="sm_61,compute_61" -gencode=arch=compute_70,code="sm_70,compute_70" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.13.26128\bin\HostX86\x64" -x cu -I./ -I../../common/inc -I./ -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2/include" -I../../common/inc -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -Xcompiler "/wd 4819" -DWIN32 -DWIN32 -D_MBCS -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Ox /FS /Zi /MT " -o x64/Release/fmul.cu.obj "C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.2\0_Simple\FMUL\fmul.cu"


I would suggest removing all build target architectures other than compute capability 6.1 while you are debugging the present issue. While all the other platforms are possibly needed for your production code, it always helps to cut down code to just what is needed for the platform at hand.
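Concretely, the long -gencode list could be cut down to a single entry while debugging (the rest of the command line, elided here, stays as it was):

```shell
# Build only for the GTX 1060 (compute capability 6.1)
nvcc.exe -gencode=arch=compute_61,code="sm_61,compute_61" ... --compile -o x64/Release/fmul.cu.obj fmul.cu
```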

I have already tried that :P I only recently started adding the additional platforms to see if it would help.

Fatbin elf code:

arch = sm_61
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit

code for sm_61
	Function : _Z4fmuljjPjS_

cuobjdump error : ‘nvdisasm’ died with status 0xC0000005 (ACCESS_VIOLATION)

This tool seems to not let me dump any code from anything newer than 3x…

That’s weird. I don’t have any ideas what’s going on. Maybe there is a bug in the disassembler, maybe you have a broken CUDA installation.

If you simply want to see how changing the logical operator in the source code impacts the generated code, compiling for sm_30 should be sufficient. But it may not tell the whole story: the compiler backend for sm_30 is different from the one for sm_61, and these backends are optimizing compilers, not just assemblers, contrary to what the name PTXAS may suggest (this is the component that translates PTX to SASS).

I built your code with CUDA 8.0 using -arch=sm_61, and ran it on a Quadro P2000, which is quite similar to a GTX 1060 in its specifications.

I can disassemble the generated machine code just fine. There do not seem to be any major differences between using XOR and AND. The version using AND printed “Done in 1 mins”. The version using XOR printed “Done in 5 mins”. What kind of run time would you expect? Please note that I was running two other long-running CUDA-accelerated apps at the same time I was running yours.

Have you checked your code for race conditions and non-deterministic behavior (e.g. accessing uninitialized or out-of-range data)?

My code changes were as follows:

#define USE_XOR  1

#if USE_XOR
    const uint32_t r = (t & A1) ^ (t >> FL);
#else
    const uint32_t r = (t & A1) & (t >> FL);
#endif

The line of code is actually: const uint32_t r = (t & A1) ^ (t >> FL);

The slow & is between t and A1 in the first bracket. I have a brand new install of CUDA 9.2 and VS 2017.

Can you compile and run this code to completion and see how long it takes? With & it takes my computer 25 minutes, with anything else it’s just < 100ms (instant). There is clearly a bizarre bug going on here.

Here is the new(est) code, please try running it!


Sorry, got that wrong. I have now fixed that. The disassembly works fine and shows no major differences, mostly just a different LOP3 flavor depending on whether AND or XOR is used. But the algorithm seems data dependent, implying run time will be a function of what kind of data is being generated to feed into the atomic add. And the data differs between AND and XOR. Which would make the substitution a red herring?

With your latest code I get this (I suspended my other CUDA apps):

C:\Users\Norbert\My Programs>logical_issue.exe
Starting 20-bit...

30 unusual -1's
       0 unusual 0's
       0 unusuals

 0: 0
-1: 0


Done in 374 ms

If I run the app repeatedly, the output and the run time changes with every run. It looks like you have non-deterministic code somewhere.

Hmm that is definitely not the correct answer. Did you compile my latest code? I made many changes!

Set FL to 18 bits instead of 20 bits (for my sake). I can complete the 18-bit field:

Starting 18-bit...

 2562531 unusual -1's
 1519968 unusual 0's
  404864 unusuals

 0: 3564223
-1: 5649345


Done in 4555 ms

I have the results (CPU calculated) for other fields. I am not sure why your compile would not generate a correct result.

The code is definitely deterministic, since I get correct (CPU-verified) results consistently.