First off, we are talking about a release build here, correct? Are you building code for the correct compute capability? You would want to use the CUDA profiler to tell you about performance bottlenecks in the code. You may not be exposing the right amount of parallelism; examine your launch configuration. A rough first target for a GTX 1060 would be to have at least 10,000 threads executing in parallel.
The amount of code shown is insufficient to diagnose the effect of switching around logical operations. Like all modern compilers, the CUDA compiler aggressively optimizes logical expressions. It’s possible that by changing the AND to an XOR the compiler determines that the expression always evaluates to some fixed value, propagates that constant further through the code, and eliminates vast portions of it through dead code elimination. That would make the kernel execute very quickly.
You would want to look at the generated machine code (SASS) with cuobjdump --dump-sass to get a feel for what happens to the code based on your changes.
I doubt that FL is a bit mask as stated, as it is used as a shift factor in the code. Given that ‘r’ is a uint32_t, is it necessary for A1 and FL to be UL? Is that UL as in “unsigned long”? If so, don’t use that type: the bit width of unsigned long differs across platforms (32 bits on 64-bit Windows, 64 bits on 64-bit Linux). Use uint64_t or uint32_t as appropriate.
Unlike most modern CPUs, which are 64-bit processors, GPUs are essentially 32-bit processors with 64-bit addressing capabilities. As a consequence, 64-bit integer operations are emulated. These emulations are usually efficient: a 64-bit logical operation is simply split into two 32-bit logical operations.
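The split is mechanical, as this C sketch shows: a 64-bit AND decomposes into one 32-bit AND on the low halves and one on the high halves, which is essentially what the GPU's emulation emits.

```c
#include <stdint.h>

/* Emulate a 64-bit logical AND using only 32-bit AND operations,
   mirroring how the GPU splits the work across two instructions. */
uint64_t and64_via_32(uint64_t a, uint64_t b) {
    uint32_t lo = (uint32_t)a         & (uint32_t)b;          /* low word  */
    uint32_t hi = (uint32_t)(a >> 32) & (uint32_t)(b >> 32);  /* high word */
    return ((uint64_t)hi << 32) | lo;
}
```

The two 32-bit operations are independent, so they pipeline well; this is why 64-bit logical operations on the GPU cost roughly twice a 32-bit one rather than incurring a large emulation penalty.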