Quick question for CUDA expert(s):
Is the modulo operator in CUDA efficient on the Volta V100 GPU? That is to say, are there situations where I should be concerned about performance slow-downs when using the modulo operator within a CUDA kernel?
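A modulo operation with a compile-time constant divisor will be heavily optimized by the compiler using well-known techniques that have been around for 20+ years: a bit mask for unsigned power-of-two divisors, and multiplication by a precomputed reciprocal followed by a shift for other constants.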
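To make the constant-divisor case concrete, here is a minimal sketch of the kind of code a compiler typically generates for an unsigned 32-bit remainder. The function names `mod10` and `mod16` are made up for illustration, and the exact sequence a given compiler emits may differ.

```cpp
#include <cstdint>

// Remainder of x divided by the constant 10, without a divide instruction.
// 0xCCCCCCCD is ceil(2^35 / 10); multiplying by it and shifting right by 35
// yields floor(x / 10) for every 32-bit unsigned x.
inline uint32_t mod10(uint32_t x) {
    const uint64_t product = static_cast<uint64_t>(x) * 0xCCCCCCCDull;
    const uint32_t q = static_cast<uint32_t>(product >> 35);
    return x - q * 10u;  // remainder = x - quotient * divisor
}

// For an unsigned power-of-two divisor the remainder is just a bit mask.
inline uint32_t mod16(uint32_t x) {
    return x & 15u;
}
```

You would normally just write `x % 10` and let the compiler do this; the sketch only shows why the constant case is cheap.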
A modulo operation with a variable divisor is a “slow” operation, possibly slightly more so than in host code running on the CPU. While CPUs based on x86, ARM, or Power have a (slow) hardware instruction for integer division, GPUs have no such instruction and execute a canned instruction sequence instead.
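If the divisor is only known at run time but stays the same across many operations, one option is to precompute a reciprocal once and reuse it. Below is a sketch of that idea for 32-bit unsigned operands, following the published "fastmod" approach of Lemire et al. It is not what the CUDA compiler does for you, the names `FastMod`, `make_fastmod` and `fastmod` are hypothetical, and whether it actually beats the built-in `%` on a given GPU would need to be measured.

```cpp
#include <cstdint>

// Hypothetical helper: a divisor together with its precomputed 64-bit reciprocal.
struct FastMod {
    uint64_t magic;    // floor((2^64 - 1) / divisor) + 1, wraps to 0 for divisor 1
    uint32_t divisor;
};

// Precompute once per divisor. Assumes d != 0.
inline FastMod make_fastmod(uint32_t d) {
    return FastMod{UINT64_MAX / d + 1, d};
}

// Computes a % fm.divisor using two multiplications and no division.
inline uint32_t fastmod(uint32_t a, const FastMod& fm) {
    const uint64_t lowbits = fm.magic * a;  // intentionally wraps modulo 2^64
    // The remainder is the high 64 bits of the 96-bit product
    // lowbits * divisor, assembled here from 32-bit halves.
    const uint64_t hi = lowbits >> 32;
    const uint64_t lo = lowbits & 0xFFFFFFFFull;
    return static_cast<uint32_t>(
        (hi * fm.divisor + ((lo * fm.divisor) >> 32)) >> 32);
}
```

In device code the high half of a 64-bit product could presumably be obtained with the `__umul64hi` intrinsic instead of the manual split shown here.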
Before you start worrying about this, I would suggest using the CUDA profiler to determine the bottlenecks in your code. In many cases CUDA kernels are memory bound and not compute bound.
Division on modern x86 processors is quite fast. In my testing, I found it to be only a few percent slower than multiplication. I had been in the habit of changing /2 to *0.5 and similar things, but I stopped since it made little to no difference. I have no idea how it is on other processors, but I expect it would be slow because most use an iterative algorithm.
I think you will find that this applies to floating-point division only, which also matches the example you provided. On x86, floating-point division is down to single-digit cycle execution times now. There should never have been a need to replace floating-point division by two with multiplication manually, as compilers have been routinely applying that substitution for decades.
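The reason compilers can make this substitution for a divisor of two, but not for arbitrary divisors, is that 0.5 is exactly representable, so both forms round the same exact value. A small sketch to check this empirically (the function name `count_mismatches` is made up, and it assumes strict IEEE-754 evaluation, i.e. no fast-math flags):

```cpp
// Counts the integers x in [1, n] for which x / d and x * (1 / d)
// produce different single-precision results.
inline int count_mismatches(float d, int n) {
    const float recip = 1.0f / d;  // rounded reciprocal of the divisor
    int mismatches = 0;
    for (int i = 1; i <= n; ++i) {
        const float x = static_cast<float>(i);
        if (x / d != x * recip) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With a divisor of 2.0f or 4.0f the count should be zero. With 3.0f it is not, for example 5.0f / 3.0f and 5.0f * (1.0f / 3.0f) differ in the last bit, which is why the compiler leaves such divisions alone unless told that relaxed accuracy is acceptable.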
With the help of FMA, it is possible to accelerate floating-point division by constants other than powers of two, but with the high-speed FP divide on x86, that probably doesn’t make sense anymore. The technique still applies to GPUs, though you might have to apply it manually.
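One common formulation of the FMA approach is sketched below: multiply by the rounded reciprocal, then use two fused multiply-adds to compute the residual and correct the quotient. This is an illustration rather than necessarily the exact sequence meant above, the name `div_by_const` is hypothetical, and it ignores overflow, underflow and special values such as infinities and NaNs.

```cpp
#include <cmath>

// Approximates x / d using a precomputed reciprocal recip = 1.0f / d.
// Assumes d is a finite, nonzero constant and that no intermediate
// result overflows or underflows.
inline float div_by_const(float x, float d, float recip) {
    const float q = x * recip;           // first estimate of the quotient
    const float r = std::fma(-d, q, x);  // residual x - d * q, single rounding
    return std::fma(r, recip, q);        // corrected quotient
}
```

For instance, `div_by_const(5.0f, 3.0f, 1.0f / 3.0f)` repairs the last-bit error that a plain multiplication by the reciprocal would leave. In CUDA device code the same sequence could be written with `fmaf`.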
[Later:] Checking Agner Fog’s instruction tables, it seems I slightly misremembered. For Skylake, it shows FP division (VDIVPS) latency at 11 cycles, vs 4 cycles for FP multiplication (VMULPS).