Speed of modulo operator in CUDA

rhaney · September 13, 2019, 3:48pm

Good morning,

Quick question for CUDA expert(s):
Is the execution speed of the modulo operator using CUDA optimal for the Volta V100 GPU? That is to say, are there situations where I should be concerned about potential performance slow-downs when using the modulo operator within a CUDA kernel?

Thanks.

njuffa · September 13, 2019, 7:04pm

A modulo operation with a compile-time constant divisor will be heavily optimized by the compiler using well-known techniques that have been around for 20+ years.
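As a rough illustration of the kind of technique meant here, the sketch below replaces `x % 10` with a multiplication by a precomputed "magic" reciprocal, a shift, and a multiply-subtract. The function names `div10` and `mod10` and the `HD` macro are made up for this example. The compiler does this rewrite on its own for constant divisors, so hand-writing it is normally unnecessary, and the actual instruction sequence it emits may differ.

```cpp
#include <cstdint>

// Lets the same source build with a plain C++ compiler or with nvcc.
// HD is a name invented for this sketch.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// x / 10 for any 32-bit unsigned x, using one multiply and one shift.
// 0xCCCCCCCD is ceil(2^35 / 10); the product always fits in 64 bits.
HD inline uint32_t div10(uint32_t x) {
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * 0xCCCCCCCDull) >> 35);
}

// x % 10 recovered from the quotient: remainder = x - (x / 10) * 10.
HD inline uint32_t mod10(uint32_t x) {
    return x - div10(x) * 10u;
}
```

In real code, simply writing `x % 10u` (or using a `const`/`constexpr` divisor) should let the compiler pick an equivalent sequence.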

A modulo operation with a variable divisor is a “slow” operation, possibly slightly more so than in host code running on the CPU. While CPUs based on x86, ARM, Power have a (slow) hardware instruction for integer division, GPUs use a canned instruction sequence.
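One common way to sidestep the slow path, not mentioned in this thread and only applicable in a special case, is to restrict a run-time divisor to powers of two so that the remainder reduces to a bit mask. The sketch below assumes unsigned operands; `is_pow2`, `mod_pow2`, and the `HD` macro are hypothetical names for this example.

```cpp
#include <cstdint>

// Lets the same source build with a plain C++ compiler or with nvcc.
// HD is a name invented for this sketch.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// True when d is a nonzero power of two.
HD inline bool is_pow2(uint32_t d) {
    return d != 0u && (d & (d - 1u)) == 0u;
}

// x % d, valid ONLY when d is a power of two (caller must guarantee this).
// The low bits of x are exactly the remainder in that case.
HD inline uint32_t mod_pow2(uint32_t x, uint32_t d) {
    return x & (d - 1u);
}
```

A caller would typically check `is_pow2(d)` once on the host before launching a kernel that uses `mod_pow2`, and fall back to the ordinary `%` otherwise.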

Before you start worrying about this, I would suggest using the CUDA profiler to determine the bottlenecks in your code. In many cases CUDA kernels are memory bound and not compute bound.

ryork · September 13, 2019, 7:38pm

Division on modern x86 processors is quite fast. In my testing I found it to be only a few percent slower than multiplication. I had been in the habit of changing /2 to *0.5 and similar things, and I stopped since it made little to no difference. I have no idea how it is on other processors, but I expect it would be slow because most use an iterative algorithm.

njuffa · September 13, 2019, 7:40pm

I think you will find that this applies to floating-point division only, which also matches the example you provided. On x86, floating-point division is down to single-digit cycle execution times now. There should never have been a need to replace floating-point division by two with multiplication manually, as compilers have been routinely applying that substitution for decades.

With the help of FMA, it is possible to accelerate floating-point division by constants other than powers of two, but with the high-speed FP divide on x86, that probably doesn’t make sense anymore. It is still applicable to GPUs, though you might have to do it manually.
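A minimal sketch of what such a manual replacement might look like, assuming the reciprocal `rc = 1.0f / c` is computed once ahead of time: multiply by the reciprocal, then use two FMAs to compute the residual and correct the estimate. The name `div_by_const` and the `HD` macro are hypothetical. This is not guaranteed to match `x / c` bit for bit for every input, and it does not handle infinities, overflow, or subnormal results.

```cpp
#include <math.h>

// Lets the same source build with a plain C++ compiler or with nvcc.
// HD is a name invented for this sketch.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Approximates x / c, where rc must be 1.0f / c computed in advance.
// Assumes finite inputs and c != 0; an infinite x would produce NaN here.
HD inline float div_by_const(float x, float c, float rc) {
    float q = x * rc;          // first estimate of the quotient
    float r = fmaf(-c, q, x);  // residual x - c*q with a single rounding
    return fmaf(r, rc, q);     // estimate corrected by the residual
}
```

For example, with `c = 3.0f` and `rc = 1.0f / 3.0f` hoisted out of a loop, each `div_by_const(x, c, rc)` call costs one multiply and two FMAs instead of a division.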

[Later:] Checking Agner Fog’s instruction tables, it seems I slightly misremembered. For Skylake, it shows FP division (VDIVPS) latency at 11 cycles, vs 4 cycles for FP multiplication (VMULPS).

rhaney · September 13, 2019, 8:34pm

Thanks to all for the great responses.

The divisor is a const unsigned int, so (if I am reading everyone’s posts correctly) it should not be an issue.

njuffa · September 13, 2019, 9:04pm

Based on your description, you should be fine, assuming the dividend is also an unsigned integer of some kind.

When in doubt, it is never a bad idea to check the generated machine code (SASS): cuobjdump --dump-sass
