Modulo Cost

The documentation claims that the modulo operator (%) is very expensive in terms of compute time. Is there a good average figure for the instruction cost of modulo, or does it have branching properties that would slow down the warp?

In most architectures modulo costs about the same as a divide. I don’t know if that is also true for CUDA hardware. I doubt that it involves any branching though.

I finally did some testing on this.

To check modulo, I used the following kernel:

g_SharedMemory[threadIdx.x] = (int)threadIdx.x % (int)threadIdx.y;
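That statement sat in a bare test kernel along these lines. This is only a sketch, not the exact code: the array size is arbitrary, and the element type is float simply to match the cvt.rn.f32.s32 before the shared-memory store in the listing below.

[codebox]__global__ void modTest()
{
    // sketch of the test kernel; size is a placeholder, element type
    // chosen to match the cvt.rn.f32.s32 before the store in the listing
    __shared__ float g_SharedMemory[256];

    g_SharedMemory[threadIdx.x] = (int)threadIdx.x % (int)threadIdx.y;
}
[/codebox]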

The generated code is this:

[codebox]000000: a0000005 04000780 cvt.u32.u16 $r1, $r0.lo
000008: d0800211 00400780 and.b16 $r2.lo, $r0.hi, c1[0x0000]
000010: 10000201 0403c780 mov.b32 $r0, $r1
000018: a0000809 04000780 cvt.u32.u16 $r2, $r2.lo
000020: 20009003 00000780 call.label label0
000028: 00020205 c0000780 movsh.b32 $ofs1, $r1, 0x00000002
000030: a0000001 44014780 cvt.rn.f32.s32 $r0, $r0
000038: 04000801 e4200780 mov.b32 s[$ofs1+0x0010], $r0
000040: 30000003 00000780 return
000048: a000040d 04114780 label0: cvt.u32.s32 $r3, $r2 (Unknown subsubop 45)
000050: a0000611 44004780 cvt.rn.f32.u32 $r4, $r3
000058: a0000015 04114780 cvt.u32.s32 $r5, $r0 (Unknown subsubop 45)
000060: 90000819 00000780 rcp.f32 $r6, $r4
000068: a0000a11 44064780 cvt.rz.f32.u32 $r4, $r5
000070: 203e8c19 0fffffff add.b32 $r6, $r6, 0xfffffffe
000078: c0060811 0000c7c0 mul.rz.f32 $p0|$r4, $r4, $r6
000080: a0000811 84064780 cvt.rzi.u32.f32 $r4, $r4
000088: 40090c1d 00000780 mul24.lo.u32.u16.u16 $r7, $r3.lo, $r4.hi
000090: 60080e1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.hi, $r4.lo, $r7
000098: 30100e1d c4100780 shl.u32 $r7, $r7, 0x00000010
0000a0: 60080c1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.lo, $r4.lo, $r7
0000a8: 20400a1d 0401c780 sub.u32 $r7, $r5, $r7
0000b0: a0000e1d 44064780 cvt.rz.f32.u32 $r7, $r7
0000b8: c0060e19 0000c7c0 mul.rz.f32 $p0|$r6, $r7, $r6
0000c0: a0000c19 84064780 cvt.rzi.u32.f32 $r6, $r6
0000c8: 20000811 04018780 add.u32 $r4, $r4, $r6
0000d0: 40061219 00000780 mul24.lo.u32.u16.u16 $r6, $r4.hi, $r3.lo
0000d8: 60071019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.hi, $r6
0000e0: 30100c19 c4100780 shl.u32 $r6, $r6, 0x00000010
0000e8: 60061019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.lo, $r6
0000f0: 30000c19 04014780 subr.u32 $r6, $r6, $r5
0000f8: 30060619 6400c780 set.le.u32 $r6, $r3, $r6
000100: 30000c11 04010780 subr.u32 $r4, $r6, $r4
000108: 40061219 00000780 mul24.lo.u32.u16.u16 $r6, $r4.hi, $r3.lo
000110: 60071019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.hi, $r6
000118: 30100c19 c4100780 shl.u32 $r6, $r6, 0x00000010
000120: 6006100d 00018780 mad24.lo.u32.u16.u16.u32 $r3, $r4.lo, $r3.lo, $r6
000128: 301f0001 e4100780 shr.u32 $r0, $r0, 0x0000001f
000130: 30000611 04014780 subr.u32 $r4, $r3, $r5
000138: a000000d 2c014780 cvt.neg.s32 $r3, $r0
000140: d004060d 04008780 xor.b32 $r3, $r3, $r4
000148: 307c05fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r60 (unk0 00400000)
000150: 20000001 0400c780 add.u32 $r0, $r0, $r3
000158: d0020001 0402c500 @$p0.equ not.b32 $r0, $r2
000160: 30000003 00000780 return
000168: f0000001 e0000001 nop.end
[/codebox]

So no jumps, just one conditional write (embarrassingly, I got this wrong in another post ;) ). The subroutine computes the quotient through a float reciprocal approximation followed by correction steps. If I just add up the instructions (about 35 in the subroutine, at 4 cycles per instruction per warp), I come up with 140 cycles max. Of course, real life will not be quite as bad.

For integer division, I get:

[codebox]000000: a0000005 04000780 cvt.u32.u16 $r1, $r0.lo
000008: d0800211 00400780 and.b16 $r2.lo, $r0.hi, c1[0x0000]
000010: 10000201 0403c780 mov.b32 $r0, $r1
000018: a0000809 04000780 cvt.u32.u16 $r2, $r2.lo
000020: 20009003 00000780 call.label label0
000028: 00020205 c0000780 movsh.b32 $ofs1, $r1, 0x00000002
000030: a0000001 44014780 cvt.rn.f32.s32 $r0, $r0
000038: 04000801 e4200780 mov.b32 s[$ofs1+0x0010], $r0
000040: 30000003 00000780 return
000048: a000040d 04114780 label0: cvt.u32.s32 $r3, $r2 (Unknown subsubop 45)
000050: a0000611 44004780 cvt.rn.f32.u32 $r4, $r3
000058: a0000015 04114780 cvt.u32.s32 $r5, $r0 (Unknown subsubop 45)
000060: 90000819 00000780 rcp.f32 $r6, $r4
000068: a0000a11 44064780 cvt.rz.f32.u32 $r4, $r5
000070: 203e8c19 0fffffff add.b32 $r6, $r6, 0xfffffffe
000078: c0060811 0000c7c0 mul.rz.f32 $p0|$r4, $r4, $r6
000080: a0000811 84064780 cvt.rzi.u32.f32 $r4, $r4
000088: 40090c1d 00000780 mul24.lo.u32.u16.u16 $r7, $r3.lo, $r4.hi
000090: 60080e1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.hi, $r4.lo, $r7
000098: 30100e1d c4100780 shl.u32 $r7, $r7, 0x00000010
0000a0: 60080c1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.lo, $r4.lo, $r7
0000a8: 20400a1d 0401c780 sub.u32 $r7, $r5, $r7
0000b0: a0000e1d 44064780 cvt.rz.f32.u32 $r7, $r7
0000b8: c0060e19 0000c7c0 mul.rz.f32 $p0|$r6, $r7, $r6
0000c0: a0000c19 84064780 cvt.rzi.u32.f32 $r6, $r6
0000c8: 20000811 04018780 add.u32 $r4, $r4, $r6
0000d0: 40061219 00000780 mul24.lo.u32.u16.u16 $r6, $r4.hi, $r3.lo
0000d8: 60071019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.hi, $r6
0000e0: 30100c19 c4100780 shl.u32 $r6, $r6, 0x00000010
0000e8: 60061019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.lo, $r6
0000f0: 30000c15 04014780 subr.u32 $r5, $r6, $r5
0000f8: 3005060d 6400c780 set.le.u32 $r3, $r3, $r5
000100: d0000401 04008780 xor.b32 $r0, $r2, $r0
000108: 301f0001 e4100780 shr.u32 $r0, $r0, 0x0000001f
000110: 30000611 04010780 subr.u32 $r4, $r3, $r4
000118: a000000d 2c014780 cvt.neg.s32 $r3, $r0
000120: d004060d 04008780 xor.b32 $r3, $r3, $r4
000128: 307c05fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r60 (unk0 00400000)
000130: 20000001 0400c780 add.u32 $r0, $r0, $r3
000138: d0020001 0402c500 @$p0.equ not.b32 $r0, $r2
000140: 30000003 00000780 return
000148: f0000001 e0000001 nop.end
[/codebox]

It’s exactly the same code up to 0x0000f0. The modulo version just appends an extra multiply-and-subtract sequence at the end to recover the remainder from the quotient (r = a - (a/b)*b), while the division version returns the quotient directly.
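A practical consequence: if you need both the quotient and the remainder, there is no point paying for the subroutine twice. Something along these lines (just a sketch, the function name is made up) gets the remainder almost for free once the division is done:

[codebox]// sketch: get quotient and remainder from a single division
// instead of issuing both '/' and '%' (each expands to the long subroutine)
__device__ void divmod(unsigned int a, unsigned int b,
                       unsigned int *q, unsigned int *r)
{
    *q = a / b;           // one pass through the division subroutine
    *r = a - (*q) * b;    // same value as a % b, for just a mul and a sub
}
[/codebox]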

Anyway, don’t use modulo or idiv in an inner loop. Ever.
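One exception worth knowing: if the divisor is a power of two, you can replace the modulo with a bitwise AND (and the division with a shift), which costs a single instruction. A sketch, assuming n really is a power of two:

[codebox]// valid only when n is a power of two:
//   x % n  ==  x & (n - 1)
//   x / n  ==  x >> log2(n)
__device__ unsigned int modPow2(unsigned int x, unsigned int n)
{
    return x & (n - 1);   // one instruction instead of the whole subroutine
}
[/codebox]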

Fortunately, I only have to use it once or twice.

Another question:

I have to use modulo on 64-bit operands. Will nvcc be able to compile this down for CUDA devices?
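One way to check is to compile a trivial kernel containing a 64-bit modulo and inspect what nvcc emits; something like this sketch (names are just placeholders):

[codebox]// quick check: compile with "nvcc -ptx mod64.cu" and look at the output
__global__ void mod64Test(unsigned long long *out,
                          unsigned long long a, unsigned long long b)
{
    out[threadIdx.x] = a % b;   // 64-bit modulo under test
}
[/codebox]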