The documentation claims that the modulo operator (%) is very expensive in terms of compute time. Is there a good average figure for the instruction cost of modulo, or does it have branching behavior that would slow the warp?
On most architectures, modulo costs about the same as a divide. I don’t know whether that also holds for CUDA hardware, but I doubt it involves any branching.
I finally did some testing on this.
To check modulo, I used the following kernel:
g_SharedMemory[threadIdx.x]=(int)threadIdx.x%(int)threadIdx.y;
The generated code is this:
[codebox]000000: a0000005 04000780 cvt.u32.u16 $r1, $r0.lo
000008: d0800211 00400780 and.b16 $r2.lo, $r0.hi, c1[0x0000]
000010: 10000201 0403c780 mov.b32 $r0, $r1
000018: a0000809 04000780 cvt.u32.u16 $r2, $r2.lo
000020: 20009003 00000780 call.label label0
000028: 00020205 c0000780 movsh.b32 $ofs1, $r1, 0x00000002
000030: a0000001 44014780 cvt.rn.f32.s32 $r0, $r0
000038: 04000801 e4200780 mov.b32 s[$ofs1+0x0010], $r0
000040: 30000003 00000780 return
000048: a000040d 04114780 label0: cvt.u32.s32 $r3, $r2 (Unknown subsubop 45)
000050: a0000611 44004780 cvt.rn.f32.u32 $r4, $r3
000058: a0000015 04114780 cvt.u32.s32 $r5, $r0 (Unknown subsubop 45)
000060: 90000819 00000780 rcp.f32 $r6, $r4
000068: a0000a11 44064780 cvt.rz.f32.u32 $r4, $r5
000070: 203e8c19 0fffffff add.b32 $r6, $r6, 0xfffffffe
000078: c0060811 0000c7c0 mul.rz.f32 $p0|$r4, $r4, $r6
000080: a0000811 84064780 cvt.rzi.u32.f32 $r4, $r4
000088: 40090c1d 00000780 mul24.lo.u32.u16.u16 $r7, $r3.lo, $r4.hi
000090: 60080e1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.hi, $r4.lo, $r7
000098: 30100e1d c4100780 shl.u32 $r7, $r7, 0x00000010
0000a0: 60080c1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.lo, $r4.lo, $r7
0000a8: 20400a1d 0401c780 sub.u32 $r7, $r5, $r7
0000b0: a0000e1d 44064780 cvt.rz.f32.u32 $r7, $r7
0000b8: c0060e19 0000c7c0 mul.rz.f32 $p0|$r6, $r7, $r6
0000c0: a0000c19 84064780 cvt.rzi.u32.f32 $r6, $r6
0000c8: 20000811 04018780 add.u32 $r4, $r4, $r6
0000d0: 40061219 00000780 mul24.lo.u32.u16.u16 $r6, $r4.hi, $r3.lo
0000d8: 60071019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.hi, $r6
0000e0: 30100c19 c4100780 shl.u32 $r6, $r6, 0x00000010
0000e8: 60061019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.lo, $r6
0000f0: 30000c19 04014780 subr.u32 $r6, $r6, $r5
0000f8: 30060619 6400c780 set.le.u32 $r6, $r3, $r6
000100: 30000c11 04010780 subr.u32 $r4, $r6, $r4
000108: 40061219 00000780 mul24.lo.u32.u16.u16 $r6, $r4.hi, $r3.lo
000110: 60071019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.hi, $r6
000118: 30100c19 c4100780 shl.u32 $r6, $r6, 0x00000010
000120: 6006100d 00018780 mad24.lo.u32.u16.u16.u32 $r3, $r4.lo, $r3.lo, $r6
000128: 301f0001 e4100780 shr.u32 $r0, $r0, 0x0000001f
000130: 30000611 04014780 subr.u32 $r4, $r3, $r5
000138: a000000d 2c014780 cvt.neg.s32 $r3, $r0
000140: d004060d 04008780 xor.b32 $r3, $r3, $r4
000148: 307c05fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r60 (unk0 00400000)
000150: 20000001 0400c780 add.u32 $r0, $r0, $r3
000158: d0020001 0402c500 @$p0.equ not.b32 $r0, $r2
000160: 30000003 00000780 return
000168: f0000001 e0000001 nop.end
[/codebox]
So no jumps, just one conditional write (embarrassingly, I got this wrong in another post ;) ). If I just add up the instructions, I come up with 140 cycles max. Of course, real-world performance will not be quite that bad.
For integer division, I get
[codebox]000000: a0000005 04000780 cvt.u32.u16 $r1, $r0.lo
000008: d0800211 00400780 and.b16 $r2.lo, $r0.hi, c1[0x0000]
000010: 10000201 0403c780 mov.b32 $r0, $r1
000018: a0000809 04000780 cvt.u32.u16 $r2, $r2.lo
000020: 20009003 00000780 call.label label0
000028: 00020205 c0000780 movsh.b32 $ofs1, $r1, 0x00000002
000030: a0000001 44014780 cvt.rn.f32.s32 $r0, $r0
000038: 04000801 e4200780 mov.b32 s[$ofs1+0x0010], $r0
000040: 30000003 00000780 return
000048: a000040d 04114780 label0: cvt.u32.s32 $r3, $r2 (Unknown subsubop 45)
000050: a0000611 44004780 cvt.rn.f32.u32 $r4, $r3
000058: a0000015 04114780 cvt.u32.s32 $r5, $r0 (Unknown subsubop 45)
000060: 90000819 00000780 rcp.f32 $r6, $r4
000068: a0000a11 44064780 cvt.rz.f32.u32 $r4, $r5
000070: 203e8c19 0fffffff add.b32 $r6, $r6, 0xfffffffe
000078: c0060811 0000c7c0 mul.rz.f32 $p0|$r4, $r4, $r6
000080: a0000811 84064780 cvt.rzi.u32.f32 $r4, $r4
000088: 40090c1d 00000780 mul24.lo.u32.u16.u16 $r7, $r3.lo, $r4.hi
000090: 60080e1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.hi, $r4.lo, $r7
000098: 30100e1d c4100780 shl.u32 $r7, $r7, 0x00000010
0000a0: 60080c1d 0001c780 mad24.lo.u32.u16.u16.u32 $r7, $r3.lo, $r4.lo, $r7
0000a8: 20400a1d 0401c780 sub.u32 $r7, $r5, $r7
0000b0: a0000e1d 44064780 cvt.rz.f32.u32 $r7, $r7
0000b8: c0060e19 0000c7c0 mul.rz.f32 $p0|$r6, $r7, $r6
0000c0: a0000c19 84064780 cvt.rzi.u32.f32 $r6, $r6
0000c8: 20000811 04018780 add.u32 $r4, $r4, $r6
0000d0: 40061219 00000780 mul24.lo.u32.u16.u16 $r6, $r4.hi, $r3.lo
0000d8: 60071019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.hi, $r6
0000e0: 30100c19 c4100780 shl.u32 $r6, $r6, 0x00000010
0000e8: 60061019 00018780 mad24.lo.u32.u16.u16.u32 $r6, $r4.lo, $r3.lo, $r6
0000f0: 30000c15 04014780 subr.u32 $r5, $r6, $r5
0000f8: 3005060d 6400c780 set.le.u32 $r3, $r3, $r5
000100: d0000401 04008780 xor.b32 $r0, $r2, $r0
000108: 301f0001 e4100780 shr.u32 $r0, $r0, 0x0000001f
000110: 30000611 04010780 subr.u32 $r4, $r3, $r4
000118: a000000d 2c014780 cvt.neg.s32 $r3, $r0
000120: d004060d 04008780 xor.b32 $r3, $r3, $r4
000128: 307c05fd 6c0147c8 set.ne.s32 $p0|$o127, $r2, $r60 (unk0 00400000)
000130: 20000001 0400c780 add.u32 $r0, $r0, $r3
000138: d0020001 0402c500 @$p0.equ not.b32 $r0, $r2
000140: 30000003 00000780 return
000148: f0000001 e0000001 nop.end
[/codebox]
It’s exactly the same code up to offset 0x0000f0.
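The tail end of both listings (the shr.u32 by 0x1f, cvt.neg.s32, and xor.b32 instructions) is the usual branch-free sign fixup: do the work unsigned, then conditionally negate. Roughly, in C (my own reconstruction, names mine; INT_MIN edge cases ignored):

```c
#include <assert.h>

/* Signed division on top of unsigned division, matching C's
 * truncate-toward-zero semantics. The generated code folds this into
 * the xor/shr/neg sequence at the end of the listings. */
static int sdiv_via_udiv(int n, int d) {
    unsigned un = n < 0 ? 0u - (unsigned)n : (unsigned)n;
    unsigned ud = d < 0 ? 0u - (unsigned)d : (unsigned)d;
    unsigned uq = un / ud;
    /* The quotient is negative iff exactly one operand is negative --
     * the xor of the two sign bits in the listing. */
    return ((n ^ d) < 0) ? -(int)uq : (int)uq;
}
```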
Anyway, don’t use modulo or idiv in an inner loop. Ever.
Fortunately, I only have to use it once or twice.
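Worth adding: when the divisor is a compile-time power of two and the operands are unsigned, compilers (nvcc included, as far as I've seen) replace the division and modulo with a single shift or mask, so those cases are cheap. The equivalent by hand:

```c
#include <assert.h>
#include <stdint.h>

/* For unsigned x and a power-of-two divisor, division and modulo
 * collapse to single bit operations. */
static uint32_t div_pow2(uint32_t x) { return x >> 5;  }  /* x / 32 */
static uint32_t mod_pow2(uint32_t x) { return x & 31u; }  /* x % 32 */
```

Note this only works cleanly for unsigned operands; for signed ones the compiler still has to emit a sign fixup around the shift.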
Another question:
I have to use modulo on 64 bit operands. Will nvcc be able to compile this down for CUDA devices?
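As far as I know, nvcc does accept 64-bit integer operands, but since the hardware has no native 64-bit divide it expands them into an even longer emulated sequence, so the advice above applies doubly. If your divisor fits in 32 bits, the work can at least be folded down using the identity x = hi·2^32 + lo, so x mod d = ((hi mod d)·(2^32 mod d) + lo mod d) mod d. A sketch (names mine; it still leans on C's built-in `%` internally, so it illustrates the decomposition rather than replacing the emulation):

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a 64-bit value mod a 32-bit divisor via its 32-bit halves.
 * With d < 2^32, no intermediate product here can overflow 64 bits:
 * (d-1)*(d-1) + (d-1) < 2^64 for all 32-bit d. Requires d > 0. */
static uint32_t mod64by32(uint64_t x, uint32_t d) {
    uint64_t hi = x >> 32;
    uint64_t lo = (uint32_t)x;
    uint64_t base = (1ull << 32) % d;  /* 2^32 mod d, constant if d is known */
    return (uint32_t)(((hi % d) * base + lo % d) % d);
}
```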