Thank you both, @njuffa and @striker159, for these clarifications.
The IMAD.WIDE primitive being the actual hardware target makes perfect sense and explains why my manual mad.lo.cc.u32 / madc.hi.u32 approach didn’t deliver the expected speedup.
When I integrated it into the real kernel, throughput actually dropped by about 4% compared to the original; the explicit carry chains likely prevented ptxas from coalescing the operations into IMAD.WIDE.
I'm going to follow your suggested approach: rather than writing 32-bit PTX with explicit carries, write the multiply at a higher level using mul.wide.u32 plus 64-bit addition, and trust ptxas to coalesce it into IMAD.WIDE.
A few specific questions before I attempt the rewrite…
- Should the accumulator be u64 or kept as paired u32 registers? The NVIDIA-generated code uses 64-bit operands throughout. Does ptxas handle the column accumulation better when operands are presented as 64-bit?
- For my reduction step (multiply by a small constant that fits in ~33 bits): would you express this as a sequence of mul.wide.u32 limb[i], constant + accumulation in 64-bit, or is there a better idiom for “multiply a wide integer by a constant smaller than 64 bits”?
- Final carry-out of the 256-bit result: even with IMAD.WIDE absorbing most carries, the final accumulation across the limbs eventually needs some explicit carry propagation. Is there a recommended way to minimize this, or does it not matter much in practice?
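To make the second question concrete, here is a minimal sketch of the "mul.wide per limb plus 64-bit accumulation" idea, written as plain C++ so the arithmetic can be checked on the host. Everything named here is hypothetical: the function `mul256_by_small_const`, the `HD` macro (an assumption about how one would reuse the code in a kernel), the little-endian 32-bit limb layout, and the example constant 2^32 + 977 used in the self-check. It splits the constant into a 32-bit low part and a high part and runs the same carry chain twice; whether ptxas maps each step to IMAD.WIDE.U32 is exactly what is being asked, so no performance claim is implied.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __CUDACC__
#define HD __host__ __device__   // usable from kernels when built with nvcc
#else
#define HD                       // plain host C++ otherwise
#endif

// r[0..9] = a[0..7] * c, little-endian 32-bit limbs.
// Assumption: c < 2^33, so c_hi is 0 or 1 and r[9] is at most 1.
// (The arithmetic itself stays correct for any 64-bit c.)
HD inline void mul256_by_small_const(uint32_t r[10], const uint32_t a[8], uint64_t c)
{
    const uint32_t c_lo = (uint32_t)c;
    const uint32_t c_hi = (uint32_t)(c >> 32);

    // Pass 1: r = a * c_lo. Each step is a 32x32 product plus a 32-bit
    // carry-in, which always fits in 64 bits.
    uint64_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc = (uint64_t)a[i] * c_lo + (acc >> 32);
        r[i] = (uint32_t)acc;
    }
    r[8] = (uint32_t)(acc >> 32);

    // Pass 2: r += (a * c_hi) << 32, the same chain shifted up by one limb.
    // Product + existing limb + carry-in is at most 2^64 - 1, so no overflow.
    acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc = (uint64_t)a[i] * c_hi + r[i + 1] + (acc >> 32);
        r[i + 1] = (uint32_t)acc;
    }
    r[9] = (uint32_t)(acc >> 32);
}

// Host-side sanity check: 2 * (2^32 + 977) = 2 * 2^32 + 1954.
inline void mul256_by_small_const_selfcheck()
{
    const uint32_t a[8] = {2, 0, 0, 0, 0, 0, 0, 0};
    uint32_t r[10];
    mul256_by_small_const(r, a, 0x1000003D1ull);
    assert(r[0] == 0x7A2u && r[1] == 2u && r[2] == 0u);
}
```

The only carries left in this form are the `acc >> 32` terms, which ride along as the addend of the next multiply-add; the explicit carry propagation asked about in the third question is confined to the last limb of each pass.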
I'd rather learn the right idiom from you than continue trial and error with PTX patterns that look correct but get rejected by the ptxas optimizer.
Test 1: mul.wide.u32 + add.u64 in a single asm block. Does ptxas coalesce the explicit mul.wide followed by 64-bit add into a single IMAD.WIDE.U32? This is the most direct expression of the operation in PTX.
Test 2: direct mad.wide.u32 PTX. Does this PTX instruction even exist in the toolkit? If yes, what does it compile to? If no, we get a compile error and confirm your statement that there’s no direct PTX for IMAD.WIDE.
Test 3: pure C++ (uint64_t)a * b + c. Without any inline asm, does the C++ compiler recognize this idiom and emit IMAD.WIDE? This would be the cleanest way to use IMAD.WIDE if it works.
Test 4: 256×32->288 multiply chain using mul.wide.u32 + manual accumulation. A real-world pattern showing how a wide multiply with explicit limb shuffling looks. Tests whether the move-back-and-forth between u64 and paired u32 registers (which you mentioned makes the code unwieldy) actually costs SASS instructions or gets optimized away.
Test 5: 64×64->128 multiply via four mul.wide.u32. The classical decomposition of a 64-bit multiply into four 32-bit wide multiplies plus accumulation. Compares directly with the simpler mul.lo.u64/mul.hi.u64 approach to see which produces less SASS.
Test 6: four independent (uint64_t)a*b + c operations in C++. Tests whether ptxas schedules independent IMAD.WIDE operations in parallel (instruction-level parallelism). If the four operations are truly independent at SASS level, they should be issued without serialization.
Test 7: multiply chain with carry propagation via (p >> 32). This is the pattern I think you’re describing: instead of explicit add.cc/addc.cc, propagate carry by adding the upper 32 bits of one product into the next. If ptxas recognizes this idiom, it should generate a clean chain of IMAD.WIDE with the upper half flowing naturally into the next multiply’s accumulator.
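The arithmetic behind Tests 3 and 5 can be sketched in plain C++ as follows. This is an illustrative reconstruction, not the kernel source that produced the SASS in this post; the names `mad_wide_u32`, `mul64x64_via_wide`, `U128`, and the `HD` macro are hypothetical. The point is that the 64×64->128 decomposition can be written entirely in terms of the Test 3 multiply-add, with the carries folded into the addends.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __CUDACC__
#define HD __host__ __device__   // usable from kernels when built with nvcc
#else
#define HD                       // plain host C++ otherwise
#endif

struct U128 { uint64_t lo, hi; };

// Test 3 idiom: 32x32 -> 64-bit product plus a 64-bit addend.
// The result wraps modulo 2^64 if the sum does not fit.
HD inline uint64_t mad_wide_u32(uint32_t a, uint32_t b, uint64_t c)
{
    return (uint64_t)a * b + c;
}

// Test 5 idiom: 64x64 -> 128 from four 32x32 -> 64 products.
// Each addend is small enough that no intermediate sum overflows 64 bits.
HD inline U128 mul64x64_via_wide(uint64_t a, uint64_t b)
{
    const uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    const uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    const uint64_t p00  = mad_wide_u32(a_lo, b_lo, 0);
    const uint64_t mid1 = mad_wide_u32(a_hi, b_lo, p00 >> 32);
    const uint64_t mid2 = mad_wide_u32(a_lo, b_hi, (uint32_t)mid1);
    const uint64_t high = mad_wide_u32(a_hi, b_hi, (mid1 >> 32) + (mid2 >> 32));

    U128 r;
    r.lo = (mid2 << 32) | (uint32_t)p00;
    r.hi = high;
    return r;
}

// Host-side sanity check: (2^64 - 1)^2 = 2^128 - 2^65 + 1.
inline void mul64x64_via_wide_selfcheck()
{
    const U128 r = mul64x64_via_wide(~0ull, ~0ull);
    assert(r.lo == 1ull && r.hi == 0xFFFFFFFFFFFFFFFEull);
}
```

On the device, the natural comparison point is the built-in pair `a * b` and `__umul64hi(a, b)`, which corresponds to the mul.lo.u64/mul.hi.u64 approach mentioned in Test 5.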
I’ll post the full SASS output once compilation completes. Test 7 is the one I’m most curious about — if pure C++ multiply-add chain becomes a tight IMAD.WIDE sequence at SASS level, that’s the idiom we should use throughout.
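For reference, the Test 7 pattern amounts to something like the sketch below: a 4-limb by 1-limb multiply where each step adds the upper half of the previous product into the next multiply. It is a hedged illustration rather than the exact test source; `mul128_by_u32_chain` and the `HD` macro are made-up names, and the 4-limb width is only inferred from the four limb loads visible in the test7 SASS.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __CUDACC__
#define HD __host__ __device__   // usable from kernels when built with nvcc
#else
#define HD                       // plain host C++ otherwise
#endif

// r[0..4] = a[0..3] * b, little-endian 32-bit limbs.
// No add.cc/addc-style carry flags: the carry is simply the upper 32 bits
// of the previous 64-bit product, fed in as the addend of the next one.
HD inline void mul128_by_u32_chain(uint32_t r[5], const uint32_t a[4], uint32_t b)
{
    uint64_t p = (uint64_t)a[0] * b;
    r[0] = (uint32_t)p;

    p = (uint64_t)a[1] * b + (p >> 32);   // product + 32-bit carry fits in 64 bits
    r[1] = (uint32_t)p;

    p = (uint64_t)a[2] * b + (p >> 32);
    r[2] = (uint32_t)p;

    p = (uint64_t)a[3] * b + (p >> 32);
    r[3] = (uint32_t)p;

    r[4] = (uint32_t)(p >> 32);           // final carry-out limb
}

// Host-side sanity check: (2^32 + 1) * 3 = 3 * 2^32 + 3.
inline void mul128_by_u32_chain_selfcheck()
{
    const uint32_t a[4] = {1, 1, 0, 0};
    uint32_t r[5];
    mul128_by_u32_chain(r, a, 3u);
    assert(r[0] == 3u && r[1] == 3u && r[2] == 0u && r[3] == 0u && r[4] == 0u);
}
```

Because the addend is a zero-extended 32-bit value, each step has the shape of a 32×32+64 multiply-add, which is the form the IMAD.WIDE.U32 lines in the test7 listing take.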
code for sm_89
Function : _Z22test7_chain_with_carryPyPjS0_
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fe40000000f00 */
/*0010*/ S2R R10, SR_TID.X ; /* 0x00000000000a7919 */
/* 0x000e220000002100 */
/*0020*/ MOV R3, 0x4 ; /* 0x0000000400037802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe20000000a00 */
/*0040*/ SHF.L.U32 R2, R10, 0x2, RZ ; /* 0x000000020a027819 */
/* 0x001fc600000006ff */
/*0050*/ IMAD.WIDE.U32 R4, R10, R3, c[0x0][0x170] ; /* 0x00005c000a047625 */
/* 0x000fc800078e0003 */
/*0060*/ IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x168] ; /* 0x00005a0002027625 */
/* 0x000fe400078e0003 */
/*0070*/ LDG.E R5, [R4.64] ; /* 0x0000000404057981 */
/* 0x000ea8000c1e1900 */
/*0080*/ LDG.E R6, [R2.64] ; /* 0x0000000402067981 */
/* 0x000ea8000c1e1900 */
/*0090*/ LDG.E R11, [R2.64+0x4] ; /* 0x00000404020b7981 */
/* 0x000ee8000c1e1900 */
/*00a0*/ LDG.E R0, [R2.64+0xc] ; /* 0x00000c0402007981 */
/* 0x000f28000c1e1900 */
/*00b0*/ LDG.E R13, [R2.64+0x8] ; /* 0x00000804020d7981 */
/* 0x000f62000c1e1900 */
/*00c0*/ IMAD.MOV.U32 R9, RZ, RZ, RZ ; /* 0x000000ffff097224 */
/* 0x000fe200078e00ff */
/*00d0*/ MOV R12, 0x8 ; /* 0x00000008000c7802 */
/* 0x000fe20000000f00 */
/*00e0*/ IMAD.WIDE.U32 R6, R5, R6, RZ ; /* 0x0000000605067225 */
/* 0x004fca00078e00ff */
/*00f0*/ MOV R8, R7 ; /* 0x0000000700087202 */
/* 0x000fe40000000f00 */
/*0100*/ SHF.L.U32 R7, R10, 0x1, RZ ; /* 0x000000010a077819 */
/* 0x000fe200000006ff */
/*0110*/ IMAD R15, R0, R5.reuse, RZ ; /* 0x00000005000f7224 */
/* 0x090fe400078e02ff */
/*0120*/ IMAD.WIDE.U32 R8, R11, R5, R8 ; /* 0x000000050b087225 */
/* 0x008fe200078e0008 */
/*0130*/ MOV R11, RZ ; /* 0x000000ff000b7202 */
/* 0x000fc60000000f00 */
/*0140*/ IMAD.MOV.U32 R10, RZ, RZ, R9 ; /* 0x000000ffff0a7224 */
/* 0x000fc800078e0009 */
/*0150*/ IMAD.WIDE.U32 R10, R13, R5, R10 ; /* 0x000000050d0a7225 */
/* 0x020fc800078e000a */
/*0160*/ IMAD.WIDE.U32 R4, R7, R12, c[0x0][0x160] ; /* 0x0000580007047625 */
/* 0x000fe200078e000c */
/*0170*/ IADD3 R11, R11, R15, RZ ; /* 0x0000000f0b0b7210 */
/* 0x000fe40007ffe0ff */
/*0180*/ MOV R7, R8 ; /* 0x0000000800077202 */
/* 0x000fc60000000f00 */
/*0190*/ STG.E.64 [R4.64+0x8], R10 ; /* 0x0000080a04007986 */
/* 0x000fe8000c101b04 */
/*01a0*/ STG.E.64 [R4.64], R6 ; /* 0x0000000604007986 */
/* 0x000fe2000c101b04 */
/*01b0*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*01c0*/ BRA 0x1c0; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*01d0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01e0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01f0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0200*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0210*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0220*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0230*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0240*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0250*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0260*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0270*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........
Function : _Z24test6_parallel_imad_widePyPjS0_S_
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fe40000000f00 */
/*0010*/ S2R R16, SR_TID.X ; /* 0x0000000000107919 */
/* 0x000e220000002100 */
/*0020*/ MOV R5, 0x4 ; /* 0x0000000400057802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe20000000a00 */
/*0040*/ MOV R25, 0x8 ; /* 0x0000000800197802 */
/* 0x000fe40000000f00 */
/*0050*/ SHF.L.U32 R16, R16, 0x2, RZ ; /* 0x0000000210107819 */
/* 0x001fca00000006ff */
/*0060*/ IMAD.WIDE.U32 R2, R16, R5, c[0x0][0x168] ; /* 0x00005a0010027625 */
/* 0x000fc800078e0005 */
/*0070*/ IMAD.WIDE.U32 R4, R16.reuse, R5, c[0x0][0x170] ; /* 0x00005c0010047625 */
/* 0x040fe200078e0005 */
/*0080*/ LDG.E R17, [R2.64] ; /* 0x0000000402117981 */
/* 0x000ea6000c1e1900 */
/*0090*/ IMAD.WIDE.U32 R6, R16, R25, c[0x0][0x178] ; /* 0x00005e0010067625 */
/* 0x000fe200078e0019 */
/*00a0*/ LDG.E R0, [R4.64] ; /* 0x0000000404007981 */
/* 0x000ea8000c1e1900 */
/*00b0*/ LDG.E.64 R8, [R6.64] ; /* 0x0000000406087981 */
/* 0x000ea8000c1e1b00 */
/*00c0*/ LDG.E R19, [R2.64+0x4] ; /* 0x0000040402137981 */
/* 0x000ee8000c1e1900 */
/*00d0*/ LDG.E R21, [R2.64+0x8] ; /* 0x0000080402157981 */
/* 0x000f28000c1e1900 */
/*00e0*/ LDG.E R23, [R2.64+0xc] ; /* 0x00000c0402177981 */
/* 0x000f68000c1e1900 */
/*00f0*/ LDG.E R18, [R4.64+0x4] ; /* 0x0000040404127981 */
/* 0x000ee8000c1e1900 */
/*0100*/ LDG.E R20, [R4.64+0x8] ; /* 0x0000080404147981 */
/* 0x000f28000c1e1900 */
/*0110*/ LDG.E R22, [R4.64+0xc] ; /* 0x00000c0404167981 */
/* 0x000f68000c1e1900 */
/*0120*/ LDG.E.64 R10, [R6.64+0x8] ; /* 0x00000804060a7981 */
/* 0x000ee8000c1e1b00 */
/*0130*/ LDG.E.64 R12, [R6.64+0x10] ; /* 0x00001004060c7981 */
/* 0x000f28000c1e1b00 */
/*0140*/ LDG.E.64 R14, [R6.64+0x18] ; /* 0x00001804060e7981 */
/* 0x000f62000c1e1b00 */
/*0150*/ IMAD.WIDE.U32 R8, R17, R0, R8 ; /* 0x0000000011087225 */
/* 0x004fc800078e0008 */
/*0160*/ IMAD.WIDE.U32 R16, R16, R25, c[0x0][0x160] ; /* 0x0000580010107625 */
/* 0x000fca00078e0019 */
/*0170*/ STG.E.64 [R16.64], R8 ; /* 0x0000000810007986 */
/* 0x000fe2000c101b04 */
/*0180*/ IMAD.WIDE.U32 R10, R19, R18, R10 ; /* 0x00000012130a7225 */
/* 0x008fc800078e000a */
/*0190*/ IMAD.WIDE.U32 R12, R21, R20, R12 ; /* 0x00000014150c7225 */
/* 0x010fc800078e000c */
/*01a0*/ IMAD.WIDE.U32 R14, R23, R22, R14 ; /* 0x00000016170e7225 */
/* 0x020fe200078e000e */
/*01b0*/ STG.E.64 [R16.64+0x8], R10 ; /* 0x0000080a10007986 */
/* 0x000fe8000c101b04 */
/*01c0*/ STG.E.64 [R16.64+0x10], R12 ; /* 0x0000100c10007986 */
/* 0x000fe8000c101b04 */
/*01d0*/ STG.E.64 [R16.64+0x18], R14 ; /* 0x0000180e10007986 */
/* 0x000fe2000c101b04 */
/*01e0*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*01f0*/ BRA 0x1f0; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0200*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0210*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0220*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0230*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0240*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0250*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0260*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0270*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........
Function : _Z20test5_64bit_via_widePyS_S_
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fe40000000f00 */
/*0010*/ S2R R12, SR_TID.X ; /* 0x00000000000c7919 */
/* 0x000e220000002100 */
/*0020*/ MOV R13, 0x8 ; /* 0x00000008000d7802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fc80000000a00 */
/*0040*/ IMAD.WIDE.U32 R2, R12, R13, c[0x0][0x168] ; /* 0x00005a000c027625 */
/* 0x001fc800078e000d */
/*0050*/ IMAD.WIDE.U32 R4, R12.reuse, R13, c[0x0][0x170] ; /* 0x00005c000c047625 */
/* 0x040fe400078e000d */
/*0060*/ LDG.E.64 R2, [R2.64] ; /* 0x0000000402027981 */
/* 0x000ea8000c1e1b00 */
/*0070*/ LDG.E.64 R4, [R4.64] ; /* 0x0000000404047981 */
/* 0x000ea2000c1e1b00 */
/*0080*/ SHF.L.U32 R12, R12, 0x1, RZ ; /* 0x000000010c0c7819 */
/* 0x000fca00000006ff */
/*0090*/ IMAD.WIDE.U32 R12, R12, R13, c[0x0][0x160] ; /* 0x000058000c0c7625 */
/* 0x000fc800078e000d */
/*00a0*/ IMAD.WIDE.U32 R8, R3, R4, RZ ; /* 0x0000000403087225 */
/* 0x004fc800078e00ff */
/*00b0*/ IMAD.WIDE.U32 R6, R2, R4, RZ ; /* 0x0000000402067225 */
/* 0x000fc800078e00ff */
/*00c0*/ IMAD.WIDE.U32 R10, R2, R5, R8 ; /* 0x00000005020a7225 */
/* 0x000fc800078e0008 */
/*00d0*/ IMAD.WIDE.U32 R8, R3, R5, RZ ; /* 0x0000000503087225 */
/* 0x000fe200078e00ff */
/*00e0*/ IADD3 R7, P0, R7, R10, RZ ; /* 0x0000000a07077210 */
/* 0x000fc80007f1e0ff */
/*00f0*/ IADD3.X R10, P0, R8, R11, RZ, P0, !PT ; /* 0x0000000b080a7210 */
/* 0x000fe2000071e4ff */
/*0100*/ STG.E.64 [R12.64], R6 ; /* 0x000000060c007986 */
/* 0x000fe6000c101b04 */
/*0110*/ IADD3.X R11, RZ, R9, RZ, P0, !PT ; /* 0x00000009ff0b7210 */
/* 0x000fca00007fe4ff */
/*0120*/ STG.E.64 [R12.64+0x8], R10 ; /* 0x0000080a0c007986 */
/* 0x000fe2000c101b04 */
/*0130*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0140*/ BRA 0x140; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0150*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0160*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0170*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0180*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0190*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01a0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01b0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01c0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01d0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01e0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01f0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........
Function : _Z18test4_chain_4_widePyPjS0_
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fe40000000f00 */
/*0010*/ S2R R10, SR_TID.X ; /* 0x00000000000a7919 */
/* 0x000e220000002100 */
/*0020*/ MOV R3, 0x4 ; /* 0x0000000400037802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe20000000a00 */
/*0040*/ SHF.L.U32 R2, R10, 0x2, RZ ; /* 0x000000020a027819 */
/* 0x001fc600000006ff */
/*0050*/ IMAD.WIDE.U32 R4, R10, R3, c[0x0][0x170] ; /* 0x00005c000a047625 */
/* 0x000fc800078e0003 */
/*0060*/ IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x168] ; /* 0x00005a0002027625 */
/* 0x000fe400078e0003 */
/*0070*/ LDG.E R5, [R4.64] ; /* 0x0000000404057981 */
/* 0x000ea8000c1e1900 */
/*0080*/ LDG.E R6, [R2.64] ; /* 0x0000000402067981 */
/* 0x000ea8000c1e1900 */
/*0090*/ LDG.E R8, [R2.64+0x4] ; /* 0x0000040402087981 */
/* 0x000ee8000c1e1900 */
/*00a0*/ LDG.E R13, [R2.64+0x8] ; /* 0x00000804020d7981 */
/* 0x000f22000c1e1900 */
/*00b0*/ MOV R11, 0x8 ; /* 0x00000008000b7802 */
/* 0x000fc40000000f00 */
/*00c0*/ SHF.L.U32 R10, R10, 0x1, RZ ; /* 0x000000010a0a7819 */
/* 0x000fca00000006ff */
/*00d0*/ IMAD.WIDE.U32 R10, R10, R11, c[0x0][0x160] ; /* 0x000058000a0a7625 */
/* 0x000fc800078e000b */
/*00e0*/ IMAD.WIDE.U32 R6, R6, R5, RZ ; /* 0x0000000506067225 */
/* 0x004fc800078e00ff */
/*00f0*/ IMAD.WIDE.U32 R8, R8, R5, RZ ; /* 0x0000000508087225 */
/* 0x008fca00078e00ff */
/*0100*/ IADD3 R7, P0, R7, R8, RZ ; /* 0x0000000807077210 */
/* 0x000fc80007f1e0ff */
/*0110*/ IADD3.X R8, RZ, R9, RZ, P0, !PT ; /* 0x00000009ff087210 */
/* 0x000fe200007fe4ff */
/*0120*/ STG.E.64 [R10.64], R6 ; /* 0x000000060a007986 */
/* 0x000fe2000c101b04 */
/*0130*/ MOV R9, RZ ; /* 0x000000ff00097202 */
/* 0x000fca0000000f00 */
/*0140*/ IMAD.WIDE.U32 R8, R13, R5, R8 ; /* 0x000000050d087225 */
/* 0x010fca00078e0008 */
/*0150*/ STG.E.64 [R10.64+0x8], R8 ; /* 0x000008080a007986 */
/* 0x000fe2000c101b04 */
/*0160*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0170*/ BRA 0x170; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0180*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0190*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01a0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01b0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01c0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01d0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01e0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01f0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........
Function : _Z16test3_cpp_nativePyPjS0_S_
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fe40000000f00 */
/*0010*/ S2R R10, SR_TID.X ; /* 0x00000000000a7919 */
/* 0x000e220000002100 */
/*0020*/ MOV R5, 0x4 ; /* 0x0000000400057802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe20000000a00 */
/*0040*/ MOV R11, 0x8 ; /* 0x00000008000b7802 */
/* 0x000fc60000000f00 */
/*0050*/ IMAD.WIDE.U32 R2, R10, R5, c[0x0][0x168] ; /* 0x00005a000a027625 */
/* 0x001fc800078e0005 */
/*0060*/ IMAD.WIDE.U32 R4, R10.reuse, R5, c[0x0][0x170] ; /* 0x00005c000a047625 */
/* 0x040fe400078e0005 */
/*0070*/ LDG.E R3, [R2.64] ; /* 0x0000000402037981 */
/* 0x000ea4000c1e1900 */
/*0080*/ IMAD.WIDE.U32 R6, R10.reuse, R11.reuse, c[0x0][0x178] ; /* 0x00005e000a067625 */
/* 0x0c0fe400078e000b */
/*0090*/ LDG.E R4, [R4.64] ; /* 0x0000000404047981 */
/* 0x000ea8000c1e1900 */
/*00a0*/ LDG.E.64 R6, [R6.64] ; /* 0x0000000406067981 */
/* 0x000ea2000c1e1b00 */
/*00b0*/ IMAD.WIDE.U32 R10, R10, R11, c[0x0][0x160] ; /* 0x000058000a0a7625 */
/* 0x000fc800078e000b */
/*00c0*/ IMAD.WIDE.U32 R8, R3, R4, R6 ; /* 0x0000000403087225 */
/* 0x004fca00078e0006 */
/*00d0*/ STG.E.64 [R10.64], R8 ; /* 0x000000080a007986 */
/* 0x000fe2000c101b04 */
/*00e0*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*00f0*/ BRA 0xf0; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0100*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0110*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0120*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0130*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0140*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0150*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0160*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0170*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........
Function : _Z14test2_mad_widePyPjS0_S_
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fe40000000f00 */
/*0010*/ S2R R0, SR_TID.X ; /* 0x0000000000007919 */
/* 0x000e220000002100 */
/*0020*/ MOV R5, 0x4 ; /* 0x0000000400057802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe20000000a00 */
/*0040*/ MOV R13, 0x8 ; /* 0x00000008000d7802 */
/* 0x000fc60000000f00 */
/*0050*/ IMAD.WIDE.U32 R2, R0, R5, c[0x0][0x168] ; /* 0x00005a0000027625 */
/* 0x001fc800078e0005 */
/*0060*/ IMAD.WIDE.U32 R4, R0.reuse, R5, c[0x0][0x170] ; /* 0x00005c0000047625 */
/* 0x040fe400078e0005 */
/*0070*/ LDG.E R2, [R2.64] ; /* 0x0000000402027981 */
/* 0x000ea4000c1e1900 */
/*0080*/ IMAD.WIDE.U32 R6, R0, R13, c[0x0][0x178] ; /* 0x00005e0000067625 */
/* 0x000fe400078e000d */
/*0090*/ LDG.E R5, [R4.64] ; /* 0x0000000404057981 */
/* 0x000ea8000c1e1900 */
/*00a0*/ LDG.E.64 R6, [R6.64] ; /* 0x0000000406067981 */
/* 0x000ee2000c1e1b00 */
/*00b0*/ IMAD.WIDE.U32 R8, R2, R5, RZ ; /* 0x0000000502087225 */
/* 0x004fca00078e00ff */
/*00c0*/ IADD3 R10, P0, R6, R8, RZ ; /* 0x00000008060a7210 */
/* 0x008fc80007f1e0ff */
/*00d0*/ IADD3.X R11, R7, R9, RZ, P0, !PT ; /* 0x00000009070b7210 */
/* 0x000fe200007fe4ff */
/*00e0*/ IMAD.WIDE.U32 R8, R0, R13, c[0x0][0x160] ; /* 0x0000580000087625 */
/* 0x000fca00078e000d */
/*00f0*/ STG.E.64 [R8.64], R10 ; /* 0x0000000a08007986 */
/* 0x000fe2000c101b04 */
/*0100*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0110*/ BRA 0x110; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0120*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0130*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0140*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0150*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0160*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0170*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0180*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0190*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01a0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01b0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01c0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01d0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01e0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*01f0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........
Function : _Z18test1_wide_add_u64PyPjS0_S_
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fe40000000f00 */
/*0010*/ S2R R10, SR_TID.X ; /* 0x00000000000a7919 */
/* 0x000e220000002100 */
/*0020*/ MOV R5, 0x4 ; /* 0x0000000400057802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe20000000a00 */
/*0040*/ MOV R11, 0x8 ; /* 0x00000008000b7802 */
/* 0x000fc60000000f00 */
/*0050*/ IMAD.WIDE.U32 R2, R10, R5, c[0x0][0x168] ; /* 0x00005a000a027625 */
/* 0x001fc800078e0005 */
/*0060*/ IMAD.WIDE.U32 R4, R10.reuse, R5, c[0x0][0x170] ; /* 0x00005c000a047625 */
/* 0x040fe400078e0005 */
/*0070*/ LDG.E R3, [R2.64] ; /* 0x0000000402037981 */
/* 0x000ea4000c1e1900 */
/*0080*/ IMAD.WIDE.U32 R6, R10.reuse, R11.reuse, c[0x0][0x178] ; /* 0x00005e000a067625 */
/* 0x0c0fe400078e000b */
/*0090*/ LDG.E R4, [R4.64] ; /* 0x0000000404047981 */
/* 0x000ea8000c1e1900 */
/*00a0*/ LDG.E.64 R6, [R6.64] ; /* 0x0000000406067981 */
/* 0x000ea2000c1e1b00 */
/*00b0*/ IMAD.WIDE.U32 R10, R10, R11, c[0x0][0x160] ; /* 0x000058000a0a7625 */
/* 0x000fc800078e000b */
/*00c0*/ IMAD.WIDE.U32 R8, R3, R4, R6 ; /* 0x0000000403087225 */
/* 0x004fca00078e0006 */
/*00d0*/ STG.E.64 [R10.64], R8 ; /* 0x000000080a007986 */
/* 0x000fe2000c101b04 */
/*00e0*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*00f0*/ BRA 0xf0; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0100*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0110*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0120*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0130*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0140*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0150*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0160*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0170*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........