Squeezing the last 17.5% out of a compute-bound 256-bit modular arithmetic kernel (sm_89, 82.5% SM throughput)

I have a highly optimized CUDA kernel performing intensive 256-bit modular field arithmetic on RTX 4090 (sm_89). The kernel does repeated modular multiplications, subtractions, and inversions on 4×u64 limb representations, followed by some additional per-element computation. After extensive optimization over several months, Nsight Compute shows 82.5% SM throughput, 19.5% DRAM throughput, and 4.8% L1 throughput — clearly compute-bound.

I believe the remaining 17.5% gap is caused by Read-After-Write dependency stalls in carry-propagated 256-bit integer arithmetic (add.cc.u64/addc.cc.u64 chains). I’m looking for techniques to fill these pipeline bubbles, or confirmation that this is the architectural ceiling for carry-chain-dominated integer workloads on Ada Lovelace.

Workload Profile

Each thread performs ~2,100 modular multiplications (256-bit) and ~8,000 modular subtractions (256-bit) per kernel launch, plus some additional fixed-cost computation per element. The field arithmetic uses a special-form prime where 2^256 ≡ C (mod p) with C being a small constant, enabling single-multiply reduction instead of Barrett (3 multiplies) or Montgomery.

Field Arithmetic Implementation (4×u64 limbs)

ModMult (~60% of kernel time)

// 256×256 → 512-bit schoolbook multiplication using 4 × UMult
#define UMult(r, a, b) {\
  UMULLO(r[0], a[0], b);              \
  UMULLO(r[1], a[1], b);              \
  MADDO(r[1], a[0], b, r[1]);         \
  UMULLO(r[2], a[2], b);              \
  MADDC(r[2], a[1], b, r[2]);         \
  UMULLO(r[3], a[3], b);              \
  MADDC(r[3], a[2], b, r[3]);         \
  MADD(r[4], a[3], b, 0ULL);          \
}

The PTX macros:

#define UMULLO(lo,a,b) asm volatile("mul.lo.u64 %0, %1, %2;" : "=l"(lo) : "l"(a), "l"(b));
#define UMULHI(hi,a,b) asm volatile("mul.hi.u64 %0, %1, %2;" : "=l"(hi) : "l"(a), "l"(b));
#define MADDO(r,a,b,c) asm volatile("mad.hi.cc.u64 %0, %1, %2, %3;" : "=l"(r) : "l"(a), "l"(b), "l"(c) : "memory");
#define MADDC(r,a,b,c) asm volatile("madc.hi.cc.u64 %0, %1, %2, %3;" : "=l"(r) : "l"(a), "l"(b), "l"(c) : "memory");
#define MADD(r,a,b,c)  asm volatile("madc.hi.u64 %0, %1, %2, %3;" : "=l"(r) : "l"(a), "l"(b), "l"(c));

After the 512-bit product, reduction uses the special prime form — one multiply by a small constant instead of full Barrett reduction.

ModSub (~30% of kernel time, branchless)

// Branchless subtraction with conditional add-p
USUBO(r[0], a[0], b[0]);       // sub with borrow-out
USUBC(r[1], a[1], b[1]);       // depends on previous borrow  ← chain
USUBC(r[2], a[2], b[2]);       // depends on previous borrow  ← chain
USUBC(r[3], a[3], b[3]);       // depends on previous borrow  ← chain
// then branchless conditional add-p using borrow mask

ModInv (Bernstein-Yang DivStep62)

~50 multiply-equivalents per inversion, amortized over 1024 elements via Montgomery’s batch inversion trick.

Nsight Compute Results (consistent across 17 kernel launches)

Metric                                            Value
──────────────────────────────────────────────── ──────
sm__throughput.avg.pct_of_peak_sustained_elapsed   82.50%
dram__throughput.avg.pct_of_peak_sustained_elapsed 19.50%
l1tex__throughput.avg.pct_of_peak_sustained_elapsed 4.77%

ptxas Output

ptxas info: Function properties for kernel:
    16464 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 128 registers, 388 bytes cmem[0]

  • 128 registers, 0 spills — all field arithmetic in registers

  • 16 KB stack — workspace array for batch inversion

  • ~33% occupancy (65536 regs/SM ÷ 128 × 32 = 16 warps/SM)

Compiler Flags

nvcc -O3 -arch=sm_89 --use_fast_math -maxrregcount=0 -Xptxas -dlcm=ca

What I’ve Already Tried

Optimization attempted Result
Hand-written PTX for reduction fusion in ModMult NVCC generates equal or better code
Full ASM block for ModMult 10% slower — NVCC optimizes macro chains better
-maxrregcount=64 to double occupancy Net loss — compute-bound, not latency-bound
Carry-free addition using predicates (carry-lookahead) ~18 instructions vs 4 — instruction count increase negates ILP gain (theoretical analysis, not implemented)
Two-way ILP interleave (two independent computations per thread) Register usage doubles to ~200+, occupancy drops; also CC register clobber prevents actual interleaving of carry chains
Redundant representation (62-bit limbs in 64-bit words) Normalization before final use reintroduces carry chain; extra limb increases multiply cost by ~56%
Karatsuba multiplication (3 sub-multiplies vs 4) No gain on GPU — extra additions negate saved multiply
Various loop unroll depths and block sizes Current configuration is optimal
Minimized `__syncthreads One remaining (required for shared data init)

The Core Problem

As discussed in this 2013 thread by SPWorley and @njuffa PTX has only one Condition Code (CC) register per thread. In the ModMult carry chain:

MADDO(r[1], a[0], b, r[1]);    // sets CC.CF
MADDC(r[2], a[1], b, r[2]);    // reads CC.CF, sets CC.CF  ← must wait
MADDC(r[3], a[2], b, r[3]);    // reads CC.CF, sets CC.CF  ← must wait
MADD(r[4], a[3], b, 0ULL);     // reads CC.CF              ← must wait

My kernel’s main loop has two independent computation paths per iteration. These are completely independent, but their carry chains cannot be interleaved because they share the single CC register. For comparison, Intel x86 solved this with adcx/adox — two independent carry chains using CF and OF flags.

Questions

  1. Is there any sm_89 trick to get ILP across carry chains? Any way to use predicate registers as auxiliary carry storage, or scheduling hints that help the compiler fill CC-dependency bubbles?

  2. Has anyone measured carry-free (predicate-based) 256-bit arithmetic on Ada Lovelace that actually outperformed carry-propagated arithmetic? The instruction count increase (~4× for addition) seems prohibitive, but I haven’t seen real benchmarks.

  3. Is 82.5% SM throughput the architectural ceiling for mul.lo.u64/mad.hi.cc.u64-heavy kernels on sm_89? njuffa noted that “integer multiply throughput tends to be the limiting factor” for multi-precision codes. If the bottleneck is integer multiply throughput rather than carry stalls, then no restructuring helps.

  4. What SASS patterns indicate CC-dependency stalls vs integer multiply throughput limits? When inspecting nvdisasm output, how can I distinguish between these two bottlenecks?

  5. Any other approach I haven’t considered for squeezing more throughput out of a 256-bit modular multiply dominated kernel?

Environment

  • GPU: NVIDIA GeForce RTX 4090 (sm_89, Ada Lovelace)

  • CUDA Toolkit: 12.4

Happy to share SASS dumps or additional Nsight metrics if helpful.

Some general remarks:

(1) I am generally suspicious of the use of asm volatile. By my understanding, leaving off the volatile gives the compiler greater wiggle room for valid code transformations. The backend compiler ptxas likely has hardware knowledge built into it that is not available to the general public, and use of volatile may inhibit some transformations it could otherwise utilize. I would suggest using a single asm block (without volatile) for the entire 256-bit mul.wide operation.

(2) GPUs use a 32-bit architecture with 64-bit addressing capability. 64-bit integer operations at PTX level are therefore emulated. In consequence, it is very likely that compositing wide arithmetic from emulated instructions is less efficient than compositing them from PTX operations that map 1:1 to native hardware instructions.

(3) There are multiple ways to decompose wide multiplications into basic hardware operations. On all platforms (not just GPUs) it is usually (but not always) a good idea to choose a decomposition that minimizes the use of carry-flag operations. With the exception of sm_5x and sm_6x the various GPU architectures >= sm-30 have used a fairly similar set of multiplication operations in hardware, but the optimal decomposition technique will likely depend on GPU architecture. One area of likely differences in optimality between architectures is in the use of multiply-add primitives versus use of separate adds.

(4) My memory is hazy, but I seem to recall that the most recent GPU architectures abolished the use of a carry flag for multi-word arithmetic at hardware level and replaced it with any of the seven (?) programmer-useable predicate registers. Examining the generated code a SASS level is definitely needed to make sure the compiler massages performance-critical code into something close to what one would ideally expect.

Some years ago, I posted code for various wide multiplies (I think up to 128x128->256 bits) optimized for various older GPU architecture in these forums. I cannot find the relevant post immediately, but it may contain some ideas useful for the issue at hand.

To set expectations, I will mention that it is generally impossible to achieve 100% of theoretical throughput for compute-bound kernels, and typically one winds up with something like 85% of theoretical maximum as the practically achievable upper bound. While further experimentation seems warranted here, it is entirely possible that only small incremental gains (if any) can be achieved given the high percentage of theoretical throughput you have already achieved.

The comment says you are using a 256x256->512 bit multiplication (mul.wide), but the code in UMult is actually just a 256x256->256 multiplication. Which operation is actually being used?

With the sm_35 architecture it was possible to perform a 256 x 256 → 512 bit mul.wide operation in 128 IMAD.U32.U32{.HI} operations, most of which propagated carries through the .X flag. Plus 28 IADDs for carry propagation through the most significant bits. Minimal overhead for some loads and stores. Carry propagation could keep up easily with the multiplier throughput.

Since full 32-bit IMAD hardware was restored with the sm_70 architecture (Volta) it seems from the generated SASS that this operation is no longer natively supported, requiring emulation that uses additional IADDs for carry propagation. The IMAD hardware does not seem to have changed much (if at all) from sm_70 to sm_89.

Assuming your code uses the full 256 x 256 → 512 bit multiply, how many instructions does that resolve to?

Is this what you had in mind Norbert?

Thanks. Yes, I am 99% sure that the code at the link you provided is what I had in mind. It has been such a long time that I have to spend some time to re-acquaint myself with the possible variants.

Thanks njuffa, this is incredibly helpful.

On asm volatile: You’re right — I’m using asm volatile with "memory" clobber on every single arithmetic macro. That’s inherited from the original codebase and I never questioned it. I’ll immediately test removing volatile and the "memory" clobber, and also try consolidating the entire ModMult into a single asm block. If ptxas has internal knowledge about scheduling mad.hi.cc.u64 chains, I’ve been actively preventing it from using that knowledge. This could be the key to unlocking some of that 17.5%.

On 32-bit vs 64-bit: That’s an important point I hadn’t fully considered. Currently my 256×256->512 multiply uses 4×u64 limbs, which means the PTX mul.lo.u64 and mad.hi.cc.u64 instructions are being emulated as multiple 32-bit IMADs by ptxas. If I rewrite using 8×u32 limbs with mad.lo.cc.u32 / madc.hi.cc.u32, ptxas would have direct 1:1 mapping to native IMAD.U32 hardware and full visibility of the dependency graph. The instruction count would increase but ptxas might schedule them better. I’ll prototype this and compare SASS output.

On carry flag vs predicates in recent architectures: This is the most intriguing point. If sm_89 uses predicate registers instead of a dedicated CC flag at SASS level, then ptxas could potentially assign different predicate registers to different carry chains — enabling exactly the ILP I’ve been trying to achieve. But only if I stop blocking it with asm volatile. I’ll dump the SASS and check what the carry mechanism actually looks like on sm_89.

On the actual multiply: Yes, it’s a full 256×256->512 bit multiplication (schoolbook). The decomposition is:

// 4 partial products, each 256×64 → 320 bit:
UMult(r512,     a, b[0]);    // r512[0..4] = a × b[0]
UMult(t,        a, b[1]);    // t[0..4]    = a × b[1]
// accumulate t into r512[1..5] with carry chain
UMult(t,        a, b[2]);    // t[0..4]    = a × b[2]
// accumulate t into r512[2..6] with carry chain
UMult(t,        a, b[3]);    // t[0..4]    = a × b[3]
// accumulate t into r512[3..7] with carry chain

// Then reduce 512→256 using prime-specific identity (one multiply by small constant)

Each UMult is internally: 4× mul.lo.u64 + 3× mad.hi.cc.u64 + 1× madc.hi.u64 = 8 instructions. Four UMults = 32 multiply instructions, plus ~12 addc.cc.u64 for accumulation, plus ~8 for reduction. Total ~52 PTX instructions, which expand to significantly more at SASS level due to 64->32 bit emulation.

FWIW, I am not yet convinced that carry propagation is the problem in this code. I first worked on long-integer arithmetic for cryptography in the early 1990s and seem to revisit the topic about once per decade, so I am not really acquainted with the state of the art at this point. I vaguely recall research out of Ireland (Trinity College?) on optimized long-integer arithmetic for GPUs (for use in cryptography, I think), sometime in the decade 2010 - 2020. Might be worth tracking down.

I am intrigued by the remark about “cheap reduction via prime-specific identity”. Would you happen to have a literature reference at hand for that? From the 1990s I recall the trick of Mohan & Adiga, which requires the most significant half of the bits of the modulus to be all 1s, but I assume that what you are using is something different. I think this is the reference I am thinking of:

S. B. Mohan and B. S. Adiga, “Fast Algorithms for Implementing RSA Public Key Cryptosystem.” Electronic Letters, Vol. 21, No. 7, 1985, p. 761.

Regarding the Irish research — I believe you may be thinking of the work from COSIC at KU Leuven (Belgium) or possibly the SIDH/SIKE implementations from researchers associated with Trinity College Dublin and Microsoft Research. There was significant work on GPU-accelerated modular arithmetic for elliptic curve cryptography in that timeframe. I’ll track down the specific papers.

On the prime-specific reduction: The prime I’m working with has the form p = 2^256 - C where C is a small constant (fits in 33 bits). This means:

2^256 ≡ C (mod p)

So for a 512-bit product [high_256 : low_256], the reduction becomes:

result = low_256 + high_256 * C

Since C is small (33 bits), the multiply high_256 * C produces at most 289 bits. If there’s overflow beyond 256 bits after the addition, we reduce again with another single multiply by C. At most two rounds are needed.

This is different from Mohan & Adiga’s approach. It’s specific to generalized Mersenne primes or pseudo-Mersenne primes. The technique is well described in:

  • NIST FIPS 186-4, Appendix D.2 — “Efficient Arithmetic for Special Primes” (describes reduction for the NIST primes which have similar structure)

  • Guide to Elliptic Curve Cryptography (Hankerson, Menezes, Vanstone), Section 2.2.6 — “Modular reduction for special primes”

  • For this specific prime form, the original observation goes back to Crandall (1992): R. Crandall, “Method and apparatus for public key exchange in a cryptographic system”, U.S. Patent 5,159,632 — which introduced the term “Crandall number” for primes of the form 2^n - c with small c.

The key insight is that the cost of reduction drops from 3 full 256-bit multiplications (Barrett) or 2 (Montgomery) to essentially one multiply by a 33-bit constant, which is a ~8x reduction in multiply operations for the reduction step alone.

Regarding your broader point — if the bottleneck is integer multiply throughput rather than carry stalls, then the only path forward would be reducing the total number of multiplications per kernel iteration (algorithmic change), or finding a decomposition that maps better to the native IMAD hardware on sm_89. I’ll look into the SASS output to check actual IMAD utilization and stall reasons. Would sm__sass_inst_executed_op_integer_pred_on or similar per-pipe metrics help identify the bottleneck?

Quick update on the volatile experiments you suggested:

Test 1 — Removed all volatile and "memory" clobbers from the arithmetic macros: Got a 6% speed increase, but the kernel produces incorrect results. The frontend compiler reorders carry-chain instructions (add.cc/addc.cc) since it doesn’t see the CC dependency in the operand constraints.

Test 2 — Tried "cc" clobber as middle ground: NVCC rejects it — "The cc clobber constraint is not supported in device code".

Test 3 — Single large asm block (without volatile) for the entire 256×256->512 multiply + reduction, using .reg directives internally (following your umul128wide pattern): Compiles, produces correct results, but no speed improvement over the original macro version with volatile.

So it appears that ptxas schedules the SASS identically regardless of whether it receives one large PTX block or many small ones. The 6% gain from removing volatile came from the frontend compiler interleaving carry-dependent instructions with unrelated work — which happened to be correct most of the time but not always.

This suggests that either (a) ptxas is already doing optimal scheduling within the carry-chain constraints, or (b) as you suggested, carry propagation may not actually be the primary bottleneck and integer multiply throughput itself is the limiting factor. I’m going to dig into SASS-level pipe utilization metrics to distinguish between these two possibilities.

Perhaps you could compare the 6% faster version and correct it again by hand?

If could be worth looking into Nsight Compute Warp State to see what is preventing warps from doing more compute. And at which locations in the program.

Are there alternative ways to compute multiplications? Using floating point for simulation. Putting together 8-bit Integer Tensor Core MMAs. Even if the speed is 50x worse, if your warps mix their approach (4 warps using the alternative approach), you could get an overall 8% speed-up.

As I said, you would need to use a single asm block for the multiplication code to get the correct dependencies through the carry flag. In general, carry flag dependencies cannot reliably track between asm blocks. Making that work by use of asm volatile is a hack, IMHO. It’s a “happens to work at this time” rather than a “guaranteed to work by design” scenario.

I took some time to look into IMADs with carry flag on sm_89, and what I observe looks quite bizarre to me at this time. Either the compiler or the hardware are not handling the various flavors in an orthogonal fashion. The use of predicate registers to transport the carry is apparent and should allow the compiler to intersperse code from multiple carry chains, giving the instruction scheduler maximum freedom.

My minimal test bed is like this:

__device__ uint64_t foo (uint64_t a, uint64_t b, uint64_t c)
{
  uint64_t res;
  asm ("{\n\t"
        ".reg .u32 alo, blo, ahi, bhi, clo, chi;\n\t"
        "mov.b64         {alo,ahi}, %1;\n\t"
        "mov.b64         {blo,bhi}, %2;\n\t"
        "mov.b64         {clo,chi}, %3;\n\t"
        "mad.lo.cc.u32   clo, alo, blo, clo;\n\t" /* generate carry */
        "madc.lo.u32     chi, ahi, bhi, chi;\n\t" /* consume carry */
        "mov.b64         %0, {clo,chi};\n\t"
        "}"
        : "=l"(res) : "l"(a), "l"(b), "l"(c));
  return res;
}

My expectation was that as I would toggle each IMAD between the .HI and .LO flavors, the SASS would stay the same except for the hi / lo attributes. What I see instead is this:

(1) .LO, .LO (as shown above)

foo(unsigned long, unsigned long, unsigned long):
 IMAD R3, R4, R6, RZ 
 IADD3 R4, P0, R3, R8, RZ 
 IMAD.X R5, R5, R7, R9, P0 
 RET.ABS.NODEC R20 0x0 

(2) .LO, .HI

foo(unsigned long, unsigned long, unsigned long):
 IMAD R3, R4, R6, RZ 
 IMAD.HI.U32 R0, R5, R7, RZ 
 IADD3 R4, P0, R3, R8, RZ 
 IADD3.X R5, R0, R9, RZ, P0, !PT 
 RET.ABS.NODEC R20 0x0 

(3) .HI, .LO

foo(unsigned long, unsigned long, unsigned long):
 MOV R10, RZ 
 MOV R11, R8 
 IMAD.HI.U32 R4, P0, R4, R6, R10 
 IMAD.X R5, R5, R7, R9, P0 
 RET.ABS.NODEC R20 0x0 

(4) .HI,.HI

foo(unsigned long, unsigned long, unsigned long):
 MOV R10, RZ 
 IMAD.HI.U32 R0, R5, R7, RZ 
 MOV R11, R8 
 IMAD.HI.U32 R4, P0, R4, R6, R10 
 IADD3.X R5, R0, R9, RZ, P0, !PT 
 RET.ABS.NODEC R20 0x0 

It is obvious that IADD3 is used as a helper to generate or consume a carry, but the fact that this happens in different ways for each of the four combinations seems very odd to me. Sometimes the carry information is generated by the IMAD itself, sometimes the IMAD is split into an IMAD demoted to effective IMUL and the carry generated via separate IADD3.

This non-orthogonality makes devising an optimal decomposition of a multi-word multiplication into IMADs very challenging. Obviously we would want mostly IMAD instructions and as few additional IADD3 helper instructions as possible.

This is a fantastic find. The non-orthogonality is really surprising — I would never have expected that the carry generation/consumption mechanism differs depending on the .LO/.HI combination.

From your four test cases, it looks like the .HI.LO path (case 3) is the most efficient for a carry-propagating IMAD chain: IMAD.HI.U32 directly generates the predicate carry (P0), and IMAD.X directly consumes it — only 2 actual compute instructions with no IADD3 helpers at all (the 2 MOVs are just data setup, not carry-related).

In contrast, the .LO.LO path (case 1) needs an extra IADD3 just to extract the carry from the low multiply.

My current 256×256->512 multiply uses PTX mul.lo.u64 / mul.hi.u64 / add.cc.u64 / addc.cc.u64, which ptxas decomposes into 32-bit IMADs. I have no control over which .LO/.HI combinations it chooses. If there’s an optimal decomposition that minimizes the IADD3 overhead, writing the multiply directly in 32-bit PTX (using mad.hi.cc.u32 / madc.lo.u32 etc.) would let me control the exact sequence.

Would you be willing to share the optimal 128×128->256 or 256×64->320 decomposition you find? That would directly map to my UMult building block which computes a[0..3] × b_scalar → r[0..4] (256×64->320 bit). That’s the inner kernel — everything else is accumulation.

Also very interesting that the carry is transported via predicate registers (P0) at SASS level. That confirms your earlier comment about carry flags being replaced by predicates on recent architectures. If ptxas uses different predicates for different carry chains, ILP across chains might actually be possible at SASS level — even though it’s not expressible in PTX.

You’re absolutely right, and I should have mentioned — I did implement the single asm block approach based on your earlier advice. I converted the entire 256×256->512 multiply + reduction into one asm block (without volatile), using .reg directives for internal temporaries, following the pattern from your umul128wide code.

The good news: it compiles and produces correct results. The bad news: no measurable performance difference versus the original multi-block asm volatile approach.

I also converted all three ModSub256 variants into single asm blocks without volatile — same result: correct output, identical performance.

This seems to confirm that ptxas is already scheduling the carry chains optimally regardless of how they arrive — whether as one big block or many small volatile blocks. The scheduling freedom doesn’t help because the carry dependencies are real and unavoidable.

Your IMAD .LO/.HI orthogonality investigation seems much more promising — if there’s a decomposition that produces fewer IADD3 helpers, that would directly reduce instruction count rather than trying to schedule around dependencies that can’t be hidden.

My working hypothesis for now is that the performance of multi-word integer multiplication on modern GPUs is limited by dynamic instruction count.

Out of the partial products, clearly half are going to be produced by IMAD.LO operation, with the other half produced by IMAD.HI operations. While we are free to choose in which order these partial products are summed into the dot product that generates each word of the result (any such choice may impact register pressure, though), it is not readily apparent that one can reduce overall IADD3 overhead, as a favorable transition at one point would seem to force an unfavorable one elsewhere in the sequence.

At present I am trying to come up with a mental model of hardware multiplier operation that would explain my observations and would help inform an optimal coding strategy. Is what Imam seeing likely due to deficiencies in compiler code generation, or is the use of “helper instructions” inevitable due to some non-orthogonality of the hardware?

I am always happy to share code if and when I can get it into working shape. I tend to have multiple projects going on at any given time, in various stages of completion, with myriad code sketches and snippets piling up from experimenting with or thinking through various concepts, etc. In other words, my virtual workbench is quite messy, and the status of each project usually not well documented.

That’s a really insightful hypothesis. If the bottleneck is dynamic instruction count rather than scheduling/latency, then the optimization target shifts from “hiding carry stalls” to “minimizing total SASS instructions per multiply.”

For context, here’s what my current 256×64->320 building block (UMult) looks like in PTX — this is the inner kernel that gets called 4 times per full 256×256 multiply:

mul.lo.u64   r0, a0, b      // 1
mul.hi.u64   c,  a0, b      // 2
mul.lo.u64   r1, a1, b      // 3
add.cc.u64   r1, r1, c      // 4
mul.hi.u64   c,  a1, b      // 5
mul.lo.u64   r2, a2, b      // 6
addc.cc.u64  r2, r2, c      // 7
mul.hi.u64   c,  a2, b      // 8
mul.lo.u64   r3, a3, b      // 9
addc.cc.u64  r3, r3, c      // 10
mul.hi.u64   r4, a3, b      // 11
addc.u64     r4, r4, 0      // 12

That’s 12 PTX instructions per UMult, but each 64-bit mul/mad expands to multiple 32-bit IMADs at SASS level. So 4× UMult = 48 PTX instructions for the schoolbook multiply alone, expanding to potentially 100+ SASS instructions depending on how many IADD3 helpers get inserted.

If you do find a decomposition pattern that reduces the IADD3 overhead — even by a few instructions per UMult — I’d be very interested to test it. At ~2,100 ModMult calls per kernel invocation, even saving 4 SASS instructions per multiply would eliminate ~8,400 instructions per thread. At 4M threads that’s a meaningful reduction.

No rush at all — I know how the messy workbench goes. Really appreciate you digging into this.

@njuffa

I ran a direct SASS comparison between 64-bit PTX and manual 32-bit PTX for a 256×64->320 multiply building block on my RTX 4090:

Version A — Current code (64-bit PTX: mul.lo.u64 / mul.hi.u64 / add.cc.u64): 38 SASS compute instructions (lots of IADD3/IADD3.X helpers, MOVs)

Version B — Manual 32-bit PTX (mad.lo.cc.u32 / madc.hi.u32 pattern): 15 SASS compute instructions (clean IMAD.WIDE.U32 / IMAD chains, zero IADD3 overhead)

That’s a 60% reduction in dynamic instruction count for the same operation. The 32-bit PTX maps almost 1:1 to native IMAD hardware, while the 64-bit PTX generates massive decomposition overhead.

This strongly supports your hypothesis that dynamic instruction count is the bottleneck, not carry-chain latency. If the full 256×256->512 multiply + reduction can be rewritten in 32-bit PTX with the same efficiency ratio, that could be a significant throughput improvement.

I attempted a full 32-bit rewrite of the complete ModMult but got incorrect results — the carry propagation in the reduction phase has a subtle bug I haven’t tracked down yet. Would appreciate your guidance on the optimal decomposition when you have time.

I ran four additional SASS tests on sm_89
Here are the results that might help your analysis:

Test 1 — Clean 128×32 carry chain (mad.lo.cc.u32 / madc.hi.u32, 4 limbs × 1 scalar): 8 SASS compute instructions: 4× IMAD.WIDE.U32, 2× IMAD.MOV, 1× IADD3 P0, 1× IADD3.X. The mad.lo.cc/madc.hi pairs fuse cleanly into IMAD.WIDE.U32. Only the final accumulation needs IADD3 helpers.

Test 2 — Two independent carry chains interleaved in PTX (chain A: a[0..3]*s, chain B: b[0..3]*t, written with alternating A/B instructions): 16 SASS compute instructions. ptxas did NOT interleave the two chains — it serialized them with many extra MOV instructions for data shuffling. No evidence of different predicate registers being used for separate carry chains. Both chains use IMAD.WIDE.U32 without any predicate-based carry at all — the carry propagation happens implicitly through the IMAD.WIDE accumulator.

Test 3 — 256-bit addition using 32-bit PTX (add.cc.u32 / addc.cc.u32 / addc.u32, 8 limbs): 4 SASS compute instructions: IADD3 P0, IADD3.X P0, IADD3.X P0, IADD3.X.

Test 4 — 256-bit addition using 64-bit PTX (add.cc.u64 / addc.u64, 4 limbs): 4 SASS compute instructions: identical to Test 3.

Key observations:

  1. For addition, 32-bit vs 64-bit PTX produces identical SASS. The instruction count advantage of 32-bit PTX is multiplication-specific.

  2. ptxas does not exploit ILP across independent carry chains on sm_89, even when explicitly written as interleaved in PTX. It serializes them.

  3. The IMAD.WIDE.U32 instruction appears to handle carry propagation internally without predicates in the multiply case — predicates (P0) only appear in IADD3 helpers and IMAD.HI variants.

Combined with my earlier finding (UMult: 38 SASS instructions with 64-bit PTX vs 15 with 32-bit PTX), the data strongly suggests that the win from 32-bit PTX comes purely from avoiding the 64->32 decomposition overhead in ptxas, not from any ILP or carry-chain optimization.

sass_sm89_test.txt (4.5 KB)

I have to correct myself on my earlier recommendation to use IMAD.{HI | LO}.U32 as the basic hardware primitive for multi-word multiplies .

With sm_70 the IMAD.WIDE instruction was introduced, and it has been carried forward from then until now at sm_120. The compiler tries to use this aggressively for wider multiplies. This instruction computes the full product of two 32-bit operands, adds the result to a 64-bit operand (taken from an aligned register pair), and returns a 64-bit result (to an aligned register pair). So clearly this is the intended hardware primitive for wide multiplies for CC >= 7.0.

My best guess is that IMAD.{LO|HI}.U32 are now treated in the hardware as specializations of this operation, in that they return the low half or high half of the IMAD.WIDE result. This would also explain the trouble when trying to chain IMAD.U32 operations with carries: only the IMAD.LO.U32 can receive a carry, and only the IMAD.HI.U32 can produce a carry. Other flavors involving carry propagation need to be emulated with an additional IADD3. This mental model perfectly jibes with the four test cases I looked at earlier.

Best I can tell, IMAD.WIDE cannot be expressed directly at PTX level (at least I could find no matching PTX instruction in the latest manual from April 2026). It can be expressed as a MUL.WIDE followed by a 64-bit addition, and then relying on ptxas to merge that into the desired IMAD.WIDE. It is somewhat unwieldy to use as a building block for wide multiplies: I find myself mapping back and forth between 64-bit operands and pairs of 32-bit registers with mov.b64 to affect the addition of the overlapping columns partial results, making the code hard to follow. I will have to think a bit more about the most advantageous idioms one could use.

The use of IMAD.WIDE can reduce explicit carry propagation by up to 50%, with half of the propagation being absobed into the 64-bit addition that is part of the operation.

I wonder if one can somehow make use of the `__int128_t` datatype (2x 128bit instead of 4x 64bit), or at least get inspired by the instruction sequences used to add and multiply `__int128_t`.

For example, this kernel

__global__
void foo(const __int128_t* a, const __int128_t* b, __int128_t* c){
    c[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
}

with cuda 12.4.1, sm_89 compiles to

foo(__int128 const*, __int128 const*, __int128*):
 MOV R1, c[0x0][0x28] 
 S2R R0, SR_TID.X 
 MOV R17, 0x10 
 ULDC.64 UR4, c[0x0][0x118] 
 IMAD.WIDE.U32 R4, R0, R17, c[0x0][0x160] 
 IMAD.WIDE.U32 R8, R0, R17, c[0x0][0x168] 
 LDG.E.128 R4, [R4.64] 
 LDG.E.128 R8, [R8.64] 
 IMAD.WIDE.U32 R2, R9, R4, RZ 
 IMAD.WIDE.U32 R12, P0, R5, R8, R2 
 IMAD.WIDE.U32 R2, R8, R4, RZ 
 IADD3.X R15, RZ, RZ, RZ, P0, !PT 
 MOV R14, R13 
 IADD3 R19, P0, R3, R12, RZ 
 IMAD R3, R9, R6, RZ 
 IMAD.WIDE.U32.X R12, R5, R9, R14, P0 
 MOV R9, R19 
 IMAD R3, R7, R8.reuse, R3 
 IMAD.WIDE.U32 R12, R6, R8, R12 
 MOV R8, R2 
 IMAD R7, R11, R4, RZ 
 IADD3 R13, R13, R3, RZ 
 IMAD R7, R5, R10.reuse, R7 
 IMAD.WIDE.U32 R10, R4, R10, R12 
 IMAD.WIDE.U32 R4, R0, R17, c[0x0][0x170] 
 IADD3 R11, R11, R7, RZ 
 STG.E.128 [R4.64], R8 
 EXIT
{
	ld.param.u64 	%rd1, [foo(__int128 const*, __int128 const*, __int128*)_param_0];
	ld.param.u64 	%rd2, [foo(__int128 const*, __int128 const*, __int128*)_param_1];
	ld.param.u64 	%rd3, [foo(__int128 const*, __int128 const*, __int128*)_param_2];
	cvta.to.global.u64 	%rd4, %rd3;
	cvta.to.global.u64 	%rd5, %rd2;
	cvta.to.global.u64 	%rd6, %rd1;
	mov.u32 	%r1, %tid.x;
	mul.wide.u32 	%rd7, %r1, 16;
	add.s64 	%rd8, %rd6, %rd7;
	ld.global.v2.u64 	{%rd9, %rd10}, [%rd8];
	add.s64 	%rd13, %rd5, %rd7;
	ld.global.v2.u64 	{%rd14, %rd15}, [%rd13];
	mul.lo.s64 	%rd18, %rd14, %rd10;
	mul.hi.u64 	%rd19, %rd14, %rd9;
	add.s64 	%rd20, %rd19, %rd18;
	mul.lo.s64 	%rd21, %rd15, %rd9;
	add.s64 	%rd22, %rd4, %rd7;
	add.s64 	%rd23, %rd20, %rd21;
	mul.lo.s64 	%rd24, %rd14, %rd9;
	st.global.v2.u64 	[%rd22], {%rd24, %rd23};
	ret;
}

Comparing with what NVIDIA has crafted to implement unsigned __int128 arithmetic is certainly going to be valuable at some point.

My preferred approach to solving technical issues is first to try as hard as I can on my own to find a good a solution until I am reasonably sure that I cannot improve it further for the time being.

Only then do I compare with other people’s work, with the three potential outcomes (1) appears similar or identical to what I came up with (2) looks / runs worse than what I came up with (3) wow! these people know / realized something I overlooked, let me study their solution so I can understand what I missed so I can learn the technique or soak up the additional information.

I use this approach so as to not unduly restrict my mind to wandering the paths that others have trodden before me, following the conventional wisdom.

Thank you both @njuffa @striker159 for these clarifications

The IMAD.WIDE primitive being the actual hardware target makes perfect sense and explains why my manual mad.lo.cc.u32 / madc.hi.u32 approach didn’t deliver the expected speedup.

When I integrated it into the real kernel, throughput actually dropped by about 4% compared to the original the explicit carry chains likely prevented ptxas from coalescing into IMAD.WIDE.

Im going to follow your suggested aproach
rather than write 32-bit PTX with explicit carries, write the multiply at a higher level using mul.wide.u32 + 64-bit addition, and trust ptxas to coalesce into IMAD.WIDE.

A few specific questions before I attempt the rewrite…

  1. Should the accumulator be u64 or kept as paired u32 registers?
    The NVIDIA-generated code uses 64-bit operands throughout.
    Does ptxas handle the column accumulation better when operands are presented as 64-bit?

  2. For my reduction step (multiply by a small constant that fits in ~33 bits):
    would you express this as a sequence of mul.wide.u32 limb[i], constant + accumulation in 64-bit, or is there a better idiom for “multiply a wide integer by a constant smaller than 64 bits”?

  3. Final carry-out of the 256-bit result:
    even with IMAD.WIDE absorbing most carries, the final accumulation across the limbs eventually needs some explicit carry propagation.
    Is there a recommended way to minimize this, or does it not matter much in practice?

Id rather learn the right idiom from you than continue trial and error with PTX patterns that look correct but get rejected by ptxas optimizer.


Test 1 mul.wide.u32 + add.u64 in a single asm block: Does ptxas coalesce the explicit mul.wide followed by 64-bit add into a single IMAD.WIDE.U32? This is the most direct expression of the operation in PTX.

Test 2 Direct mad.wide.u32 PTX: Does this PTX instruction even exist in the toolkit? If yes, what does it compile to? If no, we get a compile error and confirm your statement that there’s no direct PTX for IMAD.WIDE.

Test 3 Pure C++ (uint64_t)a * b + c: Without any inline asm, does the C++ compiler recognize this idiom and emit IMAD.WIDE? This would be the cleanest way to use IMAD.WIDE if it works.

Test 4 256×32->288 multiply chain using mul.wide.u32 + manual accumulation: A real-world pattern showing how a wide multiply with explicit limb shuffling looks. Tests whether the move-back-and-forth between u64 and paired u32 registers (which you mentioned makes the code unwieldy) actually costs SASS instructions or gets optimized away.

Test 5 64×64->128 multiply via four mul.wide.u32: The classical decomposition of a 64-bit multiply into four 32-bit wide multiplies plus accumulation. Compares directly with the simpler mul.lo.u64/mul.hi.u64 approach to see which produces less SASS.

Test 6 Four independent (uint64_t)a*b + c operations in C++: Tests whether ptxas schedules independent IMAD.WIDE operations in parallel (instruction-level parallelism). If the four operations are truly independent at SASS level, they should be issued without serialization.

Test 7 Multiply chain with carry propagation via (p >> 32): This is the pattern I think you’re describing — instead of explicit add.cc/addc.cc, propagate carry by adding the upper 32 bits of one product into the next. If ptxas recognizes this idiom, it should generate a clean chain of IMAD.WIDE with the upper half flowing naturally into the next multiply’s accumulator.

I’ll post the full SASS output once compilation completes. Test 7 is the one I’m most curious about — if pure C++ multiply-add chain becomes a tight IMAD.WIDE sequence at SASS level, that’s the idiom we should use throughout.

code for sm_89
	Function : _Z22test7_chain_with_carryPyPjS0_
.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                                   /* 0x00000a0000017a02 */
                                                                                        /* 0x000fe40000000f00 */
    /*0010*/                   S2R R10, SR_TID.X ;                                      /* 0x00000000000a7919 */
                                                                                        /* 0x000e220000002100 */
    /*0020*/                   MOV R3, 0x4 ;                                            /* 0x0000000400037802 */
                                                                                        /* 0x000fe20000000f00 */
    /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                             /* 0x0000460000047ab9 */
                                                                                        /* 0x000fe20000000a00 */
    /*0040*/                   SHF.L.U32 R2, R10, 0x2, RZ ;                             /* 0x000000020a027819 */
                                                                                        /* 0x001fc600000006ff */
    /*0050*/                   IMAD.WIDE.U32 R4, R10, R3, c[0x0][0x170] ;               /* 0x00005c000a047625 */
                                                                                        /* 0x000fc800078e0003 */
    /*0060*/                   IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x168] ;                /* 0x00005a0002027625 */
                                                                                        /* 0x000fe400078e0003 */
    /*0070*/                   LDG.E R5, [R4.64] ;                                      /* 0x0000000404057981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*0080*/                   LDG.E R6, [R2.64] ;                                      /* 0x0000000402067981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*0090*/                   LDG.E R11, [R2.64+0x4] ;                                 /* 0x00000404020b7981 */
                                                                                        /* 0x000ee8000c1e1900 */
    /*00a0*/                   LDG.E R0, [R2.64+0xc] ;                                  /* 0x00000c0402007981 */
                                                                                        /* 0x000f28000c1e1900 */
    /*00b0*/                   LDG.E R13, [R2.64+0x8] ;                                 /* 0x00000804020d7981 */
                                                                                        /* 0x000f62000c1e1900 */
    /*00c0*/                   IMAD.MOV.U32 R9, RZ, RZ, RZ ;                            /* 0x000000ffff097224 */
                                                                                        /* 0x000fe200078e00ff */
    /*00d0*/                   MOV R12, 0x8 ;                                           /* 0x00000008000c7802 */
                                                                                        /* 0x000fe20000000f00 */
    /*00e0*/                   IMAD.WIDE.U32 R6, R5, R6, RZ ;                           /* 0x0000000605067225 */
                                                                                        /* 0x004fca00078e00ff */
    /*00f0*/                   MOV R8, R7 ;                                             /* 0x0000000700087202 */
                                                                                        /* 0x000fe40000000f00 */
    /*0100*/                   SHF.L.U32 R7, R10, 0x1, RZ ;                             /* 0x000000010a077819 */
                                                                                        /* 0x000fe200000006ff */
    /*0110*/                   IMAD R15, R0, R5.reuse, RZ ;                             /* 0x00000005000f7224 */
                                                                                        /* 0x090fe400078e02ff */
    /*0120*/                   IMAD.WIDE.U32 R8, R11, R5, R8 ;                          /* 0x000000050b087225 */
                                                                                        /* 0x008fe200078e0008 */
    /*0130*/                   MOV R11, RZ ;                                            /* 0x000000ff000b7202 */
                                                                                        /* 0x000fc60000000f00 */
    /*0140*/                   IMAD.MOV.U32 R10, RZ, RZ, R9 ;                           /* 0x000000ffff0a7224 */
                                                                                        /* 0x000fc800078e0009 */
    /*0150*/                   IMAD.WIDE.U32 R10, R13, R5, R10 ;                        /* 0x000000050d0a7225 */
                                                                                        /* 0x020fc800078e000a */
    /*0160*/                   IMAD.WIDE.U32 R4, R7, R12, c[0x0][0x160] ;               /* 0x0000580007047625 */
                                                                                        /* 0x000fe200078e000c */
    /*0170*/                   IADD3 R11, R11, R15, RZ ;                                /* 0x0000000f0b0b7210 */
                                                                                        /* 0x000fe40007ffe0ff */
    /*0180*/                   MOV R7, R8 ;                                             /* 0x0000000800077202 */
                                                                                        /* 0x000fc60000000f00 */
    /*0190*/                   STG.E.64 [R4.64+0x8], R10 ;                              /* 0x0000080a04007986 */
                                                                                        /* 0x000fe8000c101b04 */
    /*01a0*/                   STG.E.64 [R4.64], R6 ;                                   /* 0x0000000604007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*01b0*/                   EXIT ;                                                   /* 0x000000000000794d */
                                                                                        /* 0x000fea0003800000 */
    /*01c0*/                   BRA 0x1c0;                                               /* 0xfffffff000007947 */
                                                                                        /* 0x000fc0000383ffff */
    /*01d0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01e0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01f0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0200*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0210*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0220*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0230*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0240*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0250*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0260*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0270*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
	..........


	Function : _Z24test6_parallel_imad_widePyPjS0_S_
.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                                   /* 0x00000a0000017a02 */
                                                                                        /* 0x000fe40000000f00 */
    /*0010*/                   S2R R16, SR_TID.X ;                                      /* 0x0000000000107919 */
                                                                                        /* 0x000e220000002100 */
    /*0020*/                   MOV R5, 0x4 ;                                            /* 0x0000000400057802 */
                                                                                        /* 0x000fe20000000f00 */
    /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                             /* 0x0000460000047ab9 */
                                                                                        /* 0x000fe20000000a00 */
    /*0040*/                   MOV R25, 0x8 ;                                           /* 0x0000000800197802 */
                                                                                        /* 0x000fe40000000f00 */
    /*0050*/                   SHF.L.U32 R16, R16, 0x2, RZ ;                            /* 0x0000000210107819 */
                                                                                        /* 0x001fca00000006ff */
    /*0060*/                   IMAD.WIDE.U32 R2, R16, R5, c[0x0][0x168] ;               /* 0x00005a0010027625 */
                                                                                        /* 0x000fc800078e0005 */
    /*0070*/                   IMAD.WIDE.U32 R4, R16.reuse, R5, c[0x0][0x170] ;         /* 0x00005c0010047625 */
                                                                                        /* 0x040fe200078e0005 */
    /*0080*/                   LDG.E R17, [R2.64] ;                                     /* 0x0000000402117981 */
                                                                                        /* 0x000ea6000c1e1900 */
    /*0090*/                   IMAD.WIDE.U32 R6, R16, R25, c[0x0][0x178] ;              /* 0x00005e0010067625 */
                                                                                        /* 0x000fe200078e0019 */
    /*00a0*/                   LDG.E R0, [R4.64] ;                                      /* 0x0000000404007981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*00b0*/                   LDG.E.64 R8, [R6.64] ;                                   /* 0x0000000406087981 */
                                                                                        /* 0x000ea8000c1e1b00 */
    /*00c0*/                   LDG.E R19, [R2.64+0x4] ;                                 /* 0x0000040402137981 */
                                                                                        /* 0x000ee8000c1e1900 */
    /*00d0*/                   LDG.E R21, [R2.64+0x8] ;                                 /* 0x0000080402157981 */
                                                                                        /* 0x000f28000c1e1900 */
    /*00e0*/                   LDG.E R23, [R2.64+0xc] ;                                 /* 0x00000c0402177981 */
                                                                                        /* 0x000f68000c1e1900 */
    /*00f0*/                   LDG.E R18, [R4.64+0x4] ;                                 /* 0x0000040404127981 */
                                                                                        /* 0x000ee8000c1e1900 */
    /*0100*/                   LDG.E R20, [R4.64+0x8] ;                                 /* 0x0000080404147981 */
                                                                                        /* 0x000f28000c1e1900 */
    /*0110*/                   LDG.E R22, [R4.64+0xc] ;                                 /* 0x00000c0404167981 */
                                                                                        /* 0x000f68000c1e1900 */
    /*0120*/                   LDG.E.64 R10, [R6.64+0x8] ;                              /* 0x00000804060a7981 */
                                                                                        /* 0x000ee8000c1e1b00 */
    /*0130*/                   LDG.E.64 R12, [R6.64+0x10] ;                             /* 0x00001004060c7981 */
                                                                                        /* 0x000f28000c1e1b00 */
    /*0140*/                   LDG.E.64 R14, [R6.64+0x18] ;                             /* 0x00001804060e7981 */
                                                                                        /* 0x000f62000c1e1b00 */
    /*0150*/                   IMAD.WIDE.U32 R8, R17, R0, R8 ;                          /* 0x0000000011087225 */
                                                                                        /* 0x004fc800078e0008 */
    /*0160*/                   IMAD.WIDE.U32 R16, R16, R25, c[0x0][0x160] ;             /* 0x0000580010107625 */
                                                                                        /* 0x000fca00078e0019 */
    /*0170*/                   STG.E.64 [R16.64], R8 ;                                  /* 0x0000000810007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*0180*/                   IMAD.WIDE.U32 R10, R19, R18, R10 ;                       /* 0x00000012130a7225 */
                                                                                        /* 0x008fc800078e000a */
    /*0190*/                   IMAD.WIDE.U32 R12, R21, R20, R12 ;                       /* 0x00000014150c7225 */
                                                                                        /* 0x010fc800078e000c */
    /*01a0*/                   IMAD.WIDE.U32 R14, R23, R22, R14 ;                       /* 0x00000016170e7225 */
                                                                                        /* 0x020fe200078e000e */
    /*01b0*/                   STG.E.64 [R16.64+0x8], R10 ;                             /* 0x0000080a10007986 */
                                                                                        /* 0x000fe8000c101b04 */
    /*01c0*/                   STG.E.64 [R16.64+0x10], R12 ;                            /* 0x0000100c10007986 */
                                                                                        /* 0x000fe8000c101b04 */
    /*01d0*/                   STG.E.64 [R16.64+0x18], R14 ;                            /* 0x0000180e10007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*01e0*/                   EXIT ;                                                   /* 0x000000000000794d */
                                                                                        /* 0x000fea0003800000 */
    /*01f0*/                   BRA 0x1f0;                                               /* 0xfffffff000007947 */
                                                                                        /* 0x000fc0000383ffff */
    /*0200*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0210*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0220*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0230*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0240*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0250*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0260*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0270*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
	..........


	Function : _Z20test5_64bit_via_widePyS_S_
.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                                   /* 0x00000a0000017a02 */
                                                                                        /* 0x000fe40000000f00 */
    /*0010*/                   S2R R12, SR_TID.X ;                                      /* 0x00000000000c7919 */
                                                                                        /* 0x000e220000002100 */
    /*0020*/                   MOV R13, 0x8 ;                                           /* 0x00000008000d7802 */
                                                                                        /* 0x000fe20000000f00 */
    /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                             /* 0x0000460000047ab9 */
                                                                                        /* 0x000fc80000000a00 */
    /*0040*/                   IMAD.WIDE.U32 R2, R12, R13, c[0x0][0x168] ;              /* 0x00005a000c027625 */
                                                                                        /* 0x001fc800078e000d */
    /*0050*/                   IMAD.WIDE.U32 R4, R12.reuse, R13, c[0x0][0x170] ;        /* 0x00005c000c047625 */
                                                                                        /* 0x040fe400078e000d */
    /*0060*/                   LDG.E.64 R2, [R2.64] ;                                   /* 0x0000000402027981 */
                                                                                        /* 0x000ea8000c1e1b00 */
    /*0070*/                   LDG.E.64 R4, [R4.64] ;                                   /* 0x0000000404047981 */
                                                                                        /* 0x000ea2000c1e1b00 */
    /*0080*/                   SHF.L.U32 R12, R12, 0x1, RZ ;                            /* 0x000000010c0c7819 */
                                                                                        /* 0x000fca00000006ff */
    /*0090*/                   IMAD.WIDE.U32 R12, R12, R13, c[0x0][0x160] ;             /* 0x000058000c0c7625 */
                                                                                        /* 0x000fc800078e000d */
    /*00a0*/                   IMAD.WIDE.U32 R8, R3, R4, RZ ;                           /* 0x0000000403087225 */
                                                                                        /* 0x004fc800078e00ff */
    /*00b0*/                   IMAD.WIDE.U32 R6, R2, R4, RZ ;                           /* 0x0000000402067225 */
                                                                                        /* 0x000fc800078e00ff */
    /*00c0*/                   IMAD.WIDE.U32 R10, R2, R5, R8 ;                          /* 0x00000005020a7225 */
                                                                                        /* 0x000fc800078e0008 */
    /*00d0*/                   IMAD.WIDE.U32 R8, R3, R5, RZ ;                           /* 0x0000000503087225 */
                                                                                        /* 0x000fe200078e00ff */
    /*00e0*/                   IADD3 R7, P0, R7, R10, RZ ;                              /* 0x0000000a07077210 */
                                                                                        /* 0x000fc80007f1e0ff */
    /*00f0*/                   IADD3.X R10, P0, R8, R11, RZ, P0, !PT ;                  /* 0x0000000b080a7210 */
                                                                                        /* 0x000fe2000071e4ff */
    /*0100*/                   STG.E.64 [R12.64], R6 ;                                  /* 0x000000060c007986 */
                                                                                        /* 0x000fe6000c101b04 */
    /*0110*/                   IADD3.X R11, RZ, R9, RZ, P0, !PT ;                       /* 0x00000009ff0b7210 */
                                                                                        /* 0x000fca00007fe4ff */
    /*0120*/                   STG.E.64 [R12.64+0x8], R10 ;                             /* 0x0000080a0c007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*0130*/                   EXIT ;                                                   /* 0x000000000000794d */
                                                                                        /* 0x000fea0003800000 */
    /*0140*/                   BRA 0x140;                                               /* 0xfffffff000007947 */
                                                                                        /* 0x000fc0000383ffff */
    /*0150*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0160*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0170*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0180*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0190*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01a0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01b0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01c0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01d0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01e0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01f0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
	..........


	Function : _Z18test4_chain_4_widePyPjS0_
.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                                   /* 0x00000a0000017a02 */
                                                                                        /* 0x000fe40000000f00 */
    /*0010*/                   S2R R10, SR_TID.X ;                                      /* 0x00000000000a7919 */
                                                                                        /* 0x000e220000002100 */
    /*0020*/                   MOV R3, 0x4 ;                                            /* 0x0000000400037802 */
                                                                                        /* 0x000fe20000000f00 */
    /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                             /* 0x0000460000047ab9 */
                                                                                        /* 0x000fe20000000a00 */
    /*0040*/                   SHF.L.U32 R2, R10, 0x2, RZ ;                             /* 0x000000020a027819 */
                                                                                        /* 0x001fc600000006ff */
    /*0050*/                   IMAD.WIDE.U32 R4, R10, R3, c[0x0][0x170] ;               /* 0x00005c000a047625 */
                                                                                        /* 0x000fc800078e0003 */
    /*0060*/                   IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x168] ;                /* 0x00005a0002027625 */
                                                                                        /* 0x000fe400078e0003 */
    /*0070*/                   LDG.E R5, [R4.64] ;                                      /* 0x0000000404057981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*0080*/                   LDG.E R6, [R2.64] ;                                      /* 0x0000000402067981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*0090*/                   LDG.E R8, [R2.64+0x4] ;                                  /* 0x0000040402087981 */
                                                                                        /* 0x000ee8000c1e1900 */
    /*00a0*/                   LDG.E R13, [R2.64+0x8] ;                                 /* 0x00000804020d7981 */
                                                                                        /* 0x000f22000c1e1900 */
    /*00b0*/                   MOV R11, 0x8 ;                                           /* 0x00000008000b7802 */
                                                                                        /* 0x000fc40000000f00 */
    /*00c0*/                   SHF.L.U32 R10, R10, 0x1, RZ ;                            /* 0x000000010a0a7819 */
                                                                                        /* 0x000fca00000006ff */
    /*00d0*/                   IMAD.WIDE.U32 R10, R10, R11, c[0x0][0x160] ;             /* 0x000058000a0a7625 */
                                                                                        /* 0x000fc800078e000b */
    /*00e0*/                   IMAD.WIDE.U32 R6, R6, R5, RZ ;                           /* 0x0000000506067225 */
                                                                                        /* 0x004fc800078e00ff */
    /*00f0*/                   IMAD.WIDE.U32 R8, R8, R5, RZ ;                           /* 0x0000000508087225 */
                                                                                        /* 0x008fca00078e00ff */
    /*0100*/                   IADD3 R7, P0, R7, R8, RZ ;                               /* 0x0000000807077210 */
                                                                                        /* 0x000fc80007f1e0ff */
    /*0110*/                   IADD3.X R8, RZ, R9, RZ, P0, !PT ;                        /* 0x00000009ff087210 */
                                                                                        /* 0x000fe200007fe4ff */
    /*0120*/                   STG.E.64 [R10.64], R6 ;                                  /* 0x000000060a007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*0130*/                   MOV R9, RZ ;                                             /* 0x000000ff00097202 */
                                                                                        /* 0x000fca0000000f00 */
    /*0140*/                   IMAD.WIDE.U32 R8, R13, R5, R8 ;                          /* 0x000000050d087225 */
                                                                                        /* 0x010fca00078e0008 */
    /*0150*/                   STG.E.64 [R10.64+0x8], R8 ;                              /* 0x000008080a007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*0160*/                   EXIT ;                                                   /* 0x000000000000794d */
                                                                                        /* 0x000fea0003800000 */
    /*0170*/                   BRA 0x170;                                               /* 0xfffffff000007947 */
                                                                                        /* 0x000fc0000383ffff */
    /*0180*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0190*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01a0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01b0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01c0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01d0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01e0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01f0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
	..........


	Function : _Z16test3_cpp_nativePyPjS0_S_
.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                                   /* 0x00000a0000017a02 */
                                                                                        /* 0x000fe40000000f00 */
    /*0010*/                   S2R R10, SR_TID.X ;                                      /* 0x00000000000a7919 */
                                                                                        /* 0x000e220000002100 */
    /*0020*/                   MOV R5, 0x4 ;                                            /* 0x0000000400057802 */
                                                                                        /* 0x000fe20000000f00 */
    /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                             /* 0x0000460000047ab9 */
                                                                                        /* 0x000fe20000000a00 */
    /*0040*/                   MOV R11, 0x8 ;                                           /* 0x00000008000b7802 */
                                                                                        /* 0x000fc60000000f00 */
    /*0050*/                   IMAD.WIDE.U32 R2, R10, R5, c[0x0][0x168] ;               /* 0x00005a000a027625 */
                                                                                        /* 0x001fc800078e0005 */
    /*0060*/                   IMAD.WIDE.U32 R4, R10.reuse, R5, c[0x0][0x170] ;         /* 0x00005c000a047625 */
                                                                                        /* 0x040fe400078e0005 */
    /*0070*/                   LDG.E R3, [R2.64] ;                                      /* 0x0000000402037981 */
                                                                                        /* 0x000ea4000c1e1900 */
    /*0080*/                   IMAD.WIDE.U32 R6, R10.reuse, R11.reuse, c[0x0][0x178] ;  /* 0x00005e000a067625 */
                                                                                        /* 0x0c0fe400078e000b */
    /*0090*/                   LDG.E R4, [R4.64] ;                                      /* 0x0000000404047981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*00a0*/                   LDG.E.64 R6, [R6.64] ;                                   /* 0x0000000406067981 */
                                                                                        /* 0x000ea2000c1e1b00 */
    /*00b0*/                   IMAD.WIDE.U32 R10, R10, R11, c[0x0][0x160] ;             /* 0x000058000a0a7625 */
                                                                                        /* 0x000fc800078e000b */
    /*00c0*/                   IMAD.WIDE.U32 R8, R3, R4, R6 ;                           /* 0x0000000403087225 */
                                                                                        /* 0x004fca00078e0006 */
    /*00d0*/                   STG.E.64 [R10.64], R8 ;                                  /* 0x000000080a007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*00e0*/                   EXIT ;                                                   /* 0x000000000000794d */
                                                                                        /* 0x000fea0003800000 */
    /*00f0*/                   BRA 0xf0;                                                /* 0xfffffff000007947 */
                                                                                        /* 0x000fc0000383ffff */
    /*0100*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0110*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0120*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0130*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0140*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0150*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0160*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0170*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
	..........


	Function : _Z14test2_mad_widePyPjS0_S_
.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                                   /* 0x00000a0000017a02 */
                                                                                        /* 0x000fe40000000f00 */
    /*0010*/                   S2R R0, SR_TID.X ;                                       /* 0x0000000000007919 */
                                                                                        /* 0x000e220000002100 */
    /*0020*/                   MOV R5, 0x4 ;                                            /* 0x0000000400057802 */
                                                                                        /* 0x000fe20000000f00 */
    /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                             /* 0x0000460000047ab9 */
                                                                                        /* 0x000fe20000000a00 */
    /*0040*/                   MOV R13, 0x8 ;                                           /* 0x00000008000d7802 */
                                                                                        /* 0x000fc60000000f00 */
    /*0050*/                   IMAD.WIDE.U32 R2, R0, R5, c[0x0][0x168] ;                /* 0x00005a0000027625 */
                                                                                        /* 0x001fc800078e0005 */
    /*0060*/                   IMAD.WIDE.U32 R4, R0.reuse, R5, c[0x0][0x170] ;          /* 0x00005c0000047625 */
                                                                                        /* 0x040fe400078e0005 */
    /*0070*/                   LDG.E R2, [R2.64] ;                                      /* 0x0000000402027981 */
                                                                                        /* 0x000ea4000c1e1900 */
    /*0080*/                   IMAD.WIDE.U32 R6, R0, R13, c[0x0][0x178] ;               /* 0x00005e0000067625 */
                                                                                        /* 0x000fe400078e000d */
    /*0090*/                   LDG.E R5, [R4.64] ;                                      /* 0x0000000404057981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*00a0*/                   LDG.E.64 R6, [R6.64] ;                                   /* 0x0000000406067981 */
                                                                                        /* 0x000ee2000c1e1b00 */
    /*00b0*/                   IMAD.WIDE.U32 R8, R2, R5, RZ ;                           /* 0x0000000502087225 */
                                                                                        /* 0x004fca00078e00ff */
    /*00c0*/                   IADD3 R10, P0, R6, R8, RZ ;                              /* 0x00000008060a7210 */
                                                                                        /* 0x008fc80007f1e0ff */
    /*00d0*/                   IADD3.X R11, R7, R9, RZ, P0, !PT ;                       /* 0x00000009070b7210 */
                                                                                        /* 0x000fe200007fe4ff */
    /*00e0*/                   IMAD.WIDE.U32 R8, R0, R13, c[0x0][0x160] ;               /* 0x0000580000087625 */
                                                                                        /* 0x000fca00078e000d */
    /*00f0*/                   STG.E.64 [R8.64], R10 ;                                  /* 0x0000000a08007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*0100*/                   EXIT ;                                                   /* 0x000000000000794d */
                                                                                        /* 0x000fea0003800000 */
    /*0110*/                   BRA 0x110;                                               /* 0xfffffff000007947 */
                                                                                        /* 0x000fc0000383ffff */
    /*0120*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0130*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0140*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0150*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0160*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0170*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0180*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0190*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01a0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01b0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01c0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01d0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01e0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*01f0*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
	..........


	Function : _Z18test1_wide_add_u64PyPjS0_S_
.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM89 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM89)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                                   /* 0x00000a0000017a02 */
                                                                                        /* 0x000fe40000000f00 */
    /*0010*/                   S2R R10, SR_TID.X ;                                      /* 0x00000000000a7919 */
                                                                                        /* 0x000e220000002100 */
    /*0020*/                   MOV R5, 0x4 ;                                            /* 0x0000000400057802 */
                                                                                        /* 0x000fe20000000f00 */
    /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                             /* 0x0000460000047ab9 */
                                                                                        /* 0x000fe20000000a00 */
    /*0040*/                   MOV R11, 0x8 ;                                           /* 0x00000008000b7802 */
                                                                                        /* 0x000fc60000000f00 */
    /*0050*/                   IMAD.WIDE.U32 R2, R10, R5, c[0x0][0x168] ;               /* 0x00005a000a027625 */
                                                                                        /* 0x001fc800078e0005 */
    /*0060*/                   IMAD.WIDE.U32 R4, R10.reuse, R5, c[0x0][0x170] ;         /* 0x00005c000a047625 */
                                                                                        /* 0x040fe400078e0005 */
    /*0070*/                   LDG.E R3, [R2.64] ;                                      /* 0x0000000402037981 */
                                                                                        /* 0x000ea4000c1e1900 */
    /*0080*/                   IMAD.WIDE.U32 R6, R10.reuse, R11.reuse, c[0x0][0x178] ;  /* 0x00005e000a067625 */
                                                                                        /* 0x0c0fe400078e000b */
    /*0090*/                   LDG.E R4, [R4.64] ;                                      /* 0x0000000404047981 */
                                                                                        /* 0x000ea8000c1e1900 */
    /*00a0*/                   LDG.E.64 R6, [R6.64] ;                                   /* 0x0000000406067981 */
                                                                                        /* 0x000ea2000c1e1b00 */
    /*00b0*/                   IMAD.WIDE.U32 R10, R10, R11, c[0x0][0x160] ;             /* 0x000058000a0a7625 */
                                                                                        /* 0x000fc800078e000b */
    /*00c0*/                   IMAD.WIDE.U32 R8, R3, R4, R6 ;                           /* 0x0000000403087225 */
                                                                                        /* 0x004fca00078e0006 */
    /*00d0*/                   STG.E.64 [R10.64], R8 ;                                  /* 0x000000080a007986 */
                                                                                        /* 0x000fe2000c101b04 */
    /*00e0*/                   EXIT ;                                                   /* 0x000000000000794d */
                                                                                        /* 0x000fea0003800000 */
    /*00f0*/                   BRA 0xf0;                                                /* 0xfffffff000007947 */
                                                                                        /* 0x000fc0000383ffff */
    /*0100*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0110*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0120*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0130*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0140*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0150*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0160*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
    /*0170*/                   NOP;                                                     /* 0x0000000000007918 */
                                                                                        /* 0x000fc00000000000 */
	..........

I do not have experience with the IMAD.WIDE approach, so I am afraid I am also still learning. To get my feet wet, I programmed up different approaches for a 64 x 64 → 128 bit multiply, and I think I have found one that is “optimal”. For a baseline implementation, 128 x 128 → 256 bit and 256 x 256 → 512 bit multiplies can be constructed at HLL level from this primitive in a straightforward manner. I am working on my test scaffolding for that at the moment.

In my view, a more suitable term for trial and error is experimentation, and as any researcher in the natural sciences would confirm, this is how we gain a deeper understanding of a given issue. The process of experimentation also provides for the occasional serendipitous discovery and can spark innovative solutions that depart from conventional wisdom. Challenging conventional wisdom is a good habit, even though in my experience conventional wisdom comes out on top more than half of the time.