Hi everyone!
I am optimizing an application that relies almost entirely on multi-precision integer arithmetic. The main optimization target is a Tesla P100. Because this GPU lacks native 32-bit integer multiply-and-add instructions, I am trying to speed up the routines with 16-bit XMAD, as explained in the forum (https://devtalk.nvidia.com/default/topic/1017754/long-integer-multiplication-mul-wide-u64-and-mul-wide-u128/) and in the paper by Emmart et al. (https://ieeexplore.ieee.org/abstract/document/7563271).
To make ptxas generate XMAD instructions, I am using their template, e.g.
#define xmadhh_c_cc(r, a, b, c) \
asm volatile ("{ \n\t" \
".reg .u16 %alo, %ahi, %blo, %bhi; \n\t" \
".reg .u32 %t; \n\t" \
"mov.b32 {%alo, %ahi}, %1; \n\t" \
"mov.b32 {%blo, %bhi}, %2; \n\t" \
"mul.wide.u16 %t, %ahi, %bhi; \n\t" \
"addc.cc.u32 %0, %3, %t; \n\t" \
"}" : "=r"(r) : "r" (a), "r" (b), "r" (c));
However, when I look at the SASS code generated for my platform, I see that each variant (low/low, high/high, high/low, with and without carry) compiles to one XMAD and one IADD instruction. I never see any XMAD.X or XMAD.CC in the SASS code.
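To be explicit about the arithmetic I expect from the macro above, here is a minimal host-side sketch of the high/high variant with carry-in and carry-out. It only models the result, not the instruction selection. The names CarryResult and xmadhh_model are made up for illustration and are not part of CUDA, PTX, or the cited paper; the carry flag is passed explicitly because plain C++ has no access to the hardware condition code.

```cpp
#include <cstdint>

// Hypothetical helper type: the 32-bit result plus the carry-out flag (0 or 1).
struct CarryResult {
    uint32_t value;
    uint32_t carry;
};

// Host-side model of what xmadhh_c_cc is meant to compute:
//   r = c + hi16(a) * hi16(b) + carry_in, with carry_out set on 32-bit overflow.
// carry_in stands in for the condition-code flag consumed by addc.cc.u32.
inline CarryResult xmadhh_model(uint32_t a, uint32_t b, uint32_t c, uint32_t carry_in) {
    uint32_t ahi = a >> 16;                      // upper 16 bits of a
    uint32_t bhi = b >> 16;                      // upper 16 bits of b
    uint32_t t = ahi * bhi;                      // 16x16 -> 32-bit product, cannot overflow
    uint64_t sum = (uint64_t)c + t + carry_in;   // widen so the carry-out is visible
    return { (uint32_t)sum, (uint32_t)(sum >> 32) };
}
```

In a chain of such operations, the carry field of one step would be fed in as carry_in of the next, which is the role I expected XMAD.CC and XMAD.X to play in the SASS.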
Can anyone with more insight into the compilation steps help me with this?
Any help is much appreciated!
UPDATE:
In case this is important, I am using:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130