Incomprehensible IMAD.HI behavior

Hi guys, I’m using 3080 with cuda11.6 do some s32 divide, I found some of the result is incomprehensible, I open it with cuda-gdb and piece of assembly is like:
/some other calculate/
IMAD R11, R8, R9, RZ
IMAD.HI.U32 R7, R7, R11, R6
in this case, R6 is 0, R8 is 0xfffffe01, R9 is 0x800000 so R11 is 0x800000, which is expected. But when I do IMAD.HI.U32 with input R7 = 0x1ff, I got result R7 = 0x1ff which is unexpected. I know IMAD.HI is mul&add with Hi(R7*R11)+R6, but I think the true high part is 0, who can please tell me how I got result with 0x1ff?

Here is some information about the log. Who can please help me with this question?

I’m not able to explain the picture. to investigate, I created the following test case:

$ cat
#include <cstdio>
__global__ void k(unsigned a, unsigned b, unsigned c) {
    asm ("mad.hi.u32 %0, %0, %1, %2;" : "+r"(a) : "r"(b) , "r"(c));
    printf("result: 0x%x\n", a);

int main(){

  k<<<1,1>>>(0x01ffU, 0x0800000U, 0U);
$ nvcc -o t2024
$ ./t2024
result: 0x0

As you can see it produces the expected result. If you compile the above code with -arch=sm_86 and then do cuobjdump -sass ./t2024 you will see that there is a IMAD.HI.U32 instruction in the dump.

If you can provide a similar test case that shows an unexpected result, it may be helpful.

Thanks for help! I got same result on my chip. But still didn’t solve my problem.
In my condition this is a small int32 divide kernel:
__global__ void int32_div(int in1, int in2, int* output) { *output = in1 / in2; }
when I set in1 = 0x3f800001 and in2 = 0xff800000, I saw some instruction in cuda-gdb, after i run si 23 and got instructions above, at this time I can see $R7 is 0x1ff and $R11 is 0x800000, but I couldn’t understand why when I run IMAD.HI.U32 R7, R7, R11, R6 by si and got 0x1ff as $R7 result.
BTW, in this case I saw an instruction like IMAD.MOV.U32 R6, RZ, RZ, RZ, why instruction did addition with R6 but not RZ?

When I run your code the way you describe, I seem to get the right result:

$ cat
#include <iostream>

__global__ void int32_div(int in1, int in2, int* output) { *output = in1 / in2; };

int main(){

  int *o;
  cudaMalloc(&o, sizeof (o[0]));
  int in1 = 0x3f800001;
  int in2 = 0xff800000;
  int32_div<<<1,1>>>(in1, in2, o);
  int r;
  cudaMemcpy(&r, o, sizeof(o[0]), cudaMemcpyDeviceToHost);
  std::cout << "r = " << r << " should be: " << in1/in2 << std::endl;
$ nvcc -o t2026
$ compute-sanitizer ./t2026
r = -127 should be: -127
========= ERROR SUMMARY: 0 errors

Do you get the right result?

Yes, I got the right result, but couldn’t understand why this IMAD.HI.U32 instruction got result like this.

I don’t see how it could be anything except an issue with the cuda-gdb tool. You might want to ask on that forum.

I don’t have a 3080 to test on.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.