question about atomic-add sass

Dear experts,
I have a couple of questions on the machine code for atomic-add on the Pascal GPU. The code:

__global__ void Test(double* biubiu)
{
    double x = 1.23;
    atomicAdd(biubiu, x);
}

is compiled into:

code for sm_61
		Function : _Z4TestPd
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                               /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];       /* 0x4c98078000870001 */
        /*0010*/                   MOV R2, c[0x0][0x140];      /* 0x4c98078005070002 */
        /*0018*/                   MOV R3, c[0x0][0x144];      /* 0x4c98078005170003 */
                                                               /* 0x0003c400fe4007f1 */
        /*0028*/                   MOV32I R4, 0x7ae147ae;      /* 0x0107ae147ae7f004 */
        /*0030*/                   MOV32I R5, 0x3ff3ae14;      /* 0x0103ff3ae147f005 */
        /*0038*/                   RED.E.ADD.F64.RN [R2], R4;  /* 0xebf9000000670204 */
                                                               /* 0x001f9c00fde007ef */
        /*0048*/                   NOP;                        /* 0x50b0000000070f00 */
        /*0050*/                   NOP;                        /* 0x50b0000000070f00 */
        /*0058*/                   NOP;                        /* 0x50b0000000070f00 */
                                                               /* 0x001f8000ffe007ff */
        /*0068*/                   EXIT;                       /* 0xe30000000007000f */
        /*0070*/                   BRA 0x70;                   /* 0xe2400fffff87000f */
        /*0078*/                   NOP;                        /* 0x50b0000000070f00 */

My question is:

  1. According to cuda binary utility manual, MOV32I has existed since Volta, but I’m using the Pascal GPU. Is this simply an error in the manual?
  2. Does “E” in RED.E refer to “exclusive” in the cache protocol?
  3. What does MOV32I mean exactly? 32: 32-bit, I: integer? Here I was using double-precision atomic-add. So why does it only move 32-bit data into R4 and then atomically add R4 to R2?

Thanks for your time!

As GPUs are 32-bit machines they use aligned register pairs when operating on 64-bit data, meaning the low-order portion of the data is stored in an even-numbered register, and the high-order portion of the data in the next higher odd-numbered register. NVIDIA’s disassembler shows only the even-numbered register of an aligned register pair for 64-bit operations.

In this example, data is loaded into R4:R5 using two 32-bit binary moves. MOV32I is a 32-bit immediate move instruction meaning the data stored into the destination register is contained in the instruction itself (this is possible since GPU instructions are eight byte in size). I am reasonably sure that such an instruction has existed since sm_10.

Note that the RED instruction does not store into R2:R3; rather R2:R3 hold a 64-bit pointer (namely, ‘biubiu’), and the instruction stores into the 64-bit memory location pointed to by that pointer.

If you disassemble more SASS code, you will find that the .E suffix is used with global memory accesses in general. I don’t quite recall what it stands for, it might indicate that generic addresses are being used, as opposed to memory-space specific ones (see the CUDA documention for generic address space vs specific memory address spaces, an example of which would be (thread-) local memory). Or it might indicate that this instruction uses 64-bit addressing (‘E’ as in “extended”?).