Question about atomic-add sass

King_Crimson · June 21, 2018, 12:56am

Dear experts,
I have a couple of questions on the machine code for atomic-add on the Pascal GPU. The code:

__global__ void Test(double* biubiu)
{
    double x = 1.23;
    atomicAdd(biubiu, x);
}

is compiled into:

code for sm_61
		Function : _Z4TestPd
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                               /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];       /* 0x4c98078000870001 */
        /*0010*/                   MOV R2, c[0x0][0x140];      /* 0x4c98078005070002 */
        /*0018*/                   MOV R3, c[0x0][0x144];      /* 0x4c98078005170003 */
                                                               /* 0x0003c400fe4007f1 */
        /*0028*/                   MOV32I R4, 0x7ae147ae;      /* 0x0107ae147ae7f004 */
        /*0030*/                   MOV32I R5, 0x3ff3ae14;      /* 0x0103ff3ae147f005 */
        /*0038*/                   RED.E.ADD.F64.RN [R2], R4;  /* 0xebf9000000670204 */
                                                               /* 0x001f9c00fde007ef */
        /*0048*/                   NOP;                        /* 0x50b0000000070f00 */
        /*0050*/                   NOP;                        /* 0x50b0000000070f00 */
        /*0058*/                   NOP;                        /* 0x50b0000000070f00 */
                                                               /* 0x001f8000ffe007ff */
        /*0068*/                   EXIT;                       /* 0xe30000000007000f */
        /*0070*/                   BRA 0x70;                   /* 0xe2400fffff87000f */
        /*0078*/                   NOP;                        /* 0x50b0000000070f00 */

My question is:

According to cuda binary utility manual, MOV32I has existed since Volta, but I’m using the Pascal GPU. Is this simply an error in the manual?
Does “E” in RED.E refer to “exclusive” in the cache protocol?
What does MOV32I mean exactly? 32: 32-bit, I: integer? Here I was using double-precision atomic-add. So why does it only move 32-bit data into R4 and then atomically add R4 to R2?

Thanks for your time!

njuffa · June 21, 2018, 1:43am

As GPUs are 32-bit machines they use aligned register pairs when operating on 64-bit data, meaning the low-order portion of the data is stored in an even-numbered register, and the high-order portion of the data in the next higher odd-numbered register. NVIDIA’s disassembler shows only the even-numbered register of an aligned register pair for 64-bit operations.

In this example, data is loaded into R4:R5 using two 32-bit binary moves. MOV32I is a 32-bit immediate move instruction meaning the data stored into the destination register is contained in the instruction itself (this is possible since GPU instructions are eight byte in size). I am reasonably sure that such an instruction has existed since sm_10.

Note that the RED instruction does not store into R2:R3; rather R2:R3 hold a 64-bit pointer (namely, ‘biubiu’), and the instruction stores into the 64-bit memory location pointed to by that pointer.

If you disassemble more SASS code, you will find that the .E suffix is used with global memory accesses in general. I don’t quite recall what it stands for, it might indicate that generic addresses are being used, as opposed to memory-space specific ones (see the CUDA documention for generic address space vs specific memory address spaces, an example of which would be (thread-) local memory). Or it might indicate that this instruction uses 64-bit addressing (‘E’ as in “extended”?).

system · March 19, 2022, 4:06am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Half2 atomics generate unused code CUDA Programming and Performance	13	161	August 8, 2024
[Solved]SASS Code Analysis CUDA Programming and Performance	5	8044	November 30, 2017
About LD instruction for wmma CUDA Programming and Performance	2	514	July 5, 2023
atomicAdd with float2 no API support, workarounds ? CUDA Programming and Performance	23	5168	January 28, 2021
Bug with Pointer Arithmetic EmuDebug/EmuRelease results donâ€™t match debug/Release Results CUDA Programming and Performance	9	1512	August 4, 2009
Identifier "__HALF2_TO_UI" is undefined when using asm for cuda CUDA Programming and Performance cuda	18	516	October 27, 2023
How to eliminate address computation instructions in the SASS code CUDA Programming and Performance	5	684	October 3, 2020
Is there a way to avoid atomicAdd in my situation? CUDA Programming and Performance	3	1391	March 4, 2019
Compute Capability 1.1 32-bit words?? CUDA Programming and Performance	3	9609	February 15, 2010
Shared mem atomics Repeat topic CUDA Programming and Performance	47	8781	December 1, 2009

Question about atomic-add sass

Related topics