The meaning of CUDA disassemly

Hi everyone, I’m a CUDA assembly beginner and I disassemble the code of vectorAdd.cu

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

the corresponding disassembly:

code for sm_30
		Function : _Z9vectorAddPKfS0_Pfi
	.headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                                /* 0x2202e2c282823307 */
        /*0008*/                   MOV R1, c[0x0][0x44];                        /* 0x2800400110005de4 */
        /*0010*/                   S2R R0, SR_CTAID.X;                          /* 0x2c00000094001c04 */
        /*0018*/                   S2R R3, SR_TID.X;                            /* 0x2c0000008400dc04 */
        /*0020*/                   IMAD R0, R0, c[0x0][0x28], R3;               /* 0x20064000a0001ca3 */
        /*0028*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT;  /* 0x1b0e40056001dc23 */
        /*0030*/               @P0 EXIT;                                        /* 0x80000000000001e7 */
        /*0038*/                   ISCADD R4.CC, R0, c[0x0][0x140], 0x2;        /* 0x4001400500011c43 */
                                                                                /* 0x22c04282c04282b7 */
        /*0048*/                   MOV32I R7, 0x4;                              /* 0x180000001001dde2 */
        /*0050*/                   IMAD.HI.X R5, R0, R7, c[0x0][0x144];         /* 0x208e800510015ce3 */
        /*0058*/                   ISCADD R2.CC, R0, c[0x0][0x148], 0x2;        /* 0x4001400520009c43 */
        /*0060*/                   LD.E R4, [R4];                               /* 0x8400000000411c85 */
        /*0068*/                   IMAD.HI.X R3, R0, R7, c[0x0][0x14c];         /* 0x208e80053000dce3 */
        /*0070*/                   LD.E R2, [R2];                               /* 0x8400000000209c85 */
        /*0078*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;        /* 0x4001400540019c43 */
                                                                                /* 0x20000002f04283f7 */
        /*0088*/                   IMAD.HI.X R7, R0, R7, c[0x0][0x154];         /* 0x208e80055001dce3 */
        /*0090*/                   FADD R0, R2, R4;                             /* 0x5000000010201c00 */
        /*0098*/                   ST.E [R6], R0;                               /* 0x9400000000601c85 */
        /*00a0*/                   EXIT;                                        /* 0x8000000000001de7 */
        /*00a8*/                   BRA 0xa8;                                    /* 0x4003ffffe0001de7 */
        /*00b0*/                   NOP;                                         /* 0x4000000000001de4 */
        /*00b8*/                   NOP;                                         /* 0x4000000000001de4 */
  1. what’s the meaning of c[0x0][0x44], c[0x0][0x28], c[0x0][0x140], c[0x0][0x144], c[0x0][0x148], c[0x0][14c], c[0x0][0x150], c[0x0][0x154]?

  2. How can I check the value of c[0x0][???] by cuda-gdb or other methods?

Thank you~!

it’s SASS code, not officially documented by nVidia. The implementation details change between hardware generations (Fermi, Kepler, Maxwell, Pascal, Volta, …)

The CUDA utility cuobjdump -sass can dump the SASS code from a cuda binary object

There’s an unofficial SASS assembler for the Maxwell architecture written by Scott Gray

I do not know how to get low level data access to machine registers at SASS level, unfortunately.

Christian

Those are referring to locations in constant memory. Constant memory has some partitions. Constant memory is used to provide kernel arguments to the device code.

Thank you for your answer, and how can I check the these variables memory by cuda-gdb?

Thanks very much! I know the maxas, but its too difficult to me now~!

I don’t understand why it’s doing:

  1. IMAD.HI.X R5, R0, R7, c[0x0][0x144];

It never uses R5 after this

Same for R3 here:

  1. IMAD.HI.X R3, R0, R7, c[0x0][0x14c];

In both cases, I think it’s doing an i (the ‘global index’) * 4 + some base address and the result is stored in R5/R3 and then does nothing with the result?

The IMAD.HI.X instructions storing results in R3 and R5 would still make sense (despite the output registers remaining unused) if the following ISCADD R*.CC instructions makes use the carry bit created by these multiplications.

Christian

CONSTANTS

CUDA SASS (Disassembly) is close to PTX.

c[bank][offset] is the syntax for a reference to an indexed constants. Indexed constants are heavily used to reference

  1. Per module constant variables
  2. Per module constant literals (const double = 1.0) that cannot be encoded directly into instructions
  3. Per launch user kernel parameters (up to 4KB)
  4. Per launch driver kernel parameters (local memory base address, GridDim, BlockDim)

The bank for module level constants will be different from per kernel launch constants.

The banks and offsets will differ between architectures/chips.

DEBUGGER - READING CONSTANTS

The Windows CUDA debugger can read module constants (bank=0, c[0][#]) in the memory view using the syntax
(constant int*)0. For c[0][0x100] use (constant int*)0x100.

The Windows CUDA debugger can view the 4KiB of kernel parameters using (params int*)0. This maps to c[3][0x140] or c[3][0x160] depending on the architecture.

I have filed a Request For Enhancement with the debugger teams.

cuda-gdb does not appear to support reading constants.

64-bit ADDRESSES

The example you specified IMAD.HI.X R5, R0, R7, c[0x0][0x144]; is doing the operation

R5 = R0 x R7 + c[0][0x144]

This is forming the 64-bit address (R5 << 32) + R4. In SASS 64-bit addresses are referenced by the lower register. For example you are likely to see

LDG.E.32 R0, [R4] // load 32-bit from extended address R4/R5 (extended meaning 64-bit address vs. 32-bit address)

That certainly applies, although 1.0 isn’t the best example, as that can be encoded into instructions in many cases. const double pi = 3.14159265358979323 would be a more suitable example.

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (double x)
{
    cont double one = 1.0;
    x = x + one;
    x = fma (x, one, x);
    if (x > one) {
        x = x - one;
    }
    printf ("x=%23.15e\n", x);
}

int main (void)
{
    kernel<<<1,1>>>(0.1);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

arch = sm_61
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit

        code for sm_61
                Function : _Z6kerneld
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"

        /*0008*/                   MOV R1, c[0x0][0x20];
        /*0010*/                   MOV32I R2, 0x0;
        /*0018*/                   MOV32I R3, 0x3ff00000;

        /*0028*/                   DADD R2, R2, c[0x0][0x140];
        /*0030*/                   DFMA R2, R2, 1, R2;                     <<<<<
        /*0038*/                   IADD32I R1, R1, -0x8;

        /*0048*/                   DSETP.GT.AND P0, PT, R2.reuse, 1, PT;   <<<<<
        /*0050*/                   DADD R4, R2, -1;                        <<<<<
        /*0058*/                   IADD R6.CC, R1, c[0x0][0x4];

        /*0068*/                   IADD R0, R6, -c[0x0][0x4];
        /*0070*/                   IADD.X R7, RZ, c[0x0][0x104];
        /*0078*/                   SEL R2, R4, R2, P0;

        /*0088*/                   SEL R3, R5, R3, P0;
        /*0090*/         {         MOV32I R4, 0x0;
        /*0098*/                   STL.64 [R0], R2;        }

        /*00a8*/                   MOV32I R5, 0x0;
        /*00b0*/                   JCAL 0x0;
        /*00b8*/                   EXIT;

        /*00c8*/                   BRA 0xc0;