The meaning of CUDA disassemly

sian1987o · May 3, 2018, 1:37pm

Hi everyone, I’m a CUDA assembly beginner and I disassemble the code of vectorAdd.cu

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

the corresponding disassembly:

code for sm_30
		Function : _Z9vectorAddPKfS0_Pfi
	.headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                                /* 0x2202e2c282823307 */
        /*0008*/                   MOV R1, c[0x0][0x44];                        /* 0x2800400110005de4 */
        /*0010*/                   S2R R0, SR_CTAID.X;                          /* 0x2c00000094001c04 */
        /*0018*/                   S2R R3, SR_TID.X;                            /* 0x2c0000008400dc04 */
        /*0020*/                   IMAD R0, R0, c[0x0][0x28], R3;               /* 0x20064000a0001ca3 */
        /*0028*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT;  /* 0x1b0e40056001dc23 */
        /*0030*/               @P0 EXIT;                                        /* 0x80000000000001e7 */
        /*0038*/                   ISCADD R4.CC, R0, c[0x0][0x140], 0x2;        /* 0x4001400500011c43 */
                                                                                /* 0x22c04282c04282b7 */
        /*0048*/                   MOV32I R7, 0x4;                              /* 0x180000001001dde2 */
        /*0050*/                   IMAD.HI.X R5, R0, R7, c[0x0][0x144];         /* 0x208e800510015ce3 */
        /*0058*/                   ISCADD R2.CC, R0, c[0x0][0x148], 0x2;        /* 0x4001400520009c43 */
        /*0060*/                   LD.E R4, [R4];                               /* 0x8400000000411c85 */
        /*0068*/                   IMAD.HI.X R3, R0, R7, c[0x0][0x14c];         /* 0x208e80053000dce3 */
        /*0070*/                   LD.E R2, [R2];                               /* 0x8400000000209c85 */
        /*0078*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;        /* 0x4001400540019c43 */
                                                                                /* 0x20000002f04283f7 */
        /*0088*/                   IMAD.HI.X R7, R0, R7, c[0x0][0x154];         /* 0x208e80055001dce3 */
        /*0090*/                   FADD R0, R2, R4;                             /* 0x5000000010201c00 */
        /*0098*/                   ST.E [R6], R0;                               /* 0x9400000000601c85 */
        /*00a0*/                   EXIT;                                        /* 0x8000000000001de7 */
        /*00a8*/                   BRA 0xa8;                                    /* 0x4003ffffe0001de7 */
        /*00b0*/                   NOP;                                         /* 0x4000000000001de4 */
        /*00b8*/                   NOP;                                         /* 0x4000000000001de4 */

what’s the meaning of c[0x0][0x44], c[0x0][0x28], c[0x0][0x140], c[0x0][0x144], c[0x0][0x148], c[0x0][14c], c[0x0][0x150], c[0x0][0x154]?
How can I check the value of c[0x0][???] by cuda-gdb or other methods?

Thank you~!

cbuchner1 · May 3, 2018, 2:37pm

it’s SASS code, not officially documented by nVidia. The implementation details change between hardware generations (Fermi, Kepler, Maxwell, Pascal, Volta, …)

The CUDA utility cuobjdump -sass can dump the SASS code from a cuda binary object

There’s an unofficial SASS assembler for the Maxwell architecture written by Scott Gray

I do not know how to get low level data access to machine registers at SASS level, unfortunately.

Christian

Robert_Crovella · May 3, 2018, 4:09pm

Those are referring to locations in constant memory. Constant memory has some partitions. Constant memory is used to provide kernel arguments to the device code.

sian1987o · May 4, 2018, 1:29am

Thank you for your answer, and how can I check the these variables memory by cuda-gdb?

sian1987o · May 4, 2018, 1:30am

Thanks very much! I know the maxas, but its too difficult to me now~!

hazelnutvt04 · December 11, 2019, 4:43pm

I don’t understand why it’s doing:

IMAD.HI.X R5, R0, R7, c[0x0][0x144];

It never uses R5 after this

Same for R3 here:

IMAD.HI.X R3, R0, R7, c[0x0][0x14c];

In both cases, I think it’s doing an i (the ‘global index’) * 4 + some base address and the result is stored in R5/R3 and then does nothing with the result?

cbuchner1 · December 11, 2019, 4:57pm

The IMAD.HI.X instructions storing results in R3 and R5 would still make sense (despite the output registers remaining unused) if the following ISCADD R*.CC instructions makes use the carry bit created by these multiplications.

Christian

Greg · December 11, 2019, 5:29pm

CONSTANTS

CUDA SASS (Disassembly) is close to PTX.

c[bank][offset] is the syntax for a reference to an indexed constants. Indexed constants are heavily used to reference

Per module constant variables
Per module constant literals (const double = 1.0) that cannot be encoded directly into instructions
Per launch user kernel parameters (up to 4KB)
Per launch driver kernel parameters (local memory base address, GridDim, BlockDim)

The bank for module level constants will be different from per kernel launch constants.

The banks and offsets will differ between architectures/chips.

DEBUGGER - READING CONSTANTS

The Windows CUDA debugger can read module constants (bank=0, c[0][#]) in the memory view using the syntax
(constant int*)0. For c[0][0x100] use (constant int*)0x100.

The Windows CUDA debugger can view the 4KiB of kernel parameters using (params int*)0. This maps to c[3][0x140] or c[3][0x160] depending on the architecture.

I have filed a Request For Enhancement with the debugger teams.

cuda-gdb does not appear to support reading constants.

64-bit ADDRESSES

The example you specified IMAD.HI.X R5, R0, R7, c[0x0][0x144]; is doing the operation

R5 = R0 x R7 + c[0][0x144]

This is forming the 64-bit address (R5 << 32) + R4. In SASS 64-bit addresses are referenced by the lower register. For example you are likely to see

LDG.E.32 R0, [R4] // load 32-bit from extended address R4/R5 (extended meaning 64-bit address vs. 32-bit address)

njuffa · December 11, 2019, 5:47pm

That certainly applies, although 1.0 isn’t the best example, as that can be encoded into instructions in many cases. const double pi = 3.14159265358979323 would be a more suitable example.

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (double x)
{
    cont double one = 1.0;
    x = x + one;
    x = fma (x, one, x);
    if (x > one) {
        x = x - one;
    }
    printf ("x=%23.15e\n", x);
}

int main (void)
{
    kernel<<<1,1>>>(0.1);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

arch = sm_61
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit

        code for sm_61
                Function : _Z6kerneld
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"

        /*0008*/                   MOV R1, c[0x0][0x20];
        /*0010*/                   MOV32I R2, 0x0;
        /*0018*/                   MOV32I R3, 0x3ff00000;

        /*0028*/                   DADD R2, R2, c[0x0][0x140];
        /*0030*/                   DFMA R2, R2, 1, R2;                     <<<<<
        /*0038*/                   IADD32I R1, R1, -0x8;

        /*0048*/                   DSETP.GT.AND P0, PT, R2.reuse, 1, PT;   <<<<<
        /*0050*/                   DADD R4, R2, -1;                        <<<<<
        /*0058*/                   IADD R6.CC, R1, c[0x0][0x4];

        /*0068*/                   IADD R0, R6, -c[0x0][0x4];
        /*0070*/                   IADD.X R7, RZ, c[0x0][0x104];
        /*0078*/                   SEL R2, R4, R2, P0;

        /*0088*/                   SEL R3, R5, R3, P0;
        /*0090*/         {         MOV32I R4, 0x0;
        /*0098*/                   STL.64 [R0], R2;        }

        /*00a8*/                   MOV32I R5, 0x0;
        /*00b0*/                   JCAL 0x0;
        /*00b8*/                   EXIT;

        /*00c8*/                   BRA 0xc0;

Topic		Replies	Views
[Solved]SASS Code Analysis CUDA Programming and Performance	5	8306	November 30, 2017
What does LOP.AND.NZ do? CUDA Programming and Performance	13	1274	December 16, 2020
An Easy Introduction to CUDA C and C++ Technical Blog	48	1282	July 19, 2018
cuda SASS question CUDA Programming and Performance	4	1887	June 18, 2018
What is code compiled with -arch=sm_13 slower? CUDA Programming and Performance	16	3662	April 22, 2009
CUDA never stops CUDA Programming and Performance	4	2782	March 20, 2012
Build Error MSB3721 When calling object method within kernel, using compiler directives CUDA Programming and Performance	9	5729	November 18, 2015
newbie in Cuda needs help with 2D arrays CUDA Programming and Performance	9	929	March 9, 2018
Help a newbie with CUDA development! CUDA Programming and Performance	16	14559	May 1, 2008
Help with strange error CUDA Programming and Performance	8	2107	February 25, 2010

The meaning of CUDA disassemly

Related topics