How to understand the following asm?

__global__ void TEST_prog(int *data_in1, int *data_in2, int *data_out) // employing IF functions

{

unsigned int tid = (blockIdx.x * blockDim.x) + threadIdx.x;

data_out[tid] = data_in1[tid] + data_in2[tid];

}

code for sm_10
	Function : _Z9TEST_progPiS_S_
.headerflags    @"EF_CUDA_SM10 EF_CUDA_PTX_SM(EF_CUDA_SM10)"

    /*0000*/        MOV.U16 R0H, g [0x1].U16;          /* 0x0023c78010004205 */
    /*0008*/        I2I.U32.U16 R1, R0L;               /* 0x04000780a0000005 */
    /*0010*/        IMAD.U16 R0, g [0x6].U16, R0H, R1; /* 0x0020478060014c01 */
    /*0018*/        SHL R2, R0, 0x2;                   /* 0xc410078030020009 */
    /*0020*/        IADD32 R0, g [0x4], R2;            /* 0x2102e800         */
    /*0024*/        IADD32 R3, g [0x6], R2;            /* 0x2102ec0c         */
    /*0028*/        GLD.U32 R1, global14[R0];          /* 0x80c00780d00e0005 */
    /*0030*/        GLD.U32 R0, global14[R3];          /* 0x80c00780d00e0601 */
    /*0038*/        IADD32 R1, R1, R0;                 /* 0x20008204         */
    /*003c*/        IADD32 R0, g [0x8], R2;            /* 0x2102f000         */
    /*0040*/        GST.U32 global14[R0], R1;          /* 0xa0c00781d00e0005 */
	...................................

Does RX[H/L] divide each register into high 16 bits and low 16 bits for use? I guess R0L in the second instruction stores threadIdx.x, but R0L has never been assigned before reading. Is this a hardware requirement, threadIdx.x is always placed in R0L? But where does blockIdx.x get it from? In addition, g [0x6] is used in both the third and sixth instructions. In the third instruction, I guess it is blockDim.x, but in the sixth instruction, I think it is the pointer in the kernel parameter. I’m so confused. Can you give me some advice? Thank you very much

The sm_10 devices that you are showing code for were obsoleted in approximately 2012, and were significantly different than the devices that came later, in terms of how they operated and their instruction set.

So you’re welcome to ask questions, but whatever knowledge you gain may be inapplicable or confusing when applied to current CUDA GPUs. Good luck!

sm_1x used a very different kernel parameter passing scheme than is used in modern GPUs. It used a chunk of shared memory and R0 for threadIdx (initialized by the hardware on kernel entry; I think threadIdx.y and threadIdx.z were packed into R0H).

There was (limited) support for 16-bit data and operations by splitting the general purpose registers into two halves, with H referring to the 16 most significant bits and L referring to the 16 least significant bits. This is particularly visible in the IMAD.U16 instruction here, which performs a 16 x16 + 32 -bit multiply add.

The (confusing) way one needs to read shared memory address references is that one needs to multiply the stated offset by the stated element size. So g[0x6].U16 refers to bytes 12, 13 and represents blockIdx.x, while g[0x6] refers to bytes 24, 25, 26, 27 and represents data_in2 (32-bit access is the default and therefore not annotated by the disassembler).

With annotations added, the code shown looks like this:

/*0000*/        MOV.U16 R0H, g [0x1].U16;          // blockDim.x
/*0008*/        I2I.U32.U16 R1, R0L;               // (uint32_t)threadIdx.x
/*0010*/        IMAD.U16 R0, g [0x6].U16, R0H, R1; // tid = (blockIdx.x * blockDim.x) + threadIdx.x
/*0018*/        SHL R2, R0, 0x2;                   // convert word offset to byte offset 
/*0020*/        IADD32 R0, g [0x4], R2;            // &data_in1[tid]   
/*0024*/        IADD32 R3, g [0x6], R2;            // &data_in2[tid] 
/*0028*/        GLD.U32 R1, global14[R0];          // data_in1[tid]
/*0030*/        GLD.U32 R0, global14[R3];          // data_in2[tid]  
/*0038*/        IADD32 R1, R1, R0;                 // data_in1[tid] + data_in2[tid]    
/*003c*/        IADD32 R0, g [0x8], R2;            // &data_out[tid] 
/*0040*/        GST.U32 global14[R0], R1;          // data_out[tid] = data_in1[tid] + data_in2[tid]

Unless this is an exercise in retrocomputing, I would strongly suggest not to look at sm_1x. It was an initial attempt at a compute-unified GPU architecture with minimal changes from previous graphics-only GPU architectures, and supported neither an ABI nor C++. It is wholly unrepresentative of modern GPU architectures, which started with sm_30 (Kepler).

1 Like

Thank you for your reply. The reason why I need to learn sm1.0 sass is that I recently looked at this hardware open source project: GitHub - Jerc007/Open-GPGPU-FlexGrip-: FlexGripPlus: an open-source GPU model for reliability evaluation and micro architectural simulation. It is compatible with the sm.10 cuda environment. There are relatively few gpgpu hardware open source projects. , so there aren’t many choices

Frankly, I am a little puzzled by this project. It seems someone spent a lot of time reverse engineering NVIDIA’s G80 GPU design from 2007 (or at least its compute portion), including the SFU, and apparently did so recently, as the project files have time stamps up to 2023. All this so they can replicate the G80 in an FPGA?

I am not sure what useful conclusion could be draw from using this for forward-looking research, given that GPU architectures have gone through several major revisions since then.

I think you are right. The first version of this project is very old. Currently, someone is adding more features based on previous work and supporting more Sass instructions.
The focus of the first version of this project is to study that the programmability of gpgpu is more flexible than that of fpga. There is no need to repeatedly burn fpga bit stream files. It only takes a very short time to run applications. However, burning fpga bit stream files requires a lot of time. time. The performance of the first version is compared with Microblaze CPU, not with GPU with the same architecture as nv.
However, someone recently added SFU and float unit support based on previous work. I am not sure why they continue to research based on such an old architecture.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.