[Solved] SASS Code Analysis

iamkaka · January 28, 2016, 6:58am

Hello, everyone.
I used cuobjdump to generate the SASS code shown below. These instructions load from global memory.

    /*0028*/         IMAD R6.CC, R3, R5, c[0x0][0x20]; 
    /*0030*/         IMAD.HI.X R7, R3, R5, c[0x0][0x24]; 
    /*0040*/         LD.E R2, [R6]; //load

Where can I get a full SASS manual that explains the meaning of each instruction? The “CUDA Binary Utilities” document only gives a general description of each instruction; for example, it doesn’t explain the meaning of “R6.CC”, “IMAD.HI.X” or “LD.E”.
What does the second instruction mean? I guess that the first one computes the memory address that each thread should load from, while the third one loads from global memory into a register. I have no idea what the second instruction does.
I guess that CUDA saves parameter information such as grid size, block size and array base addresses in constant memory.
In this case, c[0x0][0x20] is the base address of an array. My question is: how can I get that information?

BulatZiganshin · January 28, 2016, 9:26am

NVIDIA doesn’t officially disclose SASS details, and moreover, SASS changes with every major SM generation. The way I learned SASS:

PTX manual: http://docs.nvidia.com/cuda/parallel-thread-execution/
instruction set reference: http://docs.nvidia.com/cuda/cuda-binary-utilities/#instruction-set-ref
read the wiki of the asfermi project (Google Code Archive)
read the manual of the Kepler SASS assembler: https://hpc.aliyun.com/doc/keplerAssemblerUserGuide
there is also maxas, but its docs don’t describe the instructions

In your code, R6.CC means “write the carry to the 1-bit CC register”, and IMAD.HI.X computes the high 32 bits of the product, adds the third operand, and adds the carry from the CC register. LD.E is a load from global memory using the 64-bit address in R6,R7. The entire code is:

    R6 = lo32(R3*R5) + c[0x0][0x20], saving the carry to CC
    R7 = hi32(R3*R5) + c[0x0][0x24] + CC
    R2 = *((R7<<32) + R6)

The first two instructions multiply two 32-bit values (R3 and R5) and add the 64-bit value (c[0x0][0x24]<<32) + c[0x0][0x20], leaving the 64-bit result in the R6,R7 pair.
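The carry chain described above can be checked with a small emulation. This is only a sketch in Python, not anything NVIDIA documents: the function names are made up for illustration, and it assumes the operands are treated as signed 32-bit values (the instructions carry no .U32 suffix, but the thread does not confirm the signedness).

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a raw 32-bit register value as a signed integer (assumption)."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def imad_cc(a, b, c):
    """Model of 'IMAD Rd.CC, a, b, c': low 32 bits of a*b plus c.

    Returns (result, carry), where carry is the 1-bit carry-out of the add.
    """
    low = (to_signed32(a) * to_signed32(b)) & MASK32
    total = low + (c & MASK32)
    return total & MASK32, total >> 32


def imad_hi_x(a, b, c, carry):
    """Model of 'IMAD.HI.X Rd, a, b, c': high 32 bits of a*b, plus c, plus carry-in."""
    high = ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32
    return (high + (c & MASK32) + carry) & MASK32


def address(r3, r5, base):
    """Combine both instructions into the 64-bit address held in R7:R6."""
    r6, cc = imad_cc(r3, r5, base & MASK32)    # c[0x0][0x20] = low word of base
    r7 = imad_hi_x(r3, r5, base >> 32, cc)     # c[0x0][0x24] = high word of base
    return (r7 << 32) | r6
```

For example, address(5, 4, 0x2FFFFFFF0) gives 0x300000004, the same as 0x2FFFFFFF0 + 20: the low-word add overflows, and the carry is what bumps the high word from 2 to 3.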

c[BANK][ADDR] is constant memory, and c[0x0][0x20] is the first kernel parameter, so the entire code is:

    kernel f (uint32* x) // 64-bit pointer
    {
        R2 = *(uint32*)((char*)x + R3*R5) // R3*R5 is a byte offset; this is x[R3] if R5 holds 4
    }

Unfortunately, there aren’t many books with low-level GPU details. The best I have seen is http://www.cudahandbook.com/ ; in particular, it describes those c[][] references:

8.1.4 CONSTANT MEMORY
Constant memory resides in device memory, but it is backed by a different,
read-only cache that is optimized to broadcast the results of read requests to
threads that all reference the same memory location. Each SM contains a small,
latency-optimized cache for purposes of servicing these read requests. Making
the memory (and the cache) read-only simplifies cache management, since the
hardware has no need to implement write-back policies to deal with memory
that has been updated.

Two more books going into low-level details are:
Shane Cook “CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs”
Rob Farber “CUDA Application Design and Development”

njuffa · January 28, 2016, 12:09pm

    /*0028*/         IMAD R6.CC, R3, R5, c[0x0][0x20];
    /*0030*/         IMAD.HI.X R7, R3, R5, c[0x0][0x24];
    /*0040*/         LD.E R2, [R6]; //load

NVIDIA’s SASS shares many similarities with other instruction sets. IMAD is an integer multiply-and-add operation. In many assembly languages .CC means “set the flags”, in particular generate a carry-out. .X specifies an “extended” operation with carry-in. .HI will produce the upper half of a double-length product. c[0] is a particular constant bank (there are several constant memory banks in GPUs, and how many there are and how they are used changes with architecture). LD.E is a regular load through the cache hierarchy.

So what the above computation does is: multiply the 32-bit quantities in registers R3 and R5, and add the full 64-bit product to the 64-bit integer stored in c[0][0x24]:c[0][0x20], delivering the result to the register pair R7:R6, which now constitutes a 64-bit address. Note that the load instruction specifies R6 as the register containing the address; the use of a register pair, and thus of R7, is implicit. LD loads the data at the address contained in R7:R6 into the 32-bit register R2. The data loaded could be a 32-bit integer or a 32-bit single-precision floating-point number; which one it is should be clear from how R2 is used.
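To make the c[0][0x24]:c[0][0x20] notation concrete, here is a rough Python model of constant bank 0 as a byte array. It is a sketch under assumptions: the helper names are hypothetical, the 0x20 offset is the one given in this thread for this particular code (it differs on other architectures), and a little-endian layout is assumed.

```python
import struct

PARAM_BASE = 0x20  # offset of the first kernel parameter, as stated in this thread


def make_bank0(pointer, size=0x40):
    """Build a toy constant bank with one 64-bit pointer parameter at PARAM_BASE."""
    bank = bytearray(size)
    struct.pack_into("<Q", bank, PARAM_BASE, pointer)  # little-endian, assumed
    return bank


def read_word(bank, offset):
    """Read one 32-bit word, like a single c[0x0][offset] operand."""
    return struct.unpack_from("<I", bank, offset)[0]


def read_pair(bank, lo_offset):
    """Rebuild the 64-bit value from the words at lo_offset and lo_offset + 4."""
    lo = read_word(bank, lo_offset)
    hi = read_word(bank, lo_offset + 4)
    return (hi << 32) | lo
```

With a pointer of 0xB00200000, the word at 0x20 is 0x00200000 and the word at 0x24 is 0xB, which is why the low-half IMAD reads c[0x0][0x20] and the high-half IMAD reads c[0x0][0x24].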

iamkaka · January 31, 2016, 5:28am

Thank you for your reply!

iamkaka · January 31, 2016, 5:30am

Thank you for your reply.

linacman · November 30, 2017, 12:47pm

In case anyone needs it, the asKepler user guide has moved here:
https://help.aliyun.com/document_detail/25852.html?spm=5176.7937245.209071.7.41f06dbeFhk7fp
For the tool itself, you probably need to pay for access.

PS. I know the thread is old, but Google always sends me here.
