[Solved] SASS Code Analysis

Hello, everyone.
I used cuobjdump to generate the SASS code below. It loads a value from global memory.

    /*0028*/         IMAD R6.CC, R3, R5, c[0x0][0x20]; 
    /*0030*/         IMAD.HI.X R7, R3, R5, c[0x0][0x24]; 
    /*0040*/         LD.E R2, [R6]; //load
  1. Where can I get a full SASS manual that explains the meaning of each instruction? The "CUDA Binary Utilities" document only gives a general description of each instruction, e.g. it doesn't explain the meaning of "R6.CC", "IMAD.HI.X" or "LD.E".

  2. What is the meaning of the second instruction? I guess that the first one computes the memory address that each thread should load from, while the third one loads from global memory into a register. I have no idea what the second instruction does.

  3. I guess that CUDA saves parameter information such as grid size, block size and array base addresses in constant memory.
    In this case, c[0x0][0x20] is the base address of an array. My question is: how can I find out what is stored where?

NVIDIA doesn't officially document SASS in detail, and moreover it changes with every major SM generation. The way I learned SASS:

  1. ptx manual: http://docs.nvidia.com/cuda/parallel-thread-execution/
  2. http://docs.nvidia.com/cuda/cuda-binary-utilities/#instruction-set-ref
  3. read the wiki of the asfermi project (now in the Google Code Archive)
  4. read the manual of Kepler SASS: https://hpc.aliyun.com/doc/keplerAssemblerUserGuide
  5. there is also maxas, but its docs don't describe the individual instructions

In your code, R6.CC means "write the carry to the 1-bit CC register", and IMAD.HI.X computes the high 32 bits of the result and adds the carry from the CC register. LD.E is a load from global memory using the 64-bit address in R6,R7. The entire code is:

    R6 = R3*R5 + c[0x0][0x20], saving carry to CC
    R7 = ((R3*R5)>>32) + c[0x0][0x24] + CC
    R2 = *(R7<<32 + R6)

The first two instructions multiply two 32-bit values (R3 and R5) and add the 64-bit value (c[0x0][0x24]<<32)+c[0x0][0x20], leaving the 64-bit result in the R6,R7 pair.
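The carry chain described above can be checked with a small emulation. This is only a sketch: the function names are made up for illustration, and it assumes R3 and R5 are treated as unsigned 32-bit values, which the thread does not confirm.

```python
MASK32 = 0xFFFFFFFF

def imad_cc(a, b, c_lo):
    """Sketch of IMAD Rd.CC: low 32 bits of a*b + c_lo, plus the carry-out."""
    total = ((a * b) & MASK32) + c_lo
    return total & MASK32, total >> 32

def imad_hi_x(a, b, c_hi, carry):
    """Sketch of IMAD.HI.X: high 32 bits of a*b, plus c_hi, plus the carry-in."""
    return (((a * b) >> 32) + c_hi + carry) & MASK32

def address(a, b, base):
    """Combine both halves into the 64-bit address held in the R7:R6 pair."""
    lo, carry = imad_cc(a, b, base & MASK32)      # R6, CC
    hi = imad_hi_x(a, b, base >> 32, carry)       # R7
    return (hi << 32) | lo

# Hypothetical values: the low word overflows, so the carry reaches R7.
print(hex(address(0x10, 4, 0x00000002FFFFFFF0)))
```

The two 32-bit steps together give the same result as a single 64-bit `base + a*b`, which is why the compiler emits them as a pair.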

c[BANK][ADDR] refers to constant memory. Here c[0x0][0x20] is the first kernel parameter, so the entire code is:

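kernel f (uint32* x) // 64-bit pointer
{
// R3*R5 is added to the pointer as a byte offset;
// this is x[R3] when R5 holds sizeof(uint32) = 4
R2 = *(uint32*)((char*)x + R3*R5)
}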
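To see why one 64-bit pointer parameter shows up as two operands, c[0x0][0x20] and c[0x0][0x24], here is a minimal sketch that models constant bank 0 as a byte buffer. The offset 0x20 is taken from the SASS in this thread and differs on other architectures; the helper names, the buffer size and the pointer value are hypothetical.

```python
import struct

PARAM_OFFSET = 0x20  # where this thread's SASS reads the first parameter (assumption for other GPUs)

def make_bank0(pointer):
    """Build a toy constant bank with one 64-bit pointer parameter."""
    bank = bytearray(0x40)
    struct.pack_into("<Q", bank, PARAM_OFFSET, pointer)  # little-endian 64-bit
    return bytes(bank)

def read_c(bank, offset):
    """Read one 32-bit word, like the operand c[0x0][offset]."""
    return struct.unpack_from("<I", bank, offset)[0]

bank = make_bank0(0x0000000702A00000)   # made-up device pointer
print(hex(read_c(bank, 0x20)))          # low half, used by the first IMAD
print(hex(read_c(bank, 0x24)))          # high half, used by IMAD.HI.X
```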

Unfortunately, there are not many books with low-level GPU details. The best I have seen is http://www.cudahandbook.com/ ; in particular, it describes those c[][] references:

8.1.4 CONSTANT MEMORY
Constant memory resides in device memory, but it is backed by a different,
read-only cache that is optimized to broadcast the results of read requests to
threads that all reference the same memory location. Each SM contains a small,
latency-optimized cache for purposes of servicing these read requests. Making
the memory (and the cache) read-only simplifies cache management, since the
hardware has no need to implement write-back policies to deal with memory
that has been updated.

Two more books going into low-level details are:
Shane Cook “CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs”
Rob Farber “CUDA Application Design and Development”


    /*0028*/         IMAD R6.CC, R3, R5, c[0x0][0x20];
    /*0030*/         IMAD.HI.X R7, R3, R5, c[0x0][0x24];
    /*0040*/         LD.E R2, [R6]; //load

NVIDIA’s SASS shares many similarities with other instruction sets. IMAD is an integer multiply-and-add operation. In many assembly languages .CC means “set the flags”, in particular generate a carry-out. .X specifies an “extended” operation with carry-in. .HI will produce the upper half of a double-length product. c[0] is a particular constant bank (there are several constant memory banks in GPUs, and how many there are and how they are used changes with architecture). LD.E is a regular load through the cache hierarchy.
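When reading longer listings it can help to split each cuobjdump line into opcode, modifiers and operands before looking up what each part means. The following is a rough sketch, not a real disassembler: the regular expression only covers lines shaped like the three above, and the modifier descriptions are simply the ones given in this reply.

```python
import re

# Meanings as stated in this thread; not an official or complete list.
MODIFIERS = {
    "HI": "upper half of the double-length product",
    "X": "extended operation with carry-in",
}

LINE = re.compile(
    r"/\*(?P<addr>[0-9a-fA-F]+)\*/\s+(?P<op>[A-Z0-9_.]+)\s+(?P<operands>[^;]*);"
)

def parse(line):
    """Return (address, opcode, modifiers, operands) for one SASS line."""
    m = LINE.search(line)
    if m is None:
        raise ValueError("not a recognised SASS line: " + line)
    opcode, *mods = m.group("op").split(".")
    operands = [o.strip() for o in m.group("operands").split(",")]
    return int(m.group("addr"), 16), opcode, mods, operands

addr, opcode, mods, operands = parse("/*0030*/  IMAD.HI.X R7, R3, R5, c[0x0][0x24];")
for mod in mods:
    print(opcode, "." + mod, "->", MODIFIERS.get(mod, "unknown"))
```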

So what the above computation does is: multiply the 32-bit quantities in registers R3 and R5, and add the full 64-bit product to the 64-bit integer stored in c[0][0x24]:c[0][0x20], delivering the result to the register pair R7:R6, which now constitutes a 64-bit address. Note that the load instruction names only R6 as the register containing the address; the use of a register pair, and therefore of R7, is implicit. LD loads the data at the address contained in R7:R6 into the 32-bit register R2. The data loaded could be a 32-bit integer or a 32-bit single-precision floating-point number; which one it is should be clear from how R2 is used afterwards.
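The last point, that the load itself says nothing about the type of the data, can be illustrated with a toy model. This is a sketch under assumptions: `memory` is an ordinary bytes object standing in for global memory, and `ld_e` and `as_float` are hypothetical helper names.

```python
import struct

def ld_e(memory, r6, r7):
    """Sketch of LD.E R2, [R6]: read 32 bits at the 64-bit address R7:R6."""
    addr = (r7 << 32) | r6
    return struct.unpack_from("<I", memory, addr)[0]

def as_float(bits):
    """Reinterpret a 32-bit register value as single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

memory = struct.pack("<f", 1.0)     # four bytes holding the float 1.0
r2 = ld_e(memory, 0, 0)
print(hex(r2))                      # the raw bits, as an integer instruction sees them
print(as_float(r2))                 # the same bits, as a float instruction sees them
```

The register holds the same 32 bits either way; only the instructions that consume R2 decide how they are interpreted.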

Thank you for your reply!

Thank you for your reply.

In case anyone needs it, the asKepler user guide has moved here:
https://help.aliyun.com/document_detail/25852.html?spm=5176.7937245.209071.7.41f06dbeFhk7fp
For the tool itself you probably need to pay for access.

PS. I know the thread is old, but Google always throws me here.