Where can I get a full manual of SASS code that explains the meaning of each instruction? The "CUDA Binary Utilities" document only provides a general explanation of each instruction; e.g. it doesn't explain the meaning of "R1.CC", "IMAD.HI.X", or "LD.E".
What is the meaning of the second instruction? I guess that the first one computes the memory address that each thread should load from, while the third one loads from global memory into a register. I have no idea about the meaning of the second instruction.
I guess that CUDA saves some parameter information, like the grid size, block size, and array base addresses, into constant memory.
In this case, c[0x0][0x20] is the base address of an array. My question is: how can I get this information?
There is also maxas, but its docs don't describe the individual instructions either.
In your code, R6.CC means "write the carry out to the 1-bit CC register", and IMAD.HI.X computes the high 32 bits of the result and adds the carry in from the CC register. LD.E is a load from global memory using the 64-bit address in the R6,R7 pair. The entire code is:
R6 = R3*R5 + c[0x0][0x20], saving the carry to CC
R7 = (R3*R5)>>32 + c[0x0][0x24] + CC
R2 = *((R7<<32) + R6)
The first two instructions multiply two 32-bit values (R3 and R5) and add the 64-bit value (c[0x0][0x24]<<32) + c[0x0][0x20], leaving the 64-bit result in the R6,R7 pair.
c[BANK][ADDR] is a constant memory access; c[0x0][0x20] is the first kernel parameter.
Unfortunately, there are not many books with low-level GPU details. The best I have seen is http://www.cudahandbook.com/ ; in particular, it describes those c[][] references:
8.1.4 CONSTANT MEMORY
Constant memory resides in device memory, but it is backed by a different,
read-only cache that is optimized to broadcast the results of read requests to
threads that all reference the same memory location. Each SM contains a small,
latency-optimized cache for purposes of servicing these read requests. Making
the memory (and the cache) read-only simplifies cache management, since the
hardware has no need to implement write-back policies to deal with memory
that has been updated.
Two more books going into low-level details are:
Shane Cook “CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs”
Rob Farber “CUDA Application Design and Development”
NVIDIA’s SASS shares many similarities with other instruction sets. IMAD is an integer multiply-and-add operation. In many assembly languages .CC means “set the flags”, in particular generate a carry-out. .X specifies an “extended” operation with carry-in. .HI will produce the upper half of a double-length product. c[0] is a particular constant bank (there are several constant memory banks in GPUs, and how many there are and how they are used changes with architecture). LD.E is a regular load through the cache hierarchy.
So what the above computation does is: multiply the 32-bit quantities in registers R3 and R5, and add the full 64-bit product to the 64-bit integer stored in c[0][0x24]:c[0][0x20], delivering the result to the register pair R7:R6, which now constitutes a 64-bit address. Note that the load instruction names only R6 as the address register; the use of a register pair, and thus of R7, is implicit. LD.E loads the data at the address contained in R7:R6 into the 32-bit register R2. The data loaded could be a 32-bit integer or a 32-bit single-precision floating-point number; which one it is should be clear from subsequent uses of R2.