After reading my cublin file, I noticed that I use 28 registers per threads and no local or shared memory (except for kernel parameters). So I tried to store some variables in shared memory and local memory to increase the occupancy rate of my SM… without success.
(is not efficient of course but it was a test), the number of register in the new cubin file stay the same (but the SMEM value increase a lot of course)! External Image
Same by using global memory instead of registers…
I’am trying to compare the ptx files of different tests, but it is not so easy at first…
You can enable nvcc to output ptx with C source in comments with the command line option --opencc-options -LIST:source=on .
What I have found when I try to reduce register usage in this way is that the values need to be read into registers before being used in arithmetic operations: so you still need the register in many cases.
Ok, but I hope only one register per thread is used for all wraps of shared/local memory for arithmetic operations !
In the opposite the total SMEM available should depend of the number of registers. So basically, with 8192 32bit-registers by SM, the shared memory could only use the half of its 16KB (may be my explanation is not very clear… as my english in fact)
Thinking about it more, it seems likely that nvcc (or ptxax) sees that you are loading the same value from shared memory over and over again and stuffs it in a register for you.
By chance today I was paging through the PTX ISA which mentions using ld.volatile.* to read values from shared memory when you don’t want the compiler to cache the value in a register.
Try “volatile value = mat[index];” and then use value in math operations and see if your register use goes down. Otherwise decuda might be your only option :(
I’ve tried the source=on options for ptx generation (-ptx --opencc-options -LIST:source=on) without remarkable success. Nvcc has only commented a couple of strings in the kernel (related to the typical grid-indexed loop), but did not comment even a string in the device function that I call from the kernel (that is inlined of course).