ptxas info: why so many lines?

Hello, in my program I am running 3 kernels.

When I compile with

--ptxas-options="-v"

I get:

...
   0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 13 registers, 368 bytes cmem[0]
...
 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 368 bytes cmem[0]
...
  0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 5408 bytes smem, 376 bytes cmem[0], 24 bytes cmem[2]
...
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 416 bytes cmem[0]
...
 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 5408 bytes smem, 376 bytes cmem[0], 24 bytes cmem[2]
...
  0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 36288 bytes smem, 416 bytes cmem[0], 40 bytes cmem[2]
...
 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 31 registers, 5408 bytes smem, 368 bytes cmem[0], 24 bytes cmem[2]

  1. Why is there so much output when I am launching only 2 kernels?

  2. Is the number of registers per thread? And are the smem and cmem figures per MP?

So, to find the number of registers my program uses, do I just add up all the register counts above and multiply by the number of threads?

Thanks!

Hello,

Each kernel generates its own set of information. I am not sure why you have so many; judging by the output, I would say you have 7 kernels. Are you using a CUDA library in addition to your own code?

The number of registers is per thread, while the shared memory is per block. These numbers help you work out, for a given launch configuration, the resources needed and the maximum theoretical occupancy.
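
To make the per-thread versus per-block distinction concrete, here is a rough host-side sketch that turns one ptxas line into the resources a single block would need. The struct and function names are made up for illustration, the block size of 256 in the note below is an assumed launch configuration, and the hardware's register allocation granularity is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Figures for ONE kernel, copied from its "ptxas info : Used ..." line.
struct KernelUsage {
    int regs_per_thread;         // "Used N registers" is a per-thread count
    std::size_t smem_per_block;  // "N bytes smem" is static shared memory per block
};

// Resources a single thread block of that kernel needs (hypothetical helper type).
struct BlockResources {
    long registers;    // registers for all threads of one block
    std::size_t smem;  // shared memory bytes for one block
};

// Scale the per-thread register count by the block size; shared memory
// is already a per-block figure, so it is passed through unchanged.
// Simplification: real hardware rounds register allocation up per warp.
BlockResources block_resources(const KernelUsage& k, int threads_per_block) {
    assert(threads_per_block > 0);
    return {static_cast<long>(k.regs_per_thread) * threads_per_block,
            k.smem_per_block};
}
```

For the kernel reported as "Used 34 registers, 5408 bytes smem", calling block_resources({34, 5408}, 256) gives 8704 registers and 5408 bytes of shared memory per block.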

Hello,

Hmm, yes, I am using Thrust and cuBLAS. That's why I get all these, right?

Also, to find the total number of registers my program uses, I have to sum up all of the above and multiply by the number of threads, right?

Is the shared memory per block or per MP?

Thanks!

The libraries are usually composed of many kernels, and you get information for each individual kernel.
The shared memory is per block. An MP can have more than one active block. For compute capability 3.5 you can have 2048 active threads per MP (the maximum number of threads per block is 1024). If your block size is 512, this means you need to limit the usage of registers (maybe by allowing spilling) and shared memory so that 4 blocks fit on one MP for maximum occupancy.
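
The "4 blocks on one MP" reasoning can be sketched as a small host-side calculation. This is only a simplified estimate: the limits in MpLimits are my understanding of compute capability 3.5 (with the 48 KB shared memory configuration) and should be checked against the programming guide, the names are hypothetical, and allocation granularity is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Per-MP limits. ASSUMED values for compute capability 3.5; verify them
// for your device, e.g. with cudaGetDeviceProperties.
struct MpLimits {
    int max_threads = 2048;  // resident threads per MP
    int max_blocks = 16;     // resident blocks per MP
    int registers = 65536;   // 32-bit registers per MP
    int smem_bytes = 49152;  // shared memory per MP (48 KB configuration)
};

// How many blocks of one kernel fit on an MP: the tightest of the
// thread, register, shared memory and block-count limits wins.
int blocks_per_mp(int regs_per_thread, int smem_per_block,
                  int threads_per_block, const MpLimits& mp = MpLimits{}) {
    assert(regs_per_thread > 0 && threads_per_block > 0 && smem_per_block >= 0);
    int by_threads = mp.max_threads / threads_per_block;
    int by_regs = mp.registers / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? mp.smem_bytes / smem_per_block
                                     : mp.max_blocks;
    return std::min({by_threads, by_regs, by_smem, mp.max_blocks});
}

// Theoretical occupancy: resident threads divided by the MP's maximum.
double occupancy(int regs_per_thread, int smem_per_block,
                 int threads_per_block, const MpLimits& mp = MpLimits{}) {
    int blocks = blocks_per_mp(regs_per_thread, smem_per_block,
                               threads_per_block, mp);
    return static_cast<double>(blocks * threads_per_block) / mp.max_threads;
}
```

With a block size of 512, the kernel using 34 registers and 5408 bytes of smem fits only 3 blocks per MP under these assumptions (75% occupancy), because registers run out first; at 32 registers per thread it would fit 4. The kernel using 36288 bytes of smem fits just 1 block, because shared memory is the limit there.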

For the ready-made libraries I would not worry. They should already be optimized.

OK, thanks. I just saw that every ptxas info entry states the name of the function.

Just one more thing: to compute the total number of threads, do I sum up all of the above, including the calls to Thrust functions?