Registers per SM GTX 460

Hi !

Well, I’m trying to understand everything about blocks, threads, registers, occupancy, etc., but first things first, so here are my first questions…

Device 0: "GeForce 320M"

  CUDA Driver Version:                           3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major revision number:         1

  CUDA Capability Minor revision number:         2

  Total amount of global memory:                 265027584 bytes

  Number of multiprocessors:                     6

  Number of cores:                               48

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 16384

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Clock rate:                                    0.95 GHz

  Concurrent copy and execution:                 Yes

  Run time limit on kernels:                     Yes

  Integrated:                                    Yes

  Support host page-locked memory mapping:       Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

For this gfx card the maximum number of THREADS PER BLOCK is 512 and the number of REGISTERS per block is 16384. Am I correct if I say that

[b]For each block on the grid I have 512 threads, AND

For each thread I ideally have 32 registers (16384 / 512 = 32)?[/b]

How can I count the number of registers used by each thread in my CUDA program?

  1. 512 is the maximum.
  2. You could have more than 32 regs/thread if your number of threads per block is less than 512. However, there are also some alignment requirements for register allocation, so the final regs/thread you get may be less than 16384 / (threads per block).
  3. There is an option in nvcc (passed on to ptxas) that prints the number of registers used per kernel thread; see the sketch after this list. I have never used it myself. You could use Google to search the forum for that.
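For what it’s worth, here is a minimal sketch of two common ways to check register usage (the kernel name myKernel and file name kernel.cu are just placeholders; the runtime query assumes a toolkit with cudaFuncGetAttributes):

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, just so there is something to inspect.
__global__ void myKernel(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

int main()
{
    // Option 1 (compile time): ask ptxas for a resource report, e.g.
    //     nvcc -Xptxas -v kernel.cu
    // and look for a line like "ptxas info : Used N registers" per kernel.

    // Option 2 (run time): query the attributes of the compiled kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    return 0;
}
[/code]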

Thank you very much. So I’m doing well… Now, the next questions are:

What is the maximum number of blocks?

How many blocks should I launch? Here in the deviceQuery it says that the max size of grid is 65535 x 65535 x 1, but I don’t get this part…

Blocks can be 3-dimensional. This means that threads in a block can have 3D indices. The three indices are available as threadIdx.x, threadIdx.y and threadIdx.z. Of course, much of the time people just need a 1D block, so threadIdx.y and threadIdx.z are often not used.

Similarly, grids can be 3D too. However, on your device only 2D is achievable as the z dimension is limited to 1.
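To make that concrete, here is a minimal sketch (on a current toolkit; the image dimensions and kernel are invented for illustration) of a 2D grid of 1D blocks, and how each thread derives its global position from blockIdx/threadIdx:

[code]
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles one element of a W x H image.
__global__ void scale(float *img, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y;                              // row (one row of blocks per image row)
    if (x < W && y < H)
        img[y * W + x] *= 2.0f;
}

int main()
{
    int W = 1024, H = 768;
    float *img;
    cudaMalloc(&img, W * H * sizeof(float));

    dim3 block(256, 1, 1);                      // 256 threads per block (1D block)
    dim3 grid((W + block.x - 1) / block.x, H);  // 2D grid: blocks across columns x rows
    scale<<<grid, block>>>(img, W, H);
    cudaDeviceSynchronize();

    cudaFree(img);
    return 0;
}
[/code]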

Please read the programming guide and the book CUDA by Example to learn more.

Thanks, I have read those things, but I still don’t get it. I read that I should launch a certain number of blocks to hide latency (or to get the highest occupancy), and that this depends on compute capability. But what would be the equation to find the minimum number of blocks to launch?

You want a minimum number of warps per SM to ensure that latency is fully hidden. Whether this number of warps is achieved by having several blocks run concurrently on each SM, or by having more threads (= more warps) per block, does not matter.

Most instructions seem to have a latency of between 20 and 24 cycles on the various GPU architectures CUDA runs on. Assuming that each instruction uses the result of the previous instruction in the same thread, you thus want enough warps to fill those up to 24 cycles with independent instructions from other warps.

Without double-issue, a compute capability 1.x device can issue one instruction every 4 cycles, so you need 6 active warps to cover the 24 cycles of latency. A compute capability 2.x device can issue two instructions every 2 cycles, so you want 24 active warps there.

If your code allows two instructions to be issued simultaneously, these numbers double. (This probably explains why the maximum number of resident warps per SM was raised for compute capability 2.x devices.) Note that it is possible to write code where, in the inner loop bodies, no instruction depends on the result of the previous one, so you will need fewer warps to completely hide instruction latency there.
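To illustrate that last point, here is a hedged sketch (kernel names and constants are made up): the second kernel keeps two independent accumulators per thread, so consecutive multiply-adds no longer form one long dependency chain and fewer resident warps are needed to hide the arithmetic latency.

[code]
// Dependent chain: every multiply-add waits for the previous one to finish.
__global__ void dependent(float *out, int n)
{
    float a = 1.0f;
    for (int i = 0; i < n; ++i)
        a = a * 1.0001f + 0.5f;       // each iteration depends on 'a'
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

// Two independent chains: the hardware can overlap them, hiding part of the
// latency within a single warp.
__global__ void independent(float *out, int n)
{
    float a = 1.0f, b = 2.0f;
    for (int i = 0; i < n; i += 2) {
        a = a * 1.0001f + 0.5f;       // chain 1
        b = b * 1.0001f + 0.5f;       // chain 2, independent of chain 1
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}
[/code]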

Thanks !

So it doesn’t matter at all how many blocks I launch; what I have to keep in mind is to completely hide latency, i.e. launch as many threads as needed to keep at least one warp issuing every cycle? And once I know how many threads I need to launch, I can arrange them into blocks and threads per block however I want?

Yes.
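For example, a minimal sketch (N and the block size of 256 are arbitrary illustrative choices): pick a block size first, then derive the grid size from the total number of threads you need.

[code]
#include <cstdio>

int main()
{
    int N = 1 << 20;               // total threads wanted (arbitrary for illustration)
    int threadsPerBlock = 256;     // any multiple of the warp size (32) is sensible
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up

    printf("launch <<<%d, %d>>> to cover %d threads\n", blocks, threadsPerBlock, N);
    // myKernel<<<blocks, threadsPerBlock>>>(...);   // hypothetical kernel launch
    return 0;
}
[/code]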

All other things being equal, there is a very slight advantage in having multiple independent blocks per SM. Threads within a block tend to run in lockstep, so they will all issue their memory transactions at the same time; independent blocks might just have a few more computations to do at that moment, helping to even out the load. Also, during a __syncthreads() some warps will sit idle waiting for the last warp in the block; if there are no warps from independent blocks, there won’t be enough warps left to completely hide latency.

Usually, though, there will be other reasons that favor a larger block size, like better reuse of data loaded into shared memory. These will commonly outweigh the slight advantage of multiple smaller blocks described above.
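To illustrate that shared-memory reuse (a hedged sketch; the 3-point smoothing kernel, tile size and names are invented for this example, and the kernel assumes it is launched with blockDim.x == TILE): each block loads a tile of the input once, and every thread then reads its neighbours from fast shared memory instead of global memory, so a larger block reuses each loaded element more often.

[code]
#define TILE 256   // block size; a larger tile reuses each loaded element more

// 3-point average: each output element needs its left and right neighbour.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];   // tile plus one halo element on each side

    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lid = threadIdx.x + 1;                         // local index inside the tile

    tile[lid] = (gid < n) ? in[gid] : 0.0f;            // bulk load, one element per thread
    if (threadIdx.x == 0)                              // left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)                 // right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();                                   // tile is now fully loaded

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
[/code]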