Understanding deviceQuery

Running the deviceQuery sample gives me the following output:

Device 0: “GeForce GTX 650”
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1058 MHz (1.06 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

I don’t completely understand it. I tried looking it up on the internet and ended up more confused.

Here are a few questions I would like answered:

  1. How many blocks and threads can I launch at a time? It says Maximum number of threads per block: 1024, however it also says Max dimension size of a thread block (x,y,z): (1024, 1024, 64). So, can I have a block of dimension (1024, 1024, 64), and would all of those threads run simultaneously?

  2. How do the number of multiprocessors and the number of CUDA cores per multiprocessor affect my programming?

  3. What does Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) mean? How do I calculate the maximum number of blocks, and threads per block, that I can deploy?

Regards

The maximum number of threads per block is 1024, so you must choose the dimensions of the thread block such that x*y*z <= 1024, while also observing each of the individual limits on x, y, and z. So you could have a (1024,1,1) block, or a (1,1024,1) block, or a (32,32,1) block, or a (4,4,64) block, etc. Often, better performance is achieved by using smaller thread blocks containing only 128 or 256 threads. See the Best Practices Guide for guidance.
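To make this concrete, here is a minimal sketch (the kernel name and the particular block shapes are just placeholders) showing several block configurations that all respect x*y*z <= 1024 and the per-axis limits:

```
#include <cuda_runtime.h>

__global__ void myKernel()
{
    // placeholder kernel body
}

int main()
{
    // All of these respect x*y*z <= 1024 and the per-axis limits (1024, 1024, 64).
    dim3 blockA(1024, 1, 1);   // 1024 threads
    dim3 blockB(32, 32, 1);    // 1024 threads
    dim3 blockC(4, 4, 64);     // 1024 threads
    dim3 blockD(256, 1, 1);    // 256 threads; smaller blocks are often a better starting point

    myKernel<<<1, blockD>>>(); // launch a single block of 256 threads
    cudaDeviceSynchronize();
    return 0;
}
```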

For the grid, you just need to stay within the stated limits for each dimension. The theoretical maximum number of threads would be the maximum number of blocks in a grid (2147483647 * 65535 * 65535) times the maximum number of threads per block (1024). I have yet to encounter a real-life application that fully exploits the maximum grid dimensions.
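If you prefer not to hard-code the numbers from deviceQuery, the same limits can be read at run time with cudaGetDeviceProperties; a minimal sketch, assuming device 0 as in your output:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0, as in the deviceQuery output above

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```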

In general, the shape and size of grids and thread blocks are a function of how you map data to threads. For example, you may operate on 2D data in such a way that each 16x16 thread block handles a 16x16 sub-matrix, and the grid is sized to allow for however many blocks are needed to tile the entire matrix. To stay with the example, if the entire matrix is 2048x1024 elements, you would need a grid of 128x64 blocks.
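A rough sketch of that tiling scheme (the matrix size, the doubling kernel, and the identifier names are only illustrative):

```
#include <cuda_runtime.h>

__global__ void scaleMatrix(float *data, int width, int height)
{
    // Each thread handles one element of the 2D matrix.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        data[row * width + col] *= 2.0f;
}

int main()
{
    const int width = 2048, height = 1024;       // the example matrix from above
    float *d_data;
    cudaMalloc(&d_data, width * height * sizeof(float));

    dim3 block(16, 16);                          // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,  // 128 blocks in x
              (height + block.y - 1) / block.y); //  64 blocks in y

    scaleMatrix<<<grid, block>>>(d_data, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The rounded-up division in the grid calculation ensures the last (possibly partial) tile in each dimension is still covered; the bounds check in the kernel keeps those extra threads from writing out of range.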

This clears up a lot of my doubts.
However, two questions remain:

  1. What is the significance of ( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores? How can I use this information?

  2. What do we mean by Max dimension size of a thread block (x,y,z): (1024, 1024, 64), particularly in contrast to the fact that the maximum number of threads in a block is 1024?