Maximum number of threads

If I'm using a GTX 580, what is the maximum number of threads I can launch?

i.e. kernel<<<dim_grid, dim_block>>>

What is the maximum value of dim_grid * dim_block for which I still get the best performance?

Run the program NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery. How you choose the numbers will depend on the details of each problem.

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce GTX 480"

CUDA Driver Version / Runtime Version 4.0 / 4.0

CUDA Capability Major/Minor version number: 2.0

Total amount of global memory: 1536 MBytes (1610285056 bytes)

(15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores

GPU Clock Speed: 1.45 GHz

Memory Clock rate: 1900.00 Mhz

Memory Bus Width: 384-bit

L2 Cache Size: 786432 bytes

Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)

Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes

Concurrent copy and execution: Yes with 1 copy engine(s)

Run time limit on kernels: No

Integrated GPU sharing Host Memory: No

Support host page-locked memory mapping: Yes

Concurrent kernel execution: Yes

Alignment requirement for Surfaces: Yes

Device has ECC support enabled: No

Device is using TCC driver mode: No

Device supports Unified Addressing (UVA): Yes

Device PCI Bus ID / PCI location ID: 133 / 0

Compute Mode:

 < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 480

[deviceQuery] test results…


This is the result of deviceQuery.

It gives me the # of cores and the maximum number of threads per block, but I don't know the # of blocks that can be held by one core.

Each multiprocessor (SM) can hold a maximum of 8 resident blocks.

Each SM can also hold a maximum of 1536 resident threads (on compute capability 2.0).

Actually you only need 480 threads to get all cores working at the same time (480 CUDA cores). But to hide memory accesses etc. you should launch many, many more.

x, y, and z of dim_grid can each be up to 65535. x and y of dim_block can be up to 1024 and z up to 64, with the additional constraint that the total block size dim_block.x * dim_block.y * dim_block.z <= 1024. Check Appendix F of the CUDA C Programming Guide for this.

There are no maximum values after which performance would drop, only minimum values.


From the Wikipedia page I can only infer that there can be 65535 x 65535 x 65535 blocks with 1024 threads per block. That would be the maximum size, but I realize now that your question is different. GPU parallelism behaves differently from other parallel implementations. While in OpenMP or MPI you would expect a maximum number of threads or processes above which there is no benefit, in CUDA you get better performance by splitting the work into small tasks across many threads, because the latency (time spent accessing memory) is hidden. Each block is executed on a multiprocessor, and all the threads in a block share the so-called "shared memory", which the programmer can access directly and which is very fast.

The answer to your question depends on the details of the problem.