GPU: Blocks, Threads, Multiprocessors, and CUDA Cores clarification

Hello all,

I need some clarification on the terms blocks, threads, multiprocessors, and CUDA cores, and what the maximum value for each one is. I have an EVGA GTX 560 Ti 2GB (Fermi) GPU.

From what I gathered: there are 32 CUDA cores per multiprocessor (SM)?
Each SM can execute 46 warps,
each warp can execute 32 threads,
and the number of threads running in parallel matches the number of CUDA cores (384 processor cores in my case).
Are these values correct?

So I have 384/32 = 12 SMs, meaning that I can only have 12 x 46 x 32 = 17664 threads active at once (but not technically running in parallel)? But this does not seem correct.

But wait, what about blocks, where do they fit into the picture? How many blocks can I have? Are they in some ratio to the number of CUDA cores that I have? Is there any point to having more blocks than CUDA cores in terms of performance?

I have read a bunch of the NVIDIA programming guides, and things are very unclear about these matters.

Thank you very much in advance, sorry about all of the questions.

Could you run the device query program and post the results here?
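If you don't have the SDK sample built, here is a minimal sketch that queries the same fields through the standard CUDA runtime API (cudaGetDeviceProperties); everything printed comes from the documented cudaDeviceProp members:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}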

Hey, thanks for the reply.

I’ll run it later on tonight when I get home from work.

This is from my card, an 8800M GTX:

Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Each block is submitted to a multiprocessor, but only 32 threads (the warp size) are executed at a time. Each block can have a maximum of 512 threads, which can be arranged in 3D, while the grid of blocks can only be 2D. So in total, 32 threads x number_of_multiprocessors are executed at a time, but you can submit a kernel with a total of 512 x 65535 x 65535 threads.
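To make that concrete, here is a minimal launch sketch (the kernel name scale and the sizes are just for illustration, not from your code): the launch is split into a grid of blocks, each block runs entirely on one SM, and the SM executes it one 32-thread warp at a time.

#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
    // Global index = block index * block size + thread index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may be partially filled
        data[i] *= factor;
}

int main() {
    const int n = 100000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;                                    // <= 512 on this card
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}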

Device “GeForce GTX 560 Ti”

CUDA Capability: 2.1

Total amount of global memory: 2014MB

(8) Multiprocessors x (48) CUDA Cores/MP: 384 CUDA cores

Warp Size: 32

Max threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

So I understand what this is all saying now, thanks for the help. One last question, though: is there any performance benefit to having more than 384 blocks initialized at a time?

I am not sure about that; you will need to try both ways, many blocks or fewer blocks, and see which one is better. Maybe you meant threads? If so, yes, there is a benefit to having more than 384 threads, because the GPU hides the latency arising from memory reads by pausing stalled threads and starting to execute other ready ones. It always depends on the problem, and you need to test a lot.
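One easy way to run that experiment is a grid-stride loop, which lets you vary the launch configuration without changing the kernel. A minimal sketch (the add kernel is hypothetical, just for illustration):

// Grid-stride loop: each thread walks the array in strides of the total
// thread count, so any grid/block size produces a correct result.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Time e.g. add<<<8, 48>>>(...) (roughly one thread per core) against
// add<<<64, 256>>>(...) (heavily oversubscribed); the oversubscribed launch
// usually wins because the scheduler has spare warps to run while others
// wait on memory.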

I made a nifty Fermi occupancy table. Use this to understand the relationship between block size, register count, shared memory, and occupancy:
http://www.moderngpu.com/intro/workflow.html#Occupancy
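For anyone reading this on a newer toolkit: from CUDA 6.5 on, the runtime can compute the same numbers for you via cudaOccupancyMaxActiveBlocksPerMultiprocessor. A minimal sketch with a placeholder kernel (substitute the kernel you are actually tuning, since occupancy depends on its register and shared-memory usage):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; its resource usage determines the occupancy result.
__global__ void dummy(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    int blockSize = 256;
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, dummy, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float activeWarps = blocksPerSm * blockSize / (float)prop.warpSize;
    float maxWarps = prop.maxThreadsPerMultiProcessor / (float)prop.warpSize;
    printf("Predicted occupancy: %.0f%%\n", 100.0f * activeWarps / maxWarps);
    return 0;
}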