Relationship between Warp, MP, Block, Shared Memory

Hi, Friend!

I am a beginner with CUDA.
As I understand it, a warp is the group of threads that a multiprocessor executes concurrently.

  1. If that is right, does foo<<<100, 32>>> mean that foo will be executed 100 times by a single warp?
    If so, it would mean that foo is executed by only one multiprocessor, even if my graphics card has 4 multiprocessors.

  2. If my card has 4 multiprocessors, does foo<<<100, 32>>> mean that foo will be executed 25 times by each multiprocessor?

  3. If my card has 4 multiprocessors, does foo<<<100, 16>>> also mean that foo will be executed 25 times by each multiprocessor?

  4. If my card has 4 multiprocessors, does foo<<<1, 90>>> mean that foo will be executed once by 3 multiprocessors?

  5. In the 4th case, can all 90 threads use the same shared memory? I know that only the threads on the same multiprocessor can access the same shared memory.

Please forgive my poor English.

A block can only be processed on a single MP; you cannot split a block over multiple MPs. (So if you run fewer blocks than you have MPs, you automatically underutilize the hardware.)

A single MP can process many blocks, if resources permit. An MP's scheduling granularity is the warp, so an MP's queue can look like:
warp 0 from block 1
warp 10 from block 4
warp 2 from block 0
warp 1 from block 1
(assume no particular ordering or scheduling)

It’s a one-to-many relationship.

Threads from different blocks cannot share memory even if they happen to be processed on the same MP. Shared memory is restricted to threads in the same block, not merely threads running on the same MP. Note that the threads of a single block are guaranteed to end up on the same MP.
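So for your 4th and 5th questions: yes, all 90 threads of a <<<1, 90>>> launch can use the same shared memory, because they are in one block. A minimal sketch (hypothetical kernel, just to show the per-block scope of `__shared__`):

```cuda
// Each block gets its OWN copy of buf; threads in the block can read
// what their block-mates wrote, but never another block's copy.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[90];              // per-block shared array
    int t = threadIdx.x;

    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                       // wait for all 90 writes

    if (t == 0) {                          // thread 0 sums its block's copy
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            s += buf[i];
        out[blockIdx.x] = s;
    }
}

// launched e.g. as: blockSum<<<1, 90>>>(d_in, d_out);
```

The `__syncthreads()` barrier only synchronizes threads within one block, which is another way of seeing why shared memory cannot span blocks.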