Shared Memory Confusion

The 8800 GTX has 16 multiprocessors and 16384 bytes of shared memory.
My grid has NG blocks of 8x8 threads each.

I am confused about which of these is true:
a. the 16384 bytes are shared by all the grid blocks, so each 8x8 thread block can use 16384 / NG bytes of shared memory.
b. the 16384 bytes are shared by all the grid blocks across all the multiprocessors, so each 8x8 thread block can use 16384 / NG / 16 bytes of shared memory.
c. each 8x8 thread block can use all 16384 bytes.
d. each multiprocessor has its own 16384 bytes. Even if one thread block eats up all 16384 bytes, the 16 multiprocessors can still work in parallel, each running one active block at a time.

From the documentation I have read, I feel like (a) is correct.
Please help me clear this up.

Hi there,

Instead of choosing between a to d, here is my opinion.
The 16K is not one pool for all multiprocessors.

Each multiprocessor has its own 16K of shared memory.

All thread blocks running on a multiprocessor at the same time must therefore fit within that 16K limit together.

However, only threads of one common thread block can share data through shared memory. They cannot access shared memory of another block.

Why are you talking about 8x8 threads? Perhaps you want 256 or even 768, if it suits you.

So summing up… I think only D is right.

But beware of counting on the full 16384 bytes. AFAIK what you can really use is closer to 16000 bytes than 16*1024, because all arguments passed in the kernel call reside in shared memory, and maybe some other stuff the compiler places there.
You can check this by looking into the *.cubin file (compile with -cubin instead of -c): the smem value is the statically allocated shared memory per block, so there is no need to multiply it by the number of threads per block (that applies to the reg value, which is per thread).
Then add any dynamically allocated shared memory to that.
Voila, you have the usage per block and can determine how many blocks will run concurrently on one MP due to shared memory constraints.
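
The per-block bookkeeping above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the 16384-byte figure is the one from this thread, the cap of 8 resident blocks per multiprocessor is an assumption about this hardware generation, and the function name and sample byte counts are made up for illustration.

```python
SMEM_PER_MP = 16384      # bytes of shared memory per multiprocessor (figure from this thread)
MAX_BLOCKS_PER_MP = 8    # ASSUMED hardware cap on resident blocks per multiprocessor


def blocks_per_mp_by_smem(static_smem, dynamic_smem=0):
    """Blocks that fit on one multiprocessor, looking at shared memory only.

    static_smem:  bytes of statically allocated shared memory per block
    dynamic_smem: bytes of shared memory per block requested at kernel launch
    """
    per_block = static_smem + dynamic_smem
    if per_block > SMEM_PER_MP:
        return 0                      # the block does not fit at all
    if per_block == 0:
        return MAX_BLOCKS_PER_MP      # shared memory is not the limiting factor
    return min(MAX_BLOCKS_PER_MP, SMEM_PER_MP // per_block)


# Hypothetical sample values, not measured from any real kernel
print(blocks_per_mp_by_smem(4096))        # 4
print(blocks_per_mp_by_smem(4096, 4096))  # 2
print(blocks_per_mp_by_smem(16000))       # 1
```

Registers and the thread limit are ignored here, so the real number of concurrent blocks can be lower than what this returns.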

Actually, I am using 3D 8x8x8 thread blocks, since the maximum is 512 threads per block. Why would 256 or 768 be better, or even possible?

So the worst case is only 16 blocks running concurrently, one on each of the 16 multiprocessors…

IIRC, each multiprocessor can handle up to 768 live threads at the same time. If your blocks have 512 threads, only one block fits at a time and you have 256 “wasted” thread slots on the multiprocessor. (This is a very rough description though.)

(384 is a nice size too.)
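
To see where these block sizes come from, here is a rough Python sketch of the thread-count arithmetic. It assumes the 768 live threads per multiprocessor and the 512 threads-per-block maximum quoted in this thread, plus a cap of 8 resident blocks per multiprocessor, which is an assumption about this GPU generation. It ignores shared memory, registers and warp granularity entirely, and the function name is made up.

```python
MAX_THREADS_PER_MP = 768     # live threads per multiprocessor (quoted in this thread)
MAX_THREADS_PER_BLOCK = 512  # per-block maximum (quoted in this thread)
MAX_BLOCKS_PER_MP = 8        # ASSUMED cap on resident blocks per multiprocessor


def occupancy_by_threads(threads_per_block):
    """Return (resident blocks per MP, fraction of thread slots in use)."""
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size must be between 1 and 512 threads")
    blocks = min(MAX_BLOCKS_PER_MP, MAX_THREADS_PER_MP // threads_per_block)
    return blocks, blocks * threads_per_block / MAX_THREADS_PER_MP


for size in (64, 256, 384, 512):
    blocks, occ = occupancy_by_threads(size)
    print(f"{size:3d} threads/block: {blocks} block(s) per MP, occupancy {occ:.0%}")
```

Under these assumptions 256 and 384 both fill all 768 slots, while 512 leaves a third of them empty. Blocks of 8x8 = 64 threads also end up at two thirds, because the assumed 8-block cap is reached before the thread limit.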

D is right. The amount of shared memory a given thread block uses will (among other things) dictate how many thread blocks can run concurrently on a given multiprocessor.
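
As a rough illustration of how several limits interact, the Python sketch below takes the smallest of the per-multiprocessor limits and reports which one is binding. The 16 multiprocessors, 16384 bytes and 768 threads are the figures from this thread; the cap of 8 blocks is an assumption, registers are left out, and the function name and sample inputs are hypothetical.

```python
NUM_MPS = 16               # multiprocessors on the 8800 GTX (from this thread)
SMEM_PER_MP = 16384        # bytes of shared memory per multiprocessor (from this thread)
MAX_THREADS_PER_MP = 768   # live threads per multiprocessor (from this thread)
MAX_BLOCKS_PER_MP = 8      # ASSUMED cap on resident blocks per multiprocessor


def resident_blocks(threads_per_block, smem_per_block):
    """Return (blocks per MP, blocks on the whole GPU, name of the binding limit)."""
    limits = {
        "block cap": MAX_BLOCKS_PER_MP,
        "threads": MAX_THREADS_PER_MP // threads_per_block,
    }
    if smem_per_block > 0:
        limits["shared memory"] = SMEM_PER_MP // smem_per_block
    limiter = min(limits, key=limits.get)   # the smallest limit wins
    per_mp = limits[limiter]
    return per_mp, per_mp * NUM_MPS, limiter


# Hypothetical inputs for illustration only
print(resident_blocks(64, 4096))   # (4, 64, 'shared memory')
print(resident_blocks(512, 1024))  # (1, 16, 'threads')
print(resident_blocks(64, 0))      # (8, 128, 'block cap')
```

The second case matches the worst case mentioned earlier in the thread: one block per multiprocessor, 16 blocks in flight on the whole GPU.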

You all rock!
This answers the question of why my program always has only 67% occupancy: one 512-thread block out of 768 thread slots.

67% is a really good number in practice. You probably will not gain anything from a higher occupancy (though it depends a bit on your problem).