So how much shared memory do we really have? Knowing the CUDA hardware better = better optimization

Hi guys.
I am just starting, so please be patient with my noob questions.
I must admit that I am kind of confused here.

The docs state that each block of threads has its own independent shared memory.

For example, on a GTX 260 (hw rev 1.3) it is 16 KB of shared memory.
This is according to the CUDA Programming Guide 2.3, Appendix A (for hw rev 1.0: "The amount of shared memory available per multiprocessor is 16 KB organized into 16 banks", with no separate info for rev 1.3).

And since you can have a lot of blocks, how much shared memory is there for different parallel blocks to use at once?

In the CUDA 2.1 FAQ I found that for Tesla hardware the 16 KB of shared memory is per multiprocessor,
and ptx_isa_1.4.pdf more or less confirms this by noting that shared memory is part of the multiprocessor.
There is also a statement that 1 block is run by 1 multiprocessor, and "A multiprocessor can execute as many as eight thread blocks concurrently".

So my GTX 260 has 27 multiprocessors.

If each can execute 8 blocks and those 8 blocks share the 16 KB, is that 27 * 16 KB = 432 KB of fast shared memory in total, usable by at most 216 blocks (27 * 8) in parallel in 1 clock cycle, where each block gets its own 2 KB of shared memory (16 KB per multiprocessor / 8 blocks per multiprocessor)?
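To make that arithmetic easy to check, here is a small sketch of how I am computing it. The constants are the numbers quoted above; the even split of shared memory between resident blocks is only my assumption, not something the docs say, and the function name is made up for this post.

```python
# Numbers quoted in this post; the even per-block split is my assumption.
MULTIPROCESSORS = 27       # my GTX 260
SHARED_KB_PER_MP = 16      # shared memory per multiprocessor
MAX_BLOCKS_PER_MP = 8      # "as many as eight thread blocks concurrently"


def shared_memory_budget(mps, kb_per_mp, blocks_per_mp):
    """Return (total KB on chip, max resident blocks, KB per block if split evenly)."""
    total_kb = mps * kb_per_mp
    resident_blocks = mps * blocks_per_mp
    kb_per_block = kb_per_mp / blocks_per_mp
    return total_kb, resident_blocks, kb_per_block


total_kb, resident_blocks, kb_per_block = shared_memory_budget(
    MULTIPROCESSORS, SHARED_KB_PER_MP, MAX_BLOCKS_PER_MP
)
print(total_kb, resident_blocks, kb_per_block)  # 432 216 2.0
```

Whether the hardware really hands out shared memory this way is exactly what I am asking.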

Does that mean that if I know inside the kernel which multiprocessor I am running on, I can design the whole algorithm to work with this 432 KB of fast memory?

If so, the next question is how to split threads into blocks, if the limit per multiprocessor and per block is 512 threads.
I assume that if I set the local work size to 512 threads, it will force the scheduler to put 1 block on 1 multiprocessor.
Would that mean that blocks 1-27 each run in parallel and each has a different 16 KB of shared memory to use? Or is some form of block syncing needed?
How are blocks actually launched? Are 27 kernels loaded onto the 27 multiprocessors and, assuming the same operation in all threads, all launched at once?
If yes, is it safe to assume that every 27th block will therefore run on multiprocessor 0 and use its 16 KB of shared memory?

Also, consider a simple copy kernel a = b where all threads do the same thing.

I also don't know how warp_size fits in. Throughout the docs it is stated that a warp is 32 parallel threads executing the same instruction at once ... but wait a minute ... doesn't one multiprocessor have just 8 scalar cores?
Does that mean that the scheduler just takes 32 threads with the same instruction and schedules 8 of them on each of 4 multiprocessors at once?
In my case that would be 27 multiprocessors / 4 = 6 warps executed in 1 clock? So my card can run just 6 different instructions (although many in parallel) per clock?
But then there is a note in the docs that one multiprocessor can run at most 32 warps at once? What is a warp then? A thread? I am kind of lost here. Is there something like a scheduler document? Or some GUI showing how my code will be scheduled?
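To pin down where I get lost, here are the two ways I can read the warp numbers, written out as arithmetic. This is only a sketch of my own guesses: the variable names are mine, and neither reading is confirmed by anything I quoted above.

```python
WARP_SIZE = 32        # threads per warp, from the docs
CORES_PER_MP = 8      # scalar cores per multiprocessor
MULTIPROCESSORS = 27  # my GTX 260

# Reading A (my guess): one warp is spread over several multiprocessors.
mps_per_warp = WARP_SIZE // CORES_PER_MP            # 32 / 8
warps_per_clock = MULTIPROCESSORS // mps_per_warp   # 27 / 4, rounded down

# Reading B (alternative guess): a warp stays on one multiprocessor and its
# 8 cores need several clocks to get through all 32 threads.
clocks_per_warp_instruction = WARP_SIZE // CORES_PER_MP

print(mps_per_warp, warps_per_clock, clocks_per_warp_instruction)  # 4 6 4
```

Reading B would at least fit the note about many warps per multiprocessor, but I don't know which one matches the real scheduler.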

A simple utility check: CUDA-Z reports 85 GFLOPS for an 8600 GT.
The 8600 GT has 4 multiprocessors, so 4 x 8 cores = 32 cores (threads?) per clock x 1296 MHz clock for kernel code = 41.472 GFLOPS. But CUDA-Z reports twice as much, 82.506 GFLOPS. Maybe the benchmark used the MAD instruction, which counts as 2 operations per clock. Also, in the CUDA FAQ they calculate GFLOPS for Tesla hardware as cores x 3 x clock. What the 3 means is also unclear to me.
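Here is the same GFLOPS calculation as a small sketch, so the factor in question is explicit. The function and parameter names are mine; the "operations per core per clock" factor is the part I am unsure about (1 for my naive count, 2 if a MAD counts as two operations, 3 as in the FAQ formula).

```python
def peak_gflops(multiprocessors, cores_per_mp, clock_mhz, ops_per_core_per_clock):
    """Theoretical peak = cores x clock x operations per core per clock."""
    cores = multiprocessors * cores_per_mp
    return cores * clock_mhz * ops_per_core_per_clock / 1000.0


# 8600 GT numbers from this post: 4 multiprocessors, 8 cores each, 1296 MHz.
for ops in (1, 2, 3):
    print(ops, f"{peak_gflops(4, 8, 1296, ops):.3f}")
# 1 41.472
# 2 82.944
# 3 124.416
```

With a factor of 2 the result is close to, though not exactly, what CUDA-Z shows for my card.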

Also, I can't find anywhere in the NVIDIA docs what the texture memory cache size is (according to…2&p=6 it is 8 KB per multiprocessor?). So is it 9 x 24 KB (3 multiprocessors x 8 KB) = 216 KB of L1 texture cache and, more importantly, 32 KB x 8 = 256 KB of L2 texture cache?
I am more confused than ever.
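For completeness, this is how I arrive at the texture cache totals. Every number here is unconfirmed: the 8 KB per multiprocessor comes from the forum link, the grouping of 3 multiprocessors into 9 clusters is how I read my 27 multiprocessors, and I don't actually know what the 8 units behind the L2 figure are.

```python
# All values are unconfirmed guesses collected for this post.
L1_KB_PER_MP = 8       # texture cache per multiprocessor (from the forum link)
MPS_PER_CLUSTER = 3    # multiprocessors that I assume share one 24 KB L1
CLUSTERS = 9           # 27 multiprocessors / 3
L2_KB_PER_UNIT = 32    # per-unit L2 size
L2_UNITS = 8           # the "x 8"; what these units are is unknown to me

l1_total_kb = CLUSTERS * MPS_PER_CLUSTER * L1_KB_PER_MP
l2_total_kb = L2_KB_PER_UNIT * L2_UNITS
print(l1_total_kb, l2_total_kb)  # 216 256
```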

Thank you for your patience with me, guys.
I just can't wait for some clarification, because only then can I squeeze the maximum performance out of my card :D