How to use blocks

I know that the maximum number of blocks that can run concurrently on a multiprocessor is 8.
So on an 8800 GTX, which has 16 multiprocessors, up to 128 blocks can run concurrently.
I just want to know the limit of blocks we can create (I suppose there is a limit).

I also have a question.
Consider two cases on an 8800 GTX:
1 - we run an application on 128 blocks with 1 thread per block
2 - we run the same application on 4 blocks with 32 threads per block

In my mind, the second case will run faster because it uses full warps. But I am not sure and I need an explanation. Maybe it depends on something else...

Thanks in advance for your answers.

You need to read chapters 2, 3 and appendix A of the CUDA Programming Guide, although it may take more than one reading to fully grasp what it tells you. One thread per block is a non-starter if you want performance: every block becomes a warp with a single active thread, so it would use only 16 of the 128 processors (one per multiprocessor) and probably leave those processors idle at least 75% of the time.
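To put rough numbers on the two cases from the question, here is a small host-side sketch in C++. It is plain arithmetic and makes no GPU calls. It assumes only the 32-thread warp size mentioned in this thread; the function names are made up for illustration and are not part of any CUDA API.

```cpp
// Host-side arithmetic only; nothing here runs on the GPU.
constexpr int kWarpSize = 32;  // threads per warp, as stated in the thread

// Warps needed to hold one block. A partially filled warp still
// occupies a whole warp, so the division rounds up.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Total warps created by a launch of `blocks` blocks.
constexpr int total_warps(int blocks, int threads_per_block) {
    return blocks * warps_per_block(threads_per_block);
}

// Fraction of the lanes in a block's warps that hold a real thread.
constexpr double lane_fraction(int threads_per_block) {
    return static_cast<double>(threads_per_block) /
           (warps_per_block(threads_per_block) * kWarpSize);
}

// Case 1: 128 blocks of 1 thread -> 128 warps, each 1/32 full.
static_assert(total_warps(128, 1) == 128, "case 1 warp count");
static_assert(lane_fraction(1) == 1.0 / 32.0, "case 1 lane fraction");

// Case 2: 4 blocks of 32 threads -> 4 warps, each completely full.
static_assert(total_warps(4, 32) == 4, "case 2 warp count");
static_assert(lane_fraction(32) == 1.0, "case 2 lane fraction");
```

Both cases launch 128 threads in total, but case 1 spreads them over 32 times as many warps. Case 2 packs its warps fully, yet with only 4 blocks it can give work to at most 4 multiprocessors, so neither layout is a good fit for the whole card.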

A warp is 32 threads, the minimum number needed to keep a multiprocessor fully utilized when there are no memory accesses. So to fully utilize an 8800 GTX you need a minimum of 512 threads, organized as 16 blocks of 32 threads, or one block per multiprocessor. To hide memory accesses you need many more; see the Performance Guidelines in Chapter 5.
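The minimum above is simply one full warp per multiprocessor. As a hedged sketch, the same arithmetic in host-side C++ with the multiprocessor count left as a parameter, since it differs between cards. The value 16 used for the 8800 GTX is an assumption taken from the hardware specifications in appendix A of the guide, so check it against your own device. The 8-blocks-per-multiprocessor limit is the one quoted in the question, and the function names are hypothetical.

```cpp
// Host-side arithmetic only; nothing here runs on the GPU.
constexpr int kThreadsPerWarp = 32;            // warp size from the thread
constexpr int kMaxBlocksPerMultiprocessor = 8; // limit quoted in the question

// Fewest threads that give every multiprocessor one full warp.
constexpr int min_threads_one_warp_each(int multiprocessors) {
    return multiprocessors * kThreadsPerWarp;
}

// Most blocks that can be resident on the device at the same time.
constexpr int max_resident_blocks(int multiprocessors) {
    return multiprocessors * kMaxBlocksPerMultiprocessor;
}

// Assumed 8800 GTX configuration: 16 multiprocessors.
static_assert(min_threads_one_warp_each(16) == 512, "one warp per multiprocessor");
static_assert(max_resident_blocks(16) == 128, "resident block limit");
```

This is only the floor for keeping the arithmetic units busy. It says nothing about hiding memory latency, which needs many more warps per multiprocessor.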