exceeding available threads

Hello,

I am totally new to CUDA programming, so I apologize if this topic seems old hat to some of you…

My PC has a 9300M GS, which has a single multiprocessor. I was converting some code of mine that uses an array of 800 elements, and I thought “we don’t have that many threads available (1 multiprocessor -> 768 threads), so I will use 400 threads and have each of them do the job twice”…

But then, purely out of curiosity, I ran the code as if I had 1024 threads, each doing the job once (invoking the kernel with <<< 4 blocks, 256 threads/block >>>), and it worked. Then I realized that the vectoradd example in the SDK also assumes there are many more threads than that and works just fine.

From all this I have come to understand that when a block finishes its task, the hardware reuses it to run a block that could not exist at the same time for lack of threads… I don’t know, does this sound silly? Could someone please clarify this for me? I’ve searched the programming guide, but this is not made clear anywhere…

Thank you very much in advance.

Blocks have a finite lifetime. When a block finishes execution, it is retired. When all the blocks active on an MP (there can be more than one) have retired, the MP is idle. When the MP goes idle, a new set of blocks is scheduled onto it, and the process repeats until every block has finished. Then the kernel is done.
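
To make that concrete, here is a minimal sketch (my own, not taken from the SDK) of a kernel that gives the same result whether the launch has more threads than elements or fewer. The names `addOne` and `runAddOne` are hypothetical, and the array of 800 zeros is just a stand-in for the real data. The bounds check in the loop is what lets a launch of 4 x 256 = 1024 threads work on 800 elements, and the stride is what lets 400 threads cover all 800 by taking two elements each.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so the result does not depend on how many threads the launch provides.
__global__ void addOne(float *data, int n)
{
    int stride = blockDim.x * gridDim.x;   // total number of threads in the launch
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1.0f;                   // threads whose first index is >= n do nothing
}

// Hypothetical helper: runs addOne on n zeros with the given launch shape.
// Returns the number of elements that did not end up as 1.0f, or -1 on a CUDA error.
int runAddOne(int n, int blocks, int threadsPerBlock)
{
    std::vector<float> host(n, 0.0f);
    float *dev = nullptr;
    size_t bytes = n * sizeof(float);

    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;
    cudaError_t err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        addOne<<<blocks, threadsPerBlock>>>(dev, n);
        err = cudaGetLastError();          // reports an invalid launch configuration
    }
    if (err == cudaSuccess)                // this blocking copy also waits for the kernel
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 1.0f) ++wrong;
    return wrong;
}
```

Calling `runAddOne(800, 4, 256)` and `runAddOne(800, 2, 200)` from `main` should both report zero wrong elements. The first launch has 224 threads with nothing to do, and in the second each thread processes two elements. Either way the block count is not limited by what the multiprocessor can hold at once, only the threads per block are (at most 512 on compute capability 1.x hardware).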

This is how pre-Fermi cards operate, correct? What about for Fermi? Do they have a more optimized block scheduler?

Thank you very much for your answer!

The reality sounds slower than the scenario I had guessed…