newbie, multiprocessors

I got a bit confused.

Each block will grab one multiprocessor until completion, right?

But what if we have multiple threads and only one block? Will the threads be evenly distributed across all the multiprocessors, or will only one be used, wasting the other multiprocessors?

When you launch only one block, only one multiprocessor will be used and your kernel will be rather slow.
To utilize the GPU's capacity you will need to run a lot of blocks.
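
For what it's worth, here is a minimal sketch of how one might check this on a given card: it asks the runtime how many multiprocessors device 0 has and compares that with the number of blocks a launch would create. The helper names `blocksFor` and `reportLaunch` are made up for illustration; `cudaGetDeviceProperties` and `multiProcessorCount` are from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: blocks needed so that every element gets its own thread.
inline int blocksFor(int numElements, int threadsPerBlock) {
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// Hypothetical helper: reports how a launch of this size maps onto device 0.
inline void reportLaunch(int numElements, int threadsPerBlock) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return;
    }
    int blocks = blocksFor(numElements, threadsPerBlock);
    printf("%d multiprocessors, %d blocks (%.1f blocks per multiprocessor)\n",
           prop.multiProcessorCount, blocks,
           (double)blocks / prop.multiProcessorCount);
}
```

If the reported ratio is below 1, some multiprocessors have no block to run at all.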

So, how many blocks are a lot? How many threads are a lot?

Comparing the same code on an 8600GT and an 8800GT, with the block/grid dimensions parameterized, I don’t see a lot of variance going from 32 to 320 threads per block, or from 20 down to 2 blocks. The total number of threads is constant in my test.

Thanks in advance!

I’m surprised you see little change when you go down to only 2 blocks on the 8800 GT. That GPU has 14 multiprocessors, which means that with only two blocks, 12 of the 14 (6/7 of the computational hardware) are unused. What calculation does this kernel perform?

It is a test or ‘hello world’ type program. It does a blur operation on an RGB bitmap: sum each source pixel with its eight neighbours, multiply the sum by 0.1111111 (i.e. 1/9), and store the result in a 2D output buffer. The difference in run time across the block sizes I tried is within a factor of two. This is perhaps low arithmetic intensity code.
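
A rough sketch of what such a kernel might look like, reading and writing global memory directly with one thread per output pixel. This is not the poster's actual code: the names `boxBlur3x3`, `clampi` and `launchBoxBlur`, the clamp-to-edge border handling, the pass-through of the fourth channel, and the 16x16 block size are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: clamp v into [lo, hi] so border pixels reuse edge values.
__host__ __device__ inline int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

// One thread per output pixel: sum the 3x3 neighbourhood, scale by 1/9.
__global__ void boxBlur3x3(const uchar4* src, uchar4* dst, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // grid may overhang the image

    float r = 0.0f, g = 0.0f, b = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = clampi(x + dx, 0, width - 1);
            int sy = clampi(y + dy, 0, height - 1);
            uchar4 p = src[sy * width + sx];  // global memory read
            r += p.x; g += p.y; b += p.z;
        }
    }
    const float k = 1.0f / 9.0f;  // the 0.1111111 factor
    dst[y * width + x] = make_uchar4((unsigned char)(r * k + 0.5f),
                                     (unsigned char)(g * k + 0.5f),
                                     (unsigned char)(b * k + 0.5f),
                                     src[y * width + x].w);  // 4th channel copied as-is
}

// Hypothetical launcher: 16x16 threads per block, enough blocks to cover the image.
inline void launchBoxBlur(const uchar4* dSrc, uchar4* dDst, int width, int height) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    boxBlur3x3<<<grid, block>>>(dSrc, dDst, width, height);
}
```

Each thread does nine loads and a handful of adds, which fits the suspicion that the kernel is limited by memory traffic rather than arithmetic.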

Honesty compels me to state that I am probably not making very intelligent use of shared memory; the data is read from and written to global memory using the uchar4 datatype. Shared memory is the next thing I’m going to try making use of. I’m somewhat new to CUDA but not to parallel or threaded programming. I have read about using thousands of threads on this system, and this still rather blows my mind. That was the real motivation behind my question of how many are a lot.

To come back to how many blocks and threads you need to run… It is best to have at least 2 blocks per multiprocessor, with at least 96 threads per block. Keep in mind that on these GPUs the maximum number of threads in a block is 512 and the maximum number of threads per multiprocessor is 768.
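
To make the arithmetic concrete, here is a small host-side sketch using the limits quoted above. The limit of 8 resident blocks per multiprocessor is not stated in this thread; it is an assumption taken from the programming guide for this GPU generation. The function names are hypothetical, and register and shared memory usage, which can lower the result further, are ignored.

```cuda
// Limits for this GPU generation as quoted in the thread.
const int kMaxThreadsPerBlock = 512;
const int kMaxThreadsPerSM    = 768;
// Assumption (not from the thread): at most 8 blocks resident per multiprocessor.
const int kMaxBlocksPerSM     = 8;

// Hypothetical helper: how many blocks of this size can be resident on one
// multiprocessor at once, considering only the thread-count limits.
inline int residentBlocksPerSM(int threadsPerBlock) {
    if (threadsPerBlock <= 0 || threadsPerBlock > kMaxThreadsPerBlock) return 0;
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    return byThreads < kMaxBlocksPerSM ? byThreads : kMaxBlocksPerSM;
}

// Hypothetical helper: smallest grid that gives every multiprocessor
// the suggested number of blocks (2 in the advice above).
inline int minBlocksForDevice(int numMultiprocessors, int blocksPerSM) {
    return numMultiprocessors * blocksPerSM;
}
```

With these numbers, 256 threads per block gives 3 resident blocks (768 threads) per multiprocessor, while 512 threads per block leaves room for only one block and a third of the thread slots empty.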

Actually, this sounds like the perfect application of a 2D texture for the input bitmap.

A lot of blocks are a lot :) Okay, I have spawned 20,000 to 30,000 blocks and have found that performance peaks there. This is far better than spawning 128 blocks and running FOR loops inside. In my experience, running a lot of blocks has always given better performance than running fewer blocks with FOR loops inside…

And Gentlemen,

This is my 300th post. Ain’t I eligible for a standing ovation…? :D