Which entity will execute one block? A single CUDA core or an SM?

I am running CUDA 3.2 on my Fedora 13 box with a GTX 470 (compute capability 2.0). The device query reports the following:

Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768

I need to compute on data chunks of 32 KB each using my graphics card (I know it will run quite a bit slower). Now, I believe I cannot create even two threads inside one block; I can create just one thread per block. But I think I would be able to launch 256 blocks in the grid, as the maximum grid size in both the X and Y dimensions is well beyond 16.

Now the question is: does the card schedule one block to one CUDA core, or to one SM on the chip?

The answer to your question is that blocks are scheduled on multiprocessors, not cores. But you might want to read some documentation, because just about everything in the paragraph above your question is incorrect.

Thanks for replying. I have gone through the CUDA by Example book as well as the C Programming Guide.

I would like to know what was wrong (I think it is correct), because that might affect my program in a big way!

You can create more than one thread per block. You actually want the number of threads per block to be a multiple of 32. So 256 threads per block is already possible. And then you will also want to use lots of blocks.

I wrote that because deviceQuery says the following:

Total amount of shared memory per block: 49152 bytes

And I have to compute on data chunks of about 32 KB each!

OK then:

Your Fermi card can be configured to provide 48 KB of shared memory per block, so that should be no problem if your code uses coalesced memory access to load each 32 KB chunk into shared memory first. While 1 block per MP is not ideal, it need not be slow, depending on how many threads per block you use and how much instruction-level parallelism your code contains.

That is almost certainly not true. Having established that the shared memory usage of your kernel will limit you to one block per multiprocessor, you should be able to have at least 512 threads per block if it makes sense to do so (32768 total registers, with a maximum of 64 registers per thread).

The “X and Y dimensions” of what? If your algorithm demands that you process input data in 32 KB chunks, then you probably want to launch as many blocks as there are 32 KB chunks in the input space; otherwise launch as many blocks as there are multiprocessors on your GPU and have each block process many 32 KB chunks of input space rather than just one.
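As a rough sketch of the second option (the kernel name, chunk count, and device pointers here are hypothetical, just for illustration): launch roughly one block per multiprocessor and let each block loop over the chunks:

```cuda
// Hypothetical host-side launch: about one block per multiprocessor,
// each block then loops over several 32 KB chunks of the input.
int numChunks = 252;                                   // e.g. 252 chunks of input
int numMPs    = 14;                                    // 14 MPs on a GTX 470
int numBlocks = (numChunks < numMPs) ? numChunks : numMPs;

// 256 threads per block cooperate on each chunk.
process_chunks<<<numBlocks, 256>>>(d_in, d_out, numChunks);
```

Inside the kernel, each block would then stride over chunk indices `blockIdx.x`, `blockIdx.x + gridDim.x`, and so on, until all chunks are consumed.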

That is shared memory per block. You can (and should) use many threads to load from global memory into that shared memory, then many threads to process the contents of that shared memory, and then many threads to write it back to global memory. That is the CUDA parallel execution model.
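A minimal sketch of that pattern, assuming each 32 KB chunk holds 8192 floats (the kernel name and the processing step are placeholders, not your actual algorithm):

```cuda
// 32 KB chunk = 8192 floats; fits within Fermi's 48 KB of shared memory per block.
#define CHUNK_WORDS 8192

__global__ void process_chunks(const float *in, float *out, int numChunks)
{
    __shared__ float chunk[CHUNK_WORDS];

    // Block-stride loop: with fewer blocks than chunks, each block handles
    // chunks blockIdx.x, blockIdx.x + gridDim.x, blockIdx.x + 2*gridDim.x, ...
    for (int c = blockIdx.x; c < numChunks; c += gridDim.x) {
        const float *src = in + c * CHUNK_WORDS;

        // Coalesced load: consecutive threads read consecutive words.
        for (int i = threadIdx.x; i < CHUNK_WORDS; i += blockDim.x)
            chunk[i] = src[i];
        __syncthreads();

        // ... all threads of the block process chunk[] cooperatively here ...

        __syncthreads();

        // Coalesced write-back of the results.
        for (int i = threadIdx.x; i < CHUNK_WORDS; i += blockDim.x)
            out[c * CHUNK_WORDS + i] = chunk[i];
        __syncthreads();
    }
}
```

Note the `__syncthreads()` barriers: the whole block must finish loading before anyone processes, and finish processing before the chunk buffer is reused.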

I am not able to understand the above sentence. Can you please elaborate? Also, please tell me where I am wrong in the following paragraph.

The problem is that the entities inside a 32 KB chunk are randomly sized. I cannot just write a line like next_pos = threadIdx.x * some_var, because that would likely take me across an invalid data boundary within the chunk. So I will have to treat each 32 KB chunk as one unit and make each thread walk through the chunk serially to reach the data I need (which is usually somewhere near the end of the chunk). To walk through the chunks serially, I would need to have them in the shared memory of the block to which the thread belongs. (Am I wrong here?)

Now, if I launch the kernel with more than one thread per block, then I guess there will be some global-to-shared memory traffic for each thread that wants to read data (am I wrong here?), which would likely create enough overhead to make ‘multi-threaded blocks’ slower than ‘single-threaded blocks’. So maybe I will have to make do with one thread per block?

The X and Y dimensions of the grid. And yes, the input data will always be in 32 KB chunks. Maybe a bit larger (but less than the 48 KB limit) or a bit smaller (but not as small as 1 or 2 KB), so assuming 32 KB is the safe side here. Now, say I have 252 such chunks of data (I have 14 MPs); then I can do the launch as:

my_kernel<<<252, 1>>>(parameters)

or:

my_kernel<<<14, 18>>>(parameters)

I think the second launch will be slower than the first due to the limited shared memory per block. (Am I wrong again?)

I am confused about the way a launch is done on the GPU, hence so many questions.

Perhaps the following question is simpler than the above:

Can one MP run more than one block at a time from the same kernel?

If I know the answer to the above question, then I think the problem is much simpler for me!

Anyway, thanks for your help so far! :)

Of course. All CUDA hardware supports up to 8 simultaneously active blocks per multiprocessor (resources permitting).

^^ Great!! So I can easily launch 14 * 8 = 112 blocks, which is just great for me! Thanks a lot. I had searched for this in the books and the docs but could not find it. Now I think I will start writing the code and test it later with multiple threads as well.

Really thanks a lot :)

That limit has nothing to do with how many blocks you can use in a single kernel launch. You can always have up to 65535 * 65535 blocks per launch.

Appendix G of the CUDA programming guide.

I knew that! What I did not know was that each MP can simultaneously execute 8 blocks.

Thanks for the reference, but I still don’t remember where it was written. I am unable to find it even after re-reading. In which section do they mention the 8 blocks / MP fact? :|

It is in the second table.

Yes, I had missed it! Thanks a lot, avid. :)