Streaming multiprocessors and threads on GPU

Hello,

I’m taking my first course on GPU programming and I have some questions.

First, I’ve been reading the book “CUDA by Example”, and for the vector addition I saw that the author uses many blocks with many threads each, for instance 128 blocks and 128 threads/block.

add<<<128, 128>>>(param1, param2, param3);

I have a MacBook Pro with a GeForce 320M graphics card, and in the course the professor told me that this graphics card has:

    6 multiprocessors and

    48 cores (which means 8 cores per multiprocessor)

Now, when we run CUDA applications like vector addition, he also told us that it is better to use just 6 BLOCKS ( = #multiprocessors) and 8 THREADS ( = #cores/multiprocessor). The idea is that with these numbers I would be using (and exploiting) all the power of the GPU: each multiprocessor in the GPU has its own STREAMING MULTIPROCESSOR, which controls the data flow to the cores, and each SM manages just 8 cores at a time (at least on my graphics card).

add<<<6, 8>>>(param1, param2, param3);

Having said all this stuff, my question is:

Why does the author of “CUDA by Example” use many blocks and many threads, while my professor says it is better to use just what the GPU physically has?

Maybe I misunderstood something about streaming multiprocessors. I would appreciate an explanation from more experienced people.

Regards!

If this really is what your professor told you then you should not trust him.

Please read the Programming Guide. It is very accessible and gives all the information necessary to write good CUDA programs. It has a full chapter dedicated to the hierarchy of kernels, multiprocessors, blocks, warps, and threads, and gives good advice on choosing the block size.

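I’ve also summed up a few rules about choosing the block size in a recent forum post. In short, on your MacBook Pro you should use a block size that is a multiple of 32 and at least 192. Threads are scheduled in warps of 32, so a block of 8 threads leaves three quarters of its only warp idle, and with so few warps resident the multiprocessor has nothing to switch to while waiting on memory. With a block size of 8 you probably waste more than 90% of the resources of its GPU.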
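
To make the block size advice concrete, here is a minimal sketch of a vector addition launch. The kernel name `addVectors`, the helper `blocksFor`, the problem size, and the choice of 256 threads per block are my own assumptions for illustration, not the code from the book. The point is only that the block size is fixed at a multiple of 32 and the number of blocks follows from the problem size, not from the number of multiprocessors.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration; not the exact add() from the book.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    // Grid-stride loop: every element is covered whatever the grid size is.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}

// Round-up division: blocks needed so that blocks * threadsPerBlock >= n.
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1 << 20;             // assumed problem size
    const int threadsPerBlock = 256;   // multiple of 32 and at least 192
    const size_t bytes = n * sizeof(float);

    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);
    float *da = 0, *db = 0, *dc = 0;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    addVectors<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(da, db, dc, n);

    // Check the launch itself, then the blocking copy that waits for the kernel.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("c[0] = %f, c[n-1] = %f\n", hc[0], hc[n - 1]);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return 0;
}
```

With these assumed numbers the launch uses 4096 blocks of 256 threads, far more blocks than the 6 multiprocessors. The hardware queues the blocks and runs several per multiprocessor, which is what lets it hide memory latency.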

Thank you very much for your quick answer. OK, I’ll read the Programming Guide and try to fully understand how to exploit as much of the GPU’s power as I can.

“Keeping dimBlock.x a multiple of 16 often gives you good coalesced memory access patterns.”

I believe this should be 32 on Fermi devices, since global memory there is accessed per whole warp rather than per half-warp.

“On devices of compute capability 1.x, global memory accesses are processed per half-warp; on devices of compute capability 2.x, they are processed per warp. Adjusting kernel launch configurations that assume per-half-warp accesses might therefore improve performance. Two-dimensional thread blocks, for example, should have their x-dimension be a multiple of the warp size as opposed to half the warp size so that each warp addresses a single cache line when accessing global memory.” (from “Tuning CUDA Applications for Fermi”)
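
As a rough sketch of what the quoted advice means for a launch configuration, the example below uses a two-dimensional block whose x-dimension equals the warp size. The kernel `scaleImage`, the helper `gridFor`, and the image dimensions are hypothetical names and values chosen for illustration. It assumes a row-major image whose width is a multiple of 32, so that each warp reads 32 consecutive floats from one row.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every pixel of a row-major image by factor.
__global__ void scaleImage(float *img, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column, fastest-varying
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] *= factor;
}

// Grid that covers a width x height image with the given block shape (rounded up).
static dim3 gridFor(int width, int height, dim3 block)
{
    return dim3((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y);
}

int main()
{
    const int width = 1024, height = 768;   // assumed sizes; width is a multiple of 32
    const size_t bytes = (size_t)width * height * sizeof(float);

    std::vector<float> hImg((size_t)width * height, 1.0f);
    float *dImg = 0;
    cudaMalloc((void **)&dImg, bytes);
    cudaMemcpy(dImg, hImg.data(), bytes, cudaMemcpyHostToDevice);

    // x-dimension = warp size: threads of a warp share one row and touch
    // 32 consecutive floats (128 bytes).
    dim3 block(32, 8);
    scaleImage<<<gridFor(width, height, block), block>>>(dImg, width, height, 2.0f);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hImg.data(), dImg, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("first pixel = %f, last pixel = %f\n", hImg.front(), hImg.back());
    cudaFree(dImg);
    return 0;
}
```

A block of `dim3(16, 16)` has the same number of threads, but then each warp spans two rows of the block, which is the per-half-warp assumption the quote warns about on compute capability 2.x.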

Thanks again for the answers. After reading a little, a new question came to my mind.

To exploit all the capabilities of the GPU, what do I need to take into account? I mean, do I only need to care about the number of threads launched?

That’s true. I tried to keep it simple and thought it was not worth mentioning Fermi here, as the cache helps mitigate bad memory access patterns there. However, I’ll edit the post accordingly.

The next thing to worry about would be memory access. Try to arrange it so that memory accesses from a half-warp go into the same aligned 64-byte memory segment (or, if that cannot be achieved, into as few segments as possible). Avoid reloading the same data by keeping it in registers. If different threads access the same data, copy it to shared memory first and load from there. Use textures if a traditional cache would be beneficial.
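
Here is a small sketch of the shared memory advice, under assumptions of my own: a hypothetical kernel `sum3` that adds each element to its two neighbours, so every input value is needed by three threads. Each thread loads one element with consecutive addresses (which keeps the half-warp accesses within aligned segments), the block stages the data in shared memory, and the neighbours are then read from there instead of from global memory. The name `sum3Reference`, the block size of 256, and the input size are also just illustrative.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; the kernel assumes blockDim.x == BLOCK

// Hypothetical kernel: out[i] = in[i-1] + in[i] + in[i+1], out-of-range treated as 0.
__global__ void sum3(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];                // block's data plus one halo cell per side
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + 1;                         // tile index, shifted past the left halo

    // Consecutive threads read consecutive addresses: a coalesced load.
    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)                            // left halo
        tile[0] = (g > 0 && g - 1 < n) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)               // right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                                 // tile must be complete before use

    if (g < n)
        out[g] = tile[l - 1] + tile[l] + tile[l + 1];
}

// Host reference with the same boundary handling, used to check the kernel.
static std::vector<float> sum3Reference(const std::vector<float> &in)
{
    const int n = (int)in.size();
    std::vector<float> out(n);
    for (int i = 0; i < n; ++i) {
        float left = (i > 0) ? in[i - 1] : 0.0f;
        float right = (i + 1 < n) ? in[i + 1] : 0.0f;
        out[i] = left + in[i] + right;
    }
    return out;
}

int main()
{
    const int n = 1000;                              // assumed size, not a multiple of BLOCK
    const size_t bytes = n * sizeof(float);
    std::vector<float> hIn(n, 1.0f), hOut(n, 0.0f);

    float *dIn = 0, *dOut = 0;
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dOut, bytes);
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    sum3<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dIn, dOut, n);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<float> ref = sum3Reference(hIn);
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != ref[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

For a window this small the gain is modest; the pattern pays off more as the number of threads sharing each value grows.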

There’s a lot more to say here. Again, the Programming Guide has all the information necessary.

Thanks a lot.