Well, first of all, I'm taking my first course on GPU programming and I have some questions.
I've been reading the book "CUDA by Example", and for vector addition I saw that the author uses many blocks with many threads each, for instance 128 blocks and 128 threads per block:
add<<<128, 128>>>(param1, param2, param3)
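For context, the kernel behind that launch looks roughly like this, as far as I understand it (this is my sketch from memory, not the author's exact code; `N` and the parameter names are my assumptions):

```cuda
#define N (32 * 1024)  // assumed vector length, larger than 128 * 128 threads

// Sketch of a vector-add kernel in the style of the book.
__global__ void add(int *a, int *b, int *c) {
    // Global index of this thread across all blocks.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Grid-stride loop: if N is larger than the total number of
    // launched threads, each thread handles several elements.
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;  // stride = total threads in the grid
    }
}
```

With this pattern the kernel works for any launch configuration, which is why the book's 128×128 choice is not tied to N.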
I have a MacBook Pro with a GeForce 320M graphics card, and in the course the professor told me that this card has:
6 multiprocessors and
48 cores (which means 8 cores per multiprocessor)
Now, when we run CUDA applications like vector addition, he also told us that it is better to use just 6 BLOCKS ( = #multiprocessors) and 8 THREADS ( = #cores per multiprocessor). The idea is that with these numbers I would be using (and exploiting) all the power of the GPU: each streaming multiprocessor (SM) in the GPU controls the data flow on its cores, and each SM manages just 8 cores at a time (at least on my graphics card). So the launch would be:
add<<<6,8>>>(param1, param2, param3)
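By the way, I found that you can query these hardware numbers at runtime with `cudaGetDeviceProperties`; here is a small sketch of what I mean (assuming the CUDA toolkit is installed and device 0 is the 320M):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Fill in the properties of device 0.
    cudaGetDeviceProperties(&prop, 0);

    printf("Name: %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

On my card this should report 6 multiprocessors, matching what the professor said.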
Having said all this, my question is:
Why does the author (of "CUDA by Example") use many blocks with many threads, while my professor says it is better to use just what the GPU physically has?
Maybe I misunderstood something about the streaming multiprocessors; I just need some explanation from more experienced people.