I just started learning CUDA and I'm confused about block and grid sizes. "Standard" C programming with pthreads or fork() is not very difficult for me, but I don't really understand the CUDA architecture. So I have a few questions…
I don't understand how to choose the right block size, the right grid size, and of course the right number of threads. Does this depend on the hardware and/or on the application?
Can we see GPU threads as CPU threads? Or does the number of blocks correspond to the number of CPU threads?
Do you have any information/papers/links comparing the CPU approach vs the GPU approach?
The block and grid sizes depend on your algorithm on the one hand; on the other hand, there are restrictions imposed by the hardware resources. Beyond that, you have to choose them so that the GPU is kept optimally loaded.
A GPU and an x86 CPU are completely different architectures, so CPU threads aren't directly comparable to GPU threads at all.
I'm starting to understand the CUDA logic… but I have a question:
Let's assume we want to process a 16x48 matrix (say, add 1 to each element), so my matrix has 768 elements. I also know, according to the CUDA programming guide, that a block can handle at most 512 threads and that the blocks of a grid are distributed across the multiprocessors (MPs).
So if I have only 1 MP, there's nothing to think about; I launch my kernel as follows:
func <<< 1, dim3(16,48) >>>(...)
However, if I have a card with 30 MPs, should I launch:
func <<<1, dim3(16,32) >>> (..)
for better performance? By better performance I mean a shorter execution time. (I'm thinking of making blocks of 512 threads each.)