I just started learning CUDA and I’m confused about block/grid sizes. “Standard” C programming with pthreads or fork() is not very difficult for me, but I don’t really understand the CUDA architecture. So I have a few questions…
I don’t understand how to set the right block size, the right grid size, and of course the right number of threads! Is this related to the hardware and/or the application?
Can we see GPU threads as CPU threads? Or the number of blocks as the number of CPU threads?
Do you have information/papers/links comparing the CPU approach vs the GPU approach?
The block and grid sizes depend on your algorithm on the one hand; on the other hand, there are restrictions imposed by hardware resources. Beyond that, you have to take care to keep the GPU optimally loaded.
A GPU and an x86 CPU are completely different architectures, so CPU threads aren’t comparable to GPU threads at all.
I’m starting to understand the CUDA logic… but I have a question:
Let’s assume we want to compute over a 16x48 matrix (let’s say, add 1 to each element), so my matrix has 768 elements. I also know, according to the CUDA programming guide, that a block can only handle 512 threads max and that the blocks of a grid are distributed across multiprocessors (MPs).
So if I have only 1 MP, there’s nothing to think about, and I launch my kernel as follows:
func <<< 1, dim3(16,48) >>>(...)
However, if I have a card with 30 MPs, do I need to launch:
func <<< 1, dim3(16,32) >>>(...)
for better performance? By better performance I mean a faster resolution time. (I’m thinking of making blocks that have 512 threads each.)
I’m creating 1 grid (just one) containing X blocks of 16x48x1, right? Where X is determined by… I don’t know =) How can I know? Does it depend on the data to be processed?
I mean: if I want to process a 768-element array and I launch the kernel as follows:
Once you have an understanding of kernel execution parameters, GPU architecture and CUDA as a whole, then I’d suggest looking into changing your block/grid dimensions…