How many blocks shall I initialize?

Hi, I have a really simple question but I can’t seem to find any answers online so far. If I have a really big array, say 500000 elements, is there a difference between

A. launching nearly 1000 blocks with 512 threads per block

    numThread = (N < 512 ? N : 512);
    numBlock = (N+numThread-1)/numThread;
    kernel<<<numBlock, numThread>>>(...);

B. launching only as many blocks as fit on the GPU (for a Tesla C2075 that is 28 or 42 when using 512 threads per block) and, in the kernel, doing something like

    while (tid < N) {
        // ... process element tid ...
        tid += blockDim.x * gridDim.x;
    }

I know this is a really basic question, but I just can’t find any useful answers elsewhere…


Usually we set a fixed number of threads per block, a multiple of 32 since that is the warp size, and we set the number of blocks according to the size of the data (so option A).

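There is no need to do a loop in your kernel. If you launch 977 blocks of 512 threads (500000/512 + 1, with integer division) you will have 500224 threads. That is 224 more than there are elements, so the kernel needs a bounds check such as if (tid < N) to stop the extra threads from touching memory past the end of the array.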
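
To make that arithmetic concrete, here is a minimal sketch of the one-thread-per-element launch from option A. The kernel name `scale`, the `float` data and the "multiply by 2" operation are made up for illustration and are not from the question; only the block/thread computation is taken from it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (illustrative only): one thread handles one element.
__global__ void scale(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {              // guard: the grid has more threads than elements
        data[tid] *= 2.0f;
    }
}

int main()
{
    const int N = 500000;

    // Same computation as in option A of the question.
    int numThread = (N < 512 ? N : 512);
    int numBlock  = (N + numThread - 1) / numThread;   // rounds up

    float *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    scale<<<numBlock, numThread>>>(d_data, N);
    cudaDeviceSynchronize();

    // prints: blocks = 977, threads = 500224
    printf("blocks = %d, threads = %d\n", numBlock, numBlock * numThread);

    cudaFree(d_data);
    return 0;
}
```

Without the `if (tid < n)` guard, the last 224 threads of the final block would index past the end of the array.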

It depends on your problem. I advise trying both and timing them.
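
If you want to try option B, a rough sketch could look like the following. The kernel name `scaleStrided`, the `float` data and the choice of 2 blocks per multiprocessor are assumptions made for illustration (the question mentions 28 or 42 blocks for its card); the right multiplier depends on your kernel and GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (illustrative only): each thread walks the array
// in steps of the total number of threads in the grid.
__global__ void scaleStrided(float *data, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    while (tid < n) {
        data[tid] *= 2.0f;
        tid += stride;
    }
}

int main()
{
    const int N = 500000;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }

    int numThread = 512;
    // Assumption: 2 blocks per multiprocessor; tune this for your own kernel.
    int numBlock = 2 * prop.multiProcessorCount;

    std::vector<float> h_data(N, 1.0f);
    float *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    scaleStrided<<<numBlock, numThread>>>(d_data, N);
    cudaMemcpy(h_data.data(), d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Every element should have been doubled exactly once.
    bool ok = true;
    for (int i = 0; i < N; ++i) {
        if (h_data[i] != 2.0f) { ok = false; break; }
    }
    printf("%d blocks x %d threads: %s\n", numBlock, numThread, ok ? "ok" : "mismatch");

    cudaFree(d_data);
    return 0;
}
```

The same kernel also works unchanged with the 977-block launch from option A (each thread then runs the loop body at most once), which makes it easy to time both configurations against each other.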