Hi, I have a really simple question, but I haven't been able to find an answer online so far. If I have a really big array, say N = 500000 elements, is there a difference between
A. launching nearly 1000 blocks (977 for N = 500000) with 512 threads per block, so that each thread handles one element:
numThread = (N < 512 ? N : 512);
numBlock = (N+numThread-1)/numThread;
kernel<<<numBlock, numThread>>>(...);
B. launching only as many blocks as fit on the GPU at once (for a Tesla C2075 that would be 28 or 42 when using 512 threads per block) and having each thread loop over the array inside the kernel, something like
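int tid = threadIdx.x + blockIdx.x * blockDim.x;
while (tid < N) {
    // ... process element tid ...
    tid += blockDim.x * gridDim.x;  // jump ahead by the total number of threads in the grid
}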
?
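To make the comparison concrete, here is a minimal sketch of the two variants as I understand them. The kernel names, the `blocksFor` helper, and the "multiply by 2" operation are placeholders I made up for illustration, and the 42 blocks in variant B is just the number I guessed above for the C2075, not a verified value.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant A: one thread per element; the grid is sized to cover all n elements.
__global__ void scaleOnePerThread(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {                       // the last block may be only partly used
        data[tid] *= 2.0f;
    }
}

// Variant B: fixed-size grid; each thread strides through the array.
__global__ void scaleGridStride(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    while (tid < n) {
        data[tid] *= 2.0f;
        tid += blockDim.x * gridDim.x;   // total number of threads in the grid
    }
}

// Hypothetical host helper: blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int N = 500000;
    const int numThread = (N < 512 ? N : 512);
    assert(blocksFor(N, numThread) == 977);

    std::vector<float> host(N, 1.0f);
    float *dev = NULL;
    cudaMalloc((void **)&dev, N * sizeof(float));
    cudaMemcpy(dev, host.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    // Variant A: 977 blocks of 512 threads.
    scaleOnePerThread<<<blocksFor(N, numThread), numThread>>>(dev, N);

    // Variant B: 42 blocks (assumed figure from the question), same block size.
    scaleGridStride<<<42, numThread>>>(dev, N);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(host.data(), dev, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Each element was doubled once by each variant: 1 * 2 * 2 = 4.
    int mismatches = 0;
    for (int i = 0; i < N; ++i) {
        if (host[i] != 4.0f) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Both kernels should produce the same result, so what I am asking about is purely whether one launch configuration runs faster or is otherwise preferable.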
I know this is a really basic question, but I just can’t find any useful answers elsewhere…
Thanks