In the very simple example shown below, I understand that (N+127)/128 launches enough blocks to cover all N elements, and that while (tid < N) keeps threads whose index is N or higher from doing any work.
add<<<(N+127)/128, 128>>>(d_a, d_b, d_c);

__global__ void add(int *a, int *b, int *c) {  // element type assumed to be int
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        // … calculation …
    }
}
My question is the following.

Is it much different if I use if (tid < N) instead? I initially thought there should be no difference…
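One way to see where the two guards could differ is a small host-side emulation in plain C++, which runs without a GPU. It is only a sketch: it assumes the elided calculation in the while version ends with tid += blockDim.x * gridDim.x, which the snippet above does not show, and all function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a 1D kernel launch: each (block, thread) pair is
// visited by ordinary loops, so no GPU is needed. All names are hypothetical.

// Ceiling division: smallest block count with blocks * threadsPerBlock >= n.
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Counts the elements that were touched exactly once.
static int countTouchedOnce(const std::vector<int>& hits) {
    int covered = 0;
    for (int h : hits) {
        if (h == 1) ++covered;
    }
    return covered;
}

// Emulates `if (tid < n)`: every thread handles at most one element.
int coveredWithIf(int n, int blocks, int threadsPerBlock) {
    std::vector<int> hits(n, 0);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            int tid = thread + block * threadsPerBlock;
            if (tid < n) hits[tid] += 1;
        }
    }
    return countTouchedOnce(hits);
}

// Emulates `while (tid < n)` ASSUMING the loop body advances tid by the total
// thread count (blockDim.x * gridDim.x); the original snippet elides this step.
int coveredWithStrideLoop(int n, int blocks, int threadsPerBlock) {
    std::vector<int> hits(n, 0);
    const int stride = threadsPerBlock * blocks;
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            int tid = thread + block * threadsPerBlock;
            while (tid < n) {
                hits[tid] += 1;
                tid += stride;
            }
        }
    }
    return countTouchedOnce(hits);
}
```

With the (N+127)/128 launch from the example, both guards touch every element exactly once: for N = 1000, blocksFor(1000, 128) gives 8 blocks and both functions return 1000. Under the stated assumption, a difference only appears when fewer blocks are launched. With 4 blocks of 128, the if version covers 512 elements while the stride loop still reaches all 1000.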

For a more complex, multi-dimensional problem:

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int k = threadIdx.z;
    if (i < imax && j < jmax && k < kmax) {  // <-- can I guard three different indices like this?
    }

With the above kernel, these launches run without problems:

    <<<(1,1,25),(140,140)>>>(d_a, d_b, d_c);
    <<<(2,2,25),( 70, 70)>>>(d_a, d_b, d_c);
    <<<(4,4,25),( 35, 35)>>>(d_a, d_b, d_c);

But for <<<(3,3,25),(47,47)>>>(d_a, d_b, d_c); I get a “CUDA error: unknown error.” message.
I am guessing my “if” statement is not working…
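For what it's worth, the per-axis thread counts of the four launches can be compared with a small host-side sketch in plain C++ (no GPU needed). The function names are made up for illustration, and imax = jmax = 140 and kmax = 25 are assumptions, since the post does not state them. The product of the two per-axis numbers is the same whichever one is the grid and whichever is the block, so the sketch does not depend on that order.

```cpp
#include <cassert>

// Total number of threads along one axis: grid size times block size.
// (Hypothetical helper; the product is symmetric in its two arguments.)
int extent(int gridDim, int blockDim) {
    assert(gridDim > 0 && blockDim > 0);
    return gridDim * blockDim;
}

// Threads along one axis whose index is >= limit, i.e. the ones that the
// guard in the kernel has to reject.
int excess(int gridDim, int blockDim, int limit) {
    const int total = extent(gridDim, blockDim);
    return total > limit ? total - limit : 0;
}

// Emulates the three-way guard from the kernel for a single thread index.
bool inRange(int i, int j, int k, int imax, int jmax, int kmax) {
    return i < imax && j < jmax && k < kmax;
}
```

Under the assumed limit of 140, the first three launches give exactly 140 threads along x and y (1*140, 2*70, 4*35), so no thread ever fails the guard. Only the failing launch gives 3*47 = 141, so it is the only one in which the guard actually has to reject a thread (index 140).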
Is there any rule for limiting excess threads with an if or while statement?
Any help is much appreciated. Many thanks in advance.