Launching and limiting thread counts that are not multiples of the block size

In the very simple example shown below,

I understand that (N+127)/128 launches enough blocks to cover all N elements, and that while (tid < N) keeps threads whose index runs past N from doing any work.

add<<<(N+127)/128, 128>>>(d_a, d_b, d_c);

__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        … calculation …
    }
}

My questions are the following.

  1. Is this much different if I use if (tid < N) instead? I initially thought there would be no difference…

  2. For a problem with more dimensions,

int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;
int k = threadIdx.z;
if (i < imax && j < jmax && k < kmax) {   // <----- can I limit three different ints like this?

}

  3. With the above kernel, if I launch <<<(1,1,25),(140,140)>>>(d_a, d_b, d_c); no problem
    and <<<(2,2,25),( 70, 70)>>>(d_a, d_b, d_c); no problem
    and <<<(4,4,25),( 35, 35)>>>(d_a, d_b, d_c); no problem

    but for <<<(3,3,25),(47,47)>>>(d_a, d_b, d_c); I got a “CUDA error: unknown error.” message.

I am guessing my “if” statement is not working…

Is there any rule for limiting excess threads with an if or while statement?

Any help is much appreciated. Many thanks in advance.
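
For reference, the two guards from question 1 can be compared side by side. This is only a minimal sketch: it assumes the elided calculation is an element-wise sum of int arrays and that the while version advances tid by the total number of launched threads (a grid-stride loop). The names add_if, add_while and run_add are hypothetical and not from the original code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant 1: one element per thread; surplus threads in the last block do nothing.
__global__ void add_if(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}

// Variant 2: grid-stride loop; each thread handles tid, tid + stride, ...
// Without the increment, every thread with tid < n would loop forever.
__global__ void add_while(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;  // total number of launched threads
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += stride;
    }
}

// Hypothetical host helper: runs one variant with 128 threads per block and
// reports whether every element of c equals a + b afterwards.
bool run_add(bool use_while, int n, int blocks) {
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    size_t bytes = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_c, 0, bytes);  // untouched elements stay 0

    if (use_while) add_while<<<blocks, 128>>>(d_a, d_b, d_c, n);
    else           add_if<<<blocks, 128>>>(d_a, d_b, d_c, n);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    if (err != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) {
        if (c[i] != 3 * i) return false;
    }
    return true;
}
```

With (N+127)/128 blocks the loop body runs at most once per thread, so both variants give the same result. The difference only shows up when fewer threads than elements are launched: the loop version still covers every element, while the if version leaves the tail untouched.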

None of those launches should work: you have the block and grid dimensions reversed in all of those kernel launch statements, and the grid and block dimensions as written are illegal…
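
To make the point about ordering and limits concrete: the execution configuration is written as kernel<<<grid, block>>>(...), grid first. Read that way, the (140,140) in the first post would be a single block of 140 × 140 = 19,600 threads. The sketch below checks a configuration against the limits the device itself reports instead of hard-coding them. The helper names launch_config_is_legal and legal_on_device0 are hypothetical; the cudaDeviceProp fields are the ones from the CUDA runtime API.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true if the grid and block shapes fit the limits in `prop`.
bool launch_config_is_legal(dim3 grid, dim3 block, const cudaDeviceProp &prop) {
    unsigned long long threads =
        (unsigned long long)block.x * block.y * block.z;

    // Total threads per block must be at least 1 and within the device limit.
    if (threads == 0 || threads > (unsigned long long)prop.maxThreadsPerBlock)
        return false;

    // Each block dimension has its own upper bound.
    if (block.x > (unsigned)prop.maxThreadsDim[0] ||
        block.y > (unsigned)prop.maxThreadsDim[1] ||
        block.z > (unsigned)prop.maxThreadsDim[2])
        return false;

    // Each grid dimension must be at least 1 and within its upper bound.
    if (grid.x == 0 || grid.y == 0 || grid.z == 0)
        return false;
    if (grid.x > (unsigned)prop.maxGridSize[0] ||
        grid.y > (unsigned)prop.maxGridSize[1] ||
        grid.z > (unsigned)prop.maxGridSize[2])
        return false;

    return true;
}

// Queries device 0 and checks a configuration against its reported limits.
bool legal_on_device0(dim3 grid, dim3 block) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    return launch_config_is_legal(grid, block, prop);
}
```

Calling legal_on_device0(dim3(47, 47), dim3(3, 3, 25)) before a launch is a cheap way to rule out an illegal configuration as the source of an error.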

Sorry, I put them in the wrong order.

And of course I defined the grid and block using dim3.

imax = 139; jmax = 139; kmax = 24;
int block_x = 2, block_y = 2, block_z = kmax + 1;
int grid_x = (imax+1)/block_x + ((imax+1)%block_x == 0 ? 0 : 1) + 1;
int grid_y = (jmax+1)/block_y + ((jmax+1)%block_y == 0 ? 0 : 1) + 1;
dim3 dimGrid2(grid_x, grid_y), dimBlock2(block_x, block_y, block_z);

kernel<<<dimGrid2, dimBlock2>>>(…, imax, jmax, kmax, …);   // this is the actual calling statement for the kernel

In this setting, if I launch with “int block_x = 1, block_y = 1, block_z = kmax + 1”

    <<<(140,140),(1,1,25)>>> (...,imax,jmax,kmax,....);     no problem

and with “int block_x = 2, block_y = 2, block_z = kmax + 1”

    <<<( 70,  70),(2,2,25)>>> (...,imax,jmax,kmax,....);     no problem

and with “int block_x = 4, block_y = 4, block_z = kmax + 1”

    <<<( 35,  35),(4,4,25)>>> (...,imax,jmax,kmax,....);     no problem

(Again, I am showing the actual numbers inside <<< >>>.)

In fact, I don’t even need any “if” statement in the kernel for the three cases shown above, because the block size divides 140 evenly…

However, if I want to launch with “int block_x = 3, block_y = 3, block_z = kmax + 1”

    <<<(47,47),(3,3,25)>>> (...,imax,jmax,kmax,....);

then 47 × 3 = 141 threads are launched along each of x and y for only 140 points, so I need to idle the extra ones in the kernel using an “if” or “while” statement.

__global__ void kernel(…) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int k = threadIdx.z;

    if (i <= imax && j <= jmax && k <= kmax) {   // <----- can I limit three different ints like this?
        .......
    }
}

And somehow CUDA gives me an ‘unknown error’ message for

    the case of <<<(47,47),(3,3,25)>>> (...,imax,jmax,kmax,....);

Is there any rule for limiting excess threads with an “if” or “while” statement?

Any help is much appreciated. Many thanks in advance.
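
As a point of comparison, here is a minimal, self-contained sketch of the same combined guard with the elided body replaced by a trivial write, so the guard can be tested on its own. The kernel mark, the helper ceil_div and the driver count_marked are hypothetical stand-ins, not the original code. The grid size uses plain ceiling division, without the trailing + 1 in the formula above, which is what gives the (47,47) launch shown. Errors are checked right after the launch and again after synchronizing, which separates launch-configuration failures from failures inside the kernel.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Ceiling division: smallest number of blocks of size `block` that covers `n` items.
__host__ __device__ inline int ceil_div(int n, int block) {
    return (n + block - 1) / block;
}

// Each in-range thread writes 1 to its own cell; out-of-range threads do nothing.
__global__ void mark(int *out, int imax, int jmax, int kmax) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int k = threadIdx.z;
    if (i <= imax && j <= jmax && k <= kmax) {
        int ni = imax + 1, nj = jmax + 1;
        out[(k * nj + j) * ni + i] = 1;
    }
}

// Hypothetical driver: returns the number of cells marked, or -1 on a CUDA error.
long count_marked(int imax, int jmax, int kmax, int block_x, int block_y) {
    int ni = imax + 1, nj = jmax + 1, nk = kmax + 1;
    size_t cells = (size_t)ni * nj * nk;

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, cells * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_out, 0, cells * sizeof(int));

    dim3 block(block_x, block_y, nk);
    dim3 grid(ceil_div(ni, block_x), ceil_div(nj, block_y));
    mark<<<grid, block>>>(d_out, imax, jmax, kmax);

    cudaError_t err = cudaGetLastError();                    // bad launch configuration
    if (err == cudaSuccess) err = cudaDeviceSynchronize();   // failure while running
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return -1;
    }

    std::vector<int> out(cells);
    cudaMemcpy(out.data(), d_out, cells * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    long marked = 0;
    for (int v : out) marked += v;
    return marked;
}
```

With imax = jmax = 139 and kmax = 24, a 3 × 3 × 25 block gives a 47 × 47 grid, and the guard should leave exactly 140 × 140 × 25 cells marked. If a stripped-down version like this runs cleanly while the full kernel does not, the guard is probably not the problem and the elided body is the next place to look.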
