calling cudaThreadSynchronize from a kernel

Hello,

Is it possible to call cudaThreadSynchronize from a CUDA kernel?

I have a CUDA kernel that handles a big loop in parallel. I would then like it to call another kernel to handle a subset of the data (that kernel might in turn call yet another kernel), and use cudaThreadSynchronize to ensure that all the kernels finish before execution continues.

Thanks and sorry for the newbie question.

xarg

You can’t launch a kernel from within a kernel. Note that on current hardware only one kernel can execute on the GPU at a time.

Thanks for the reply, jjp! It’s a shame about this limitation, though.

So, if I have something like this:

[codebox]
for (int i = 0; i < someZ; ++i)
{
    for (int k = 0; k < someZ; ++k)
    {
        for (int j = 0; j < someZ; ++j)
        {
            // Some processing here
            for (int x = 0; x < someX; ++x)
            {
                for (int y = 0; y < someY; ++y)
                {
                    // Some processing here
                }
            }
        }
    }
}
[/codebox]

So, in this scenario, I can have a kernel handling the outer three loops in parallel, but the inner loops will have to be executed by each thread, right? How does one currently handle this kind of situation?

Many thanks,

xarg

Programming with CUDA is different from programming with OpenMP. In OpenMP you can nest parallel directives, like

#pragma omp parallel for …
for( i = 1 ; i <= nx ; i++ ){
    #pragma omp parallel for …
    for( j = 1 ; j <= ny ; j++ ){
        …
    }
}

However, with CUDA you cannot invoke a kernel function inside another kernel function, since the hardware schedules all the blocks you configure in a single kernel launch.

What you can do is parallelize the whole loop nest:

for( i = 1 ; i <= nx ; i++ ){
    x[i] = …
    for( j = 1 ; j <= ny ; j++ ){
        y[j] = …
    }
}

[codebox]
__global__ void foo( float *x, float *y, int nx, int ny )
{
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;

    if ( (xIndex < ny) && (yIndex < nx) ) {
        x[yIndex] = ...
        y[xIndex] = ...
    }
}

int main()
{
    float *x, *y ;
    int nx, ny ;

    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    dim3 grid( (ny+BLOCK_DIM-1) / BLOCK_DIM, (nx+BLOCK_DIM-1) / BLOCK_DIM, 1);

    foo<<< grid, threads >>>( x, y, nx, ny );
}
[/codebox]

Thanks for the reply. However, what about the scenario I mentioned in my last reply? I can do the outer three loops in parallel and do the required processing, but then the inner two loops cannot be made parallel, right?

If I could launch another kernel, I could have handled the outer three loops in my top kernel and then launched another kernel with the updated parameters to handle the inner two loops. Or have I missed something here (quite likely!)?

Thanks,

xarg

For your question:

[codebox]
for (int i = 0; i < someZ; ++i){
    for (int k = 0; k < someZ; ++k){
        for (int j = 0; j < someZ; ++j){
            // Some processing here
            for (int x = 0; x < someX; ++x){
                for (int y = 0; y < someY; ++y){
                    // Some processing here
                }// for y
            }// for x
        }// for j
    }// for k
}// for i
[/codebox]

I think your problem is “how to use the grid configuration to label a 3-level nested loop”.

My approach is to access the data slice by slice.

For example: suppose I have 3D data X(0:n1-1, 0:n2-1, 0:n3-1) and I want to do a transpose operation, say X(i,j,k) → Y(j,k,i) where Y(0:n2-1, 0:n3-1, 0:n1-1).

First I define a square block (for example, dim3 threads(16, 16, 1)), then I cut the data into x-z slices:

one x-z slice (y is fixed) requires M = Gx * Gz blocks, where Gx = ceil(n1/16) and Gz = ceil(n3/16),

so the total number of blocks required is M * n2.

Finally I organize the M * n2 blocks into a 2D grid configuration.

For example: find k1, k2 such that k1 * k2 - n2 <= 1, then issue dim3 grid( k2*Gz, k1*Gx, 1 ).

the pseudo-code is

[codebox]
void foo( float *X, unsigned int n1, unsigned int n2, unsigned int n3 )
{
    unsigned int Gx, Gz, k1, k2 ;
    double db_n2 = (double) n2 ;

    /* Gx = number of blocks needed along the x-axis
     * Gz = number of blocks needed along the z-axis
     * we call Gx x Gz blocks a coarse grid
     */
    Gx = (n1 + BLOCK_DIM-1) / BLOCK_DIM ;
    Gz = (n3 + BLOCK_DIM-1) / BLOCK_DIM ;

    /* since one coarse grid covers one x-z slice, we need n2 coarse grids
     * to cover X. In order to save resources, we want to find two integers
     * k1, k2 such that
     *     k1 * k2 - n2 <= 1
     * for example:
     *     n2 = 7  ==> k1 = 2 and k2 = 4
     *     n2 = 13 ==> k1 = 2 and k2 = 7
     */
    int max_k1 = (int) floor( sqrt(db_n2) ) ;
    for ( k1 = max_k1 ; 1 <= k1 ; k1-- ){
        k2 = (unsigned int) ceil( db_n2/((double)k1) ) ;
        if ( 1 >= (k1*k2 - n2) ){
            break ;
        }
    }

    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    dim3 grid( k2*Gz, k1*Gx, 1 );

    foo_kernel<<< grid, threads >>>( X, n1, n2, n3, Gx, Gz, k2 ) ;
}
[/codebox]

[codebox]
__global__ void foo_kernel( float *X,
        unsigned int n1, unsigned int n2, unsigned int n3,
        unsigned int Gx, unsigned int Gz, unsigned int k2 )
{
    float tmp1, tmp2 ;
    unsigned int s1, s2, t1, t2, xIndex, yIndex, zIndex, index_in ;

    /* step 1: transform the grid index into a 3D coarse-grid index
     *     blockIdx.x = Gz * s1 + t1
     *     blockIdx.y = Gx * s2 + t2
     * where (s1, s2) indexes the y-direction and (t1, t2) indexes the
     * x-z slice (local block ID), so that
     *     s1 = floorf( blockIdx.x / Gz )
     *     t1 = blockIdx.x - Gz*s1
     *     s2 = floorf( blockIdx.y / Gx )
     *     t2 = blockIdx.y - Gx*s2
     */
    tmp1 = __uint2float_rz( blockIdx.x ) ;
    tmp2 = __uint2float_rz( Gz ) ;
    tmp1 = floorf( tmp1 / tmp2 ) ;
    s1 = __float2uint_rz( tmp1 ) ;
    t1 = blockIdx.x - Gz*s1 ;

    tmp1 = __uint2float_rz( blockIdx.y ) ;
    tmp2 = __uint2float_rz( Gx ) ;
    tmp1 = floorf( tmp1 / tmp2 ) ;
    s2 = __float2uint_rz( tmp1 ) ;
    t2 = blockIdx.y - Gx*s2 ;

    // step 2: extract the data indices
    yIndex = s2*k2 + s1 ;
    zIndex = t1 * BLOCK_DIM + threadIdx.x ;
    xIndex = t2 * BLOCK_DIM + threadIdx.y ;

    if ( (yIndex < n2) && (xIndex < n1) && (zIndex < n3) ){
        index_in = (xIndex * n2 + yIndex) * n3 + zIndex ;
        for (int x = 0; x < someX; ++x){
            for (int y = 0; y < someY; ++y){
                // processing data element X(i,j,k) = X[index_in]
            }// for y
        }// for x
    }
}
[/codebox]

If you can map each thread to an index of the outer three loops, then you just need one kernel function and can handle the remaining two loops inside it.

The above is just a framework. Maybe you can write out the operations in your inner two loops and then we can look for an approach to those as well.

Thanks for your detailed reply. I was doing it as follows:

[codebox]
const unsigned block_size = 320;

// Calculate the size of the grid needed to process all the blocks
float numParams = sizeX * sizeY * sizeZ;
const unsigned grid_size = (unsigned)(ceil(numParams/(float)(block_size)));

// Set up our grid and thread blocks
dim3 block(block_size, 1, 1);
dim3 grid(grid_size, 1, 1);
[/codebox]

However, in my kernel I am having problems trying to figure out how to index everything correctly. So, my original CPU code was:

[codebox]
for (int i = 0; i < sizeX; ++i)
{
    for (int j = 0; j < sizeY; ++j)
    {
        for (int k = 0; k < sizeZ; ++k)
        {
        }
    }
}
[/codebox]

What would be the corresponding i, j, k values in the kernel code for each thread, given the setup I described above?

Cheers,

xarg

Never mind this. I figured it out.