calling cudaThreadSynchronize from a kernel

Hello,

Is it possible to call cudaThreadSynchronize from a CUDA kernel?

I have a CUDA kernel that handles a big loop in parallel. I would then like it to call another kernel to handle a subset of the data (that kernel might in turn call yet another kernel), and use cudaThreadSynchronize to ensure that all the kernels finish before execution continues.

Thanks and sorry for the newbie question.

xarg

You can’t launch a kernel from within a kernel. Note that on current hardware only one kernel can execute on the GPU at a time.

Thanks for the reply, jjp! It’s a shame about this limitation, though.

So, if I have something like this:

[codebox]
for (int i = 0; i < someZ; ++i)
{
    for (int k = 0; k < someZ; ++k)
    {
        for (int j = 0; j < someZ; ++j)
        {
            // Some processing here
            for (int x = 0; x < someX; ++x)
            {
                for (int y = 0; y < someY; ++y)
                {
                    // Some processing here
                }
            }
        }
    }
}
[/codebox]

So, in this scenario, I can have a kernel handling the outer three loops in parallel, but the inner loops will have to be executed by each thread, right? How does one currently handle this kind of situation?

Many thanks,

xarg

Programming with CUDA is different from programming with OpenMP. In OpenMP you can nest parallel directives, like

#pragma omp parallel for …
for( i = 1 ; i <= nx ; i++ ){
    #pragma omp parallel for …
    for( j = 1 ; j <= ny ; j++ ){
        …
    }
}

However, with CUDA you cannot invoke a kernel function inside another kernel function, since the hardware schedules all the blocks you configure in a single kernel launch.

What you can do is parallelize the whole loop nest:

for( i = 1 ; i <= nx ; i++ ){
    x[i] = …
    for( j = 1 ; j <= ny ; j++ ){
        y[j] = …
    }
}

[codebox]
__global__ void foo( float *x, float *y, int nx, int ny )
{
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;

    if ( (xIndex < ny) && (yIndex < nx) ) {
        x[yIndex] = ...
        y[xIndex] = ...
    }
}

int main()
{
    float *x, *y ;
    int nx, ny ;

    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    dim3 grid( (ny+BLOCK_DIM-1) / BLOCK_DIM, (nx+BLOCK_DIM-1) / BLOCK_DIM, 1);

    foo<<< grid, threads >>>( x, y, nx, ny );
}
[/codebox]

Thanks for the reply. However, what about the scenario I mentioned in my last reply? I can do the outer three loops in parallel and do the required processing, but then the inner two loops cannot be made parallel, right?

If I could launch another kernel, I could have handled the outer three loops in my top kernel and then launched another kernel with the updated parameters to handle the inner two loops. Or have I missed something here (quite likely!)?

Thanks,

xarg

For your question:

[codebox]
for (int i = 0; i < someZ; ++i){
    for (int k = 0; k < someZ; ++k){
        for (int j = 0; j < someZ; ++j){
            // Some processing here
            for (int x = 0; x < someX; ++x){
                for (int y = 0; y < someY; ++y){
                    // Some processing here
                }// for y
            }// for x
        }// for j
    }// for k
}// for i
[/codebox]

I think your problem is “how to use the grid configuration to label a 3-level nested loop”.

My approach is to access the data slice by slice.

For example: suppose I have 3D data X(0:n1-1, 0:n2-1, 0:n3-1) and I want to do a transpose operation, say X(i,j,k) → Y(j,k,i) where Y(0:n2-1, 0:n3-1, 0:n1-1).

First I define a square block (for example, dim3 threads(16, 16, 1)), then I cut the data into x-z slices:

one x-z slice (y is fixed) requires M = Gx * Gz blocks, where Gx = ceil(n1/16) and Gz = ceil(n3/16),

so the total number of blocks required is M * n2.

Finally I organize the M * n2 blocks into a 2D grid configuration.

For example: find k1, k2 such that k1 * k2 - n2 <= 1, then issue dim3 grid( k2*Gz, k1*Gx, 1 ).

the pseudo-code is

[codebox]
void foo( float *X, unsigned int n1, unsigned int n2, unsigned int n3 )
{
    unsigned int Gx, Gz, k1, k2 ;
    double db_n2 = (double) n2 ;

    /* Gx = number of blocks needed along the x-axis
     * Gz = number of blocks needed along the z-axis
     * we call Gx x Gz blocks a coarse grid
     */
    Gx = (n1 + BLOCK_DIM-1) / BLOCK_DIM ;
    Gz = (n3 + BLOCK_DIM-1) / BLOCK_DIM ;

    /* since one coarse grid covers one x-z slice, we need n2 coarse grids
     * to cover X. In order to save resources, we want to find two integers
     * k1, k2 such that
     *     k1 * k2 - n2 <= 1
     * for example:
     *     n2 = 7  ==> k1 = 2 and k2 = 4
     *     n2 = 13 ==> k1 = 2 and k2 = 7
     */
    int max_k1 = (int) floor( sqrt(db_n2) ) ;
    for ( k1 = max_k1 ; 1 <= k1 ; k1-- ){
        k2 = (unsigned int) ceil( db_n2/((double)k1) ) ;
        if ( 1 >= (k1*k2 - n2) ){
            break ;
        }
    }

    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
    dim3 grid( k2*Gz, k1*Gx, 1 );

    foo_kernel<<< grid, threads >>>( X, n1, n2, n3, Gx, Gz, k2 ) ;
}
[/codebox]

[codebox]
__global__ void foo_kernel( float *X,
        unsigned int n1, unsigned int n2, unsigned int n3,
        unsigned int Gx, unsigned int Gz, unsigned int k2 )
{
    float tmp1, tmp2 ;
    unsigned int s1, s2, t1, t2, xIndex, yIndex, zIndex, index_in ;

    /* step 1: transform the grid index into a 3D coarse-grid index
     *     blockIdx.x = Gz * s1 + t1
     *     blockIdx.y = Gx * s2 + t2
     * where (s1, s2) indexes the y-direction and (t1, t2) indexes the
     * x-z slice (local block ID), so that
     *     s1 = floorf( blockIdx.x / Gz )
     *     t1 = blockIdx.x - Gz*s1
     *     s2 = floorf( blockIdx.y / Gx )
     *     t2 = blockIdx.y - Gx*s2
     */
    tmp1 = __uint2float_rz( blockIdx.x ) ;
    tmp2 = __uint2float_rz( Gz ) ;
    tmp1 = floorf( tmp1 / tmp2 ) ;
    s1 = __float2uint_rz( tmp1 ) ;
    t1 = blockIdx.x - Gz*s1 ;

    tmp1 = __uint2float_rz( blockIdx.y ) ;
    tmp2 = __uint2float_rz( Gx ) ;
    tmp1 = floorf( tmp1 / tmp2 ) ;
    s2 = __float2uint_rz( tmp1 ) ;
    t2 = blockIdx.y - Gx*s2 ;

    // step 2: extract the data indices
    yIndex = s2*k2 + s1 ;
    zIndex = t1 * BLOCK_DIM + threadIdx.x ;
    xIndex = t2 * BLOCK_DIM + threadIdx.y ;

    if ( (yIndex < n2) && (xIndex < n1) && (zIndex < n3) ){
        index_in = (xIndex * n2 + yIndex) * n3 + zIndex ;
        for (int x = 0; x < someX; ++x){
            for (int y = 0; y < someY; ++y){
                // processing data element X(i,j,k) = X[index_in]
            }// for y
        }// for x
    }
}
[/codebox]

If you can map each thread to an index of the outer three loops, then you just need one kernel function and can handle the remaining two loops inside it.

The above is just a framework. Maybe you can write out the operations in your inner two loops and then we can look for an approach to those as well.

Thanks for your detailed reply. I was doing it as follows:

[codebox]
const unsigned block_size = 320;

// Calculate the size of the grid needed to process all the blocks
float numParams = sizeX * sizeY * sizeZ;
const unsigned grid_size = (unsigned)(ceil(numParams/(float)(block_size)));

// Set up our grid and thread blocks
dim3 block(block_size, 1, 1);
dim3 grid(grid_size, 1, 1);
[/codebox]

However, in my kernel I am having problems trying to figure out how to index everything correctly. So, my original CPU code was:

[codebox]
for (int i = 0; i < sizeX; ++i)
{
    for (int j = 0; j < sizeY; ++j)
    {
        for (int k = 0; k < sizeZ; ++k)
        {
        }
    }
}
[/codebox]

What would be the corresponding i, j, k values in the kernel code for each thread, given the setup I described above?

Cheers,

xarg

Never mind this. I figured it out.