Is it possible to call cudaThreadSynchronize from a CUDA kernel?

I have a CUDA kernel that handles a big loop in parallel. I would like it to call another kernel that handles a subset of the data (this kernel might in turn call yet another kernel), and then use cudaThreadSynchronize to ensure that all the kernels finish before we go on.

Thanks for the reply jjp! It’s a shame though about this limitation.

So, if I have something like this:

[codebox]
for (int i = 0; i < someZ; ++i)
{
    for (int k = 0; k < someZ; ++k)
    {
        for (int j = 0; j < someZ; ++j)
        {
            // Some processing here
            for (int x = 0; x < someX; ++x)
            {
                for (int y = 0; y < someY; ++y)
                {
                    // Some processing here
                }
            }
        }
    }
}
[/codebox]

So, in this scenario, I can have a kernel handle the outer 3 loops in parallel, but the inner two loops will have to be executed by each thread, right? How does one currently handle this kind of situation?

Thanks for the reply. However, how about the scenario I mentioned in my last reply? I can do the outer 3 loops in parallel and do the required processing but then the inner two loops cannot be made parallel, right?

If I could launch another kernel, I could handle the outer 3 loops in my top kernel and then launch another kernel with the updated parameters to handle the inner two loops. Or have I missed something here (quite likely!)?

[codebox]
for (int i = 0; i < someZ; ++i){
    for (int k = 0; k < someZ; ++k){
        for (int j = 0; j < someZ; ++j){
            // Some processing here
            for (int x = 0; x < someX; ++x){
                for (int y = 0; y < someY; ++y){
                    // Some processing here
                }// for y
            }// for x
        }// for j
    }// for k
}// for i

[/codebox]

I think your problem is “how to use the grid configuration to label three nested loops”.

My approach is to access the data elements slice by slice.

For example, suppose I have 3D data X(0:n1-1, 0:n2-1, 0:n3-1) and I want to do a transpose operation,

say X(i,j,k) --> Y(j,k,i), where Y(0:n2-1, 0:n3-1, 0:n1-1).

First I define a square thread block (for example, dim3 threads(16, 16, 1)). Then

I cut the data into x-z slices:

one x-z slice (y is fixed) requires M = Gx * Gz blocks, where Gx = ceil(n1/16) and Gz = ceil(n3/16),

so the total number of blocks required is M * n2.

Finally, I organize the M * n2 blocks into a 2D grid configuration.

For example: find k1, k2 such that 0 <= k1 * k2 - n2 <= 1 (so every slice is covered), then issue dim3 grid( k2 * Gz, k1 * Gx, 1 ).
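The bookkeeping behind the 2D grid layout described above can be sketched in plain host-side C++. This is only an illustration under stated assumptions: the names GridPlan, planGrid, BlockCoord and decodeBlock are hypothetical, and the way k1 and k2 are chosen here (k1 about sqrt(n2), k2 = ceil(n2/k1)) is a simpler stand-in that only guarantees k1 * k2 >= n2, not the tighter condition mentioned in the post. Blocks that land on a slice index >= n2 are marked inactive and would simply return early in a kernel.

```cpp
#include <cassert>

// Hypothetical helper types; none of these names come from the original post.
struct GridPlan {
    unsigned Gx, Gz;        // blocks needed per x-z slice along x and along z
    unsigned k1, k2;        // slices are tiled as k1 rows by k2 columns
    unsigned gridX, gridY;  // values to pass as dim3 grid(gridX, gridY, 1)
};

// Compute the grid shape for an n1 x n2 x n3 volume cut into n2 x-z slices.
GridPlan planGrid(unsigned n1, unsigned n2, unsigned n3, unsigned tile = 16) {
    assert(tile > 0 && n1 > 0 && n2 > 0 && n3 > 0);
    GridPlan p;
    p.Gx = (n1 + tile - 1) / tile;  // ceil(n1 / tile)
    p.Gz = (n3 + tile - 1) / tile;  // ceil(n3 / tile)
    // Assumption: pick k1 as the smallest integer with k1 * k1 >= n2,
    // then k2 = ceil(n2 / k1). This guarantees k1 * k2 >= n2 only.
    p.k1 = 1;
    while ((unsigned long long)p.k1 * p.k1 < n2) ++p.k1;
    p.k2 = (n2 + p.k1 - 1) / p.k1;
    p.gridX = p.k2 * p.Gz;
    p.gridY = p.k1 * p.Gx;
    return p;
}

struct BlockCoord {
    unsigned slice;   // which x-z slice (the fixed y index)
    unsigned gx, gz;  // block position inside that slice
    bool active;      // false for padding blocks with slice >= n2
};

// Map a 2D block index (what blockIdx.x / blockIdx.y would hold in a kernel)
// back to a slice index and a block position inside that slice.
BlockCoord decodeBlock(const GridPlan& p, unsigned n2, unsigned bx, unsigned by) {
    BlockCoord c;
    unsigned col = bx / p.Gz;  // slice column, 0 .. k2-1
    unsigned row = by / p.Gx;  // slice row,    0 .. k1-1
    c.gz = bx % p.Gz;
    c.gx = by % p.Gx;
    c.slice = row * p.k2 + col;
    c.active = c.slice < n2;
    return c;
}
```

Inside a kernel, the same arithmetic would presumably be applied to blockIdx.x and blockIdx.y, after which threadIdx gives the position inside the 16 x 16 tile.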

the pseudo-code is

[codebox]void foo( doublereal *X, unsigned int n1, unsigned int n2, unsigned int n3 )