Is it possible to call cudaThreadSynchronize from a CUDA kernel?
I have a CUDA kernel that handles a big loop in parallel. I would like it to call another kernel that handles a subset of the data (that kernel might in turn call another kernel), and then use cudaThreadSynchronize to ensure that all the kernels finish before we go on.
Thanks for the reply jjp! It’s a shame though about this limitation.
So, if I have something like this:
[codebox]
for (int i = 0; i < someZ; ++i)
{
    for (int k = 0; k < someZ; ++k)
    {
        for (int j = 0; j < someZ; ++j)
        {
            // Some processing here
            for (int x = 0; x < someX; ++x)
            {
                for (int y = 0; y < someY; ++y)
                {
                    // Some processing here
                }
            }
        }
    }
}
[/codebox]
So, in this scenario, I can have a kernel handling the outer 3 loops in parallel, but the inner 2 loops will have to be executed serially by each thread, right? How does one currently handle this kind of situation?
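To make the scenario above concrete, here is a minimal sketch of "one thread per (i, k, j), inner loops serial". It is written as plain C++ so it can be checked on the host; the names `Index3`, `decodeOuter`, `threadBody` and `runAll` are made up for illustration, and the arithmetic inside is only a placeholder for the "Some processing here" comments.

```cpp
#include <cassert>
#include <vector>

// Lets the same functions compile as plain C++ and, under nvcc, as device code.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct Index3 { int i, k, j; };  // hypothetical helper type

// Decode a linear thread id into the three outer loop indices (i, k, j).
__host__ __device__ inline Index3 decodeOuter(int tid, int nK, int nJ) {
    Index3 idx;
    idx.j = tid % nJ;
    idx.k = (tid / nJ) % nK;
    idx.i = tid / (nJ * nK);
    return idx;
}

// Work of one "thread": the outer indices come from tid,
// the inner x/y loops still run serially inside the thread.
__host__ __device__ inline void threadBody(int tid, int someZ, int someX,
                                           int someY, int* out) {
    Index3 idx = decodeOuter(tid, someZ, someZ);
    int acc = idx.i + idx.k + idx.j;      // placeholder for the outer processing
    for (int x = 0; x < someX; ++x) {
        for (int y = 0; y < someY; ++y) {
            acc += 1;                     // placeholder for the inner processing
        }
    }
    out[tid] = acc;                       // each thread writes only its own slot
}

// Host stand-in for the kernel launch: visits every thread id once.
inline std::vector<int> runAll(int someZ, int someX, int someY) {
    assert(someZ > 0 && someX >= 0 && someY >= 0);
    const int total = someZ * someZ * someZ;
    std::vector<int> out(total);
    for (int tid = 0; tid < total; ++tid) {
        threadBody(tid, someZ, someX, someY, out.data());
    }
    return out;
}
```

In an actual kernel the loop in `runAll` would disappear: each thread would compute `tid = blockIdx.x * blockDim.x + threadIdx.x`, return early if `tid >= total`, and call `threadBody(tid, ...)`.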
Thanks for the reply. However, how about the scenario I mentioned in my last reply? I can do the outer 3 loops in parallel and do the required processing but then the inner two loops cannot be made parallel, right?
If I could launch another kernel, I could handle the outer 3 loops in my top kernel and then launch another kernel with the updated parameters to handle the inner two loops. Or have I missed something here (quite likely!)?
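One possible alternative to a nested launch, sketched here under two assumptions that the thread does not confirm: the inner (x, y) iterations are independent of each other, and the outer "Some processing here" step can be finished in an earlier pass launched from the host. In that case all five loops can be flattened into one index space, so the inner loops are parallel too. `Index5`, `decodeAll` and `flattenedPass` are hypothetical names, and the stored value is a placeholder.

```cpp
#include <cassert>
#include <vector>

// Lets the same functions compile as plain C++ and, under nvcc, as device code.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct Index5 { int i, k, j, x, y; };  // hypothetical helper type

// Decode one linear id into all five loop indices (y varies fastest).
__host__ __device__ inline Index5 decodeAll(long long tid, int nZ, int nX, int nY) {
    Index5 idx;
    idx.y = static_cast<int>(tid % nY); tid /= nY;
    idx.x = static_cast<int>(tid % nX); tid /= nX;
    idx.j = static_cast<int>(tid % nZ); tid /= nZ;
    idx.k = static_cast<int>(tid % nZ); tid /= nZ;
    idx.i = static_cast<int>(tid);
    return idx;
}

// Host stand-in for the second pass: one "thread" per (i, k, j, x, y).
inline std::vector<int> flattenedPass(int nZ, int nX, int nY) {
    assert(nZ > 0 && nX > 0 && nY > 0);
    const long long total = static_cast<long long>(nZ) * nZ * nZ * nX * nY;
    std::vector<int> out(static_cast<std::size_t>(total));
    for (long long tid = 0; tid < total; ++tid) {
        Index5 idx = decodeAll(tid, nZ, nX, nY);
        // Placeholder for the inner processing; each id writes its own slot.
        out[static_cast<std::size_t>(tid)] = idx.x * nY + idx.y;
    }
    return out;
}
```

If the inner iterations do depend on each other, this does not apply and the serial inner loops per thread remain the simpler choice.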
[codebox]
for (int i = 0; i < someZ; ++i){
    for (int k = 0; k < someZ; ++k){
        for (int j = 0; j < someZ; ++j){
            // Some processing here
            for (int x = 0; x < someX; ++x){
                for (int y = 0; y < someY; ++y){
                    // Some processing here
                }// for y
            }// for x
        }// for j
    }// for k
}// for i
[/codebox]
I think your problem is “how to use the grid configuration to label 3 nested loops”.
My approach is to access the data elements slice by slice.
For example, suppose I have 3D data X(0:n1-1, 0:n2-1, 0:n3-1) and I want to do a transpose operation,
say X(i,j,k) → Y(j,k,i) where Y(0:n2-1, 0:n3-1, 0:n1-1).
First I define a square thread block (for example, dim3 threads(16, 16, 1) ), then
I cut the data into x-z slices.
One x-z slice (y is fixed) requires M = Gx * Gz blocks, where Gx = ceil(n1/16), Gz = ceil(n3/16),
so the total number of blocks required is M * n2.
Finally I organize the M * n2 blocks into a 2D grid configuration.
For example: find k1, k2 such that k1 * k2 >= n2 and k1 * k2 - n2 <= 1 (so every slice is covered with at most one spare), then issue dim3 grid( k2 * Gz, k1 * Gx, 1 )
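The block counts above can be checked with a small host-side sketch. This is only an illustration, not the poster's code: `GridPlan`, `BlockCoord`, `planGrid` and `decodeBlock` are hypothetical names, the way k1 and k2 are searched for is one possible choice that satisfies k1 * k2 - n2 <= 1, and the mapping from a 2D block index back to (slice, gx, gz) is an assumption.

```cpp
#include <cassert>

// Lets the same functions compile as plain C++ and, under nvcc, as device code.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct GridPlan { unsigned Gx, Gz, k1, k2, gridX, gridY; };  // hypothetical
struct BlockCoord { unsigned slice, gx, gz; bool valid; };   // hypothetical

inline unsigned ceilDiv(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Compute Gx, Gz and a 2D grid (k2 * Gz, k1 * Gx) that covers n2 x-z slices.
inline GridPlan planGrid(unsigned n1, unsigned n2, unsigned n3, unsigned tile = 16) {
    assert(n1 > 0 && n2 > 0 && n3 > 0 && tile > 0);
    GridPlan p;
    p.Gx = ceilDiv(n1, tile);
    p.Gz = ceilDiv(n3, tile);
    // Assumed search: try k1 * k2 == n2 and k1 * k2 == n2 + 1,
    // keep the factor pair with the largest k1 <= k2 (most balanced).
    p.k1 = 1;
    p.k2 = n2;
    for (unsigned total = n2; total <= n2 + 1; ++total) {
        for (unsigned a = 1; a * a <= total; ++a) {
            if (total % a == 0 && a > p.k1) { p.k1 = a; p.k2 = total / a; }
        }
    }
    p.gridX = p.k2 * p.Gz;
    p.gridY = p.k1 * p.Gx;
    return p;
}

// Assumed decoding of a 2D block index (bx, by) into a slice and a tile in it.
__host__ __device__ inline BlockCoord decodeBlock(unsigned bx, unsigned by,
                                                  const GridPlan& p, unsigned n2) {
    BlockCoord c;
    c.gz = bx % p.Gz;                          // tile index along z
    c.gx = by % p.Gx;                          // tile index along x
    c.slice = (by / p.Gx) * p.k2 + bx / p.Gz;  // which y-slice
    c.valid = c.slice < n2;                    // spare blocks do no work
    return c;
}
```

For n1 = 100, n2 = 7, n3 = 50 this gives Gx = 7, Gz = 4, M = 28, k1 = 2, k2 = 4, so the grid is 16 x 14 blocks and the 28 blocks that decode to slice 7 are spare. In a kernel, `bx` and `by` would be `blockIdx.x` and `blockIdx.y`, and a block with `valid == false` would return immediately.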
the pseudo-code is
[codebox]void foo( doublereal *X, unsigned int n1, unsigned int n2, unsigned int n3 )