Synchronization in Dynamic Parallelism

I have studied many posts with similar titles, but none of the resolutions satisfies my needs. In a nutshell, I want to replace a massive loop over the same kernel (which always loads the same, fixed matrix) with a single kernel call that loads the matrix once and then loops. The problem is not so much loading the NxN matrix repeatedly but launching a kernel with N blocks N^2 times when N > 1000, for which the launch overhead completely dominates the computation time. However, putting the loop inside a parent kernel requires dynamic parallelism, i.e., calls to child kernels by certain threads within the loop, and that in turn requires across-block synchronization at parent level before and/or after such calls, for which cudaDeviceSynchronize() (or any other sync) apparently does not work. (The typical advice is to end the parent kernel to sync, which is not an option for me here!)
I have enclosed a complete model CUDA-C code to illustrate my problem, where the cudaDeviceSynchronize() commands in the parent kernel do not have the desired effect and the while-loop races ahead uncontrollably (as the “loop” variable in each thread indicates). Here, each block of the parent loads one row of a fixed matrix M and repeatedly adds that row to an evolving vector v. Each block produces a sum w, which the child then trivially copies back into v for the next iteration. (In reality, more complex operations are involved.) Instead of the default inputs, running with n=8 and maxiter=10 should illustrate the problem.

/*  nvcc -arch=sm_35 -rdc=true DynKerTestRedux.cu -o DynKerTestRedux.x  */
#include<stdio.h>
#include<stdlib.h>
#define MAXTHREADS 256
#define MIN(a, b) ((a) < (b) ? (a) : (b))

__global__
void child(float *v,float *w,int n){
    int index = blockDim.x*blockIdx.x+threadIdx.x;
    v[index] = w[index];
    printf("child: %5d %f\n",index,v[index]);
}

__global__
void parent(float *M,float *v,float *w,int *iter,int *maxiter,int maxiterations,int mingrid,int minblock,int n){
    extern __shared__ float Mrow[];//one row of matrix in each block
    __shared__ float r[MAXTHREADS];//needed for efficient reduction
    int i,j,loop=0;
    float thisr;
    
    i = threadIdx.x;
    while(i < n){
        Mrow[i] = M[n*blockIdx.x+i];//shared storage, loaded only once
        i += blockDim.x;//best for coalescence
    }
    if( (threadIdx.x == 0) && (blockIdx.x == 0)){
        *iter = 0;
        *maxiter = maxiterations;
    }
    __syncthreads();
    do {//iter many times
        i = threadIdx.x;
        thisr = 0;
        while(i < n){
            thisr += v[i] + Mrow[i];
            i += blockDim.x;//best for coalescence
        }
        r[threadIdx.x] = thisr;
        __syncthreads();
        for (j=blockDim.x/2; j>0; j>>=1) {//reduction
            if (threadIdx.x < j) r[threadIdx.x] += r[threadIdx.x + j];
            __syncthreads();
        }
        if (threadIdx.x == 0){
            w[blockIdx.x] = r[0]/n;
            printf("parent: %5d %5d %5d %f\n",blockIdx.x,*iter,loop++,w[blockIdx.x]);
        }
        cudaDeviceSynchronize();
        if( (threadIdx.x == 0) && (blockIdx.x == 0)){//single-thread process!!!
            child<<<mingrid,minblock>>>(v,w,n);
            atomicAdd(iter,1);
        }
        cudaDeviceSynchronize();
    }while( *iter < *maxiter );
}

int main(int argc, char **argv){
    float  *h_M,*M,*h_v,*v,*w;
    int n,minblock,mingrid,i,j,maxiterations;
    int *iter,*maxiter;
    cudaSetDevice(1);
    n = 8192;//typical defaults
    maxiterations = 1000000;
    if(argc > 2){
        n = atoi(argv[1]);
        maxiterations = atoi(argv[2]);
    }
    minblock = MIN(MAXTHREADS,n);
    mingrid = (n + minblock - 1) / minblock;
    printf("init: %5d %5d %5d\n",n,mingrid,minblock);
    h_M = (float *) malloc (sizeof(float) * n * n);//flattened matrix M
    for( i=0; i<n; i++) for( j=0; j<n; j++) h_M[i*n+j] = 1;
    h_v = (float *) malloc (sizeof(float) * n);
    for( i=0; i<n; i++) h_v[i] = 0;
    cudaMalloc((void**)&iter, sizeof(int));
    cudaMalloc((void**)&maxiter, sizeof(int));
    cudaMalloc((void**)&M, sizeof(float) * n * n);
    cudaMalloc((void**)&v, sizeof(float) * n);
    cudaMalloc((void**)&w, sizeof(float) * n);
    cudaMemcpy(M, h_M, sizeof(float) * n * n,cudaMemcpyHostToDevice);
    cudaMemcpy(v, h_v, sizeof(float) * n,cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    parent<<<n,minblock,n*sizeof(float)>>>(M,v,w,iter,maxiter,maxiterations,mingrid,minblock,n);
    cudaDeviceSynchronize();
    cudaMemcpy(h_v, v, sizeof(float) * n,cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    for( i=0; i<n; i++) printf("result: %5d %f\n",i,h_v[i]);
    return(0);
}

Correct. cudaDeviceSynchronize() in device code never served that purpose. It was never a device-wide sync; it only waits for child kernels previously launched from the calling thread block.

CUDA has introduced cooperative groups (CG), which can provide a device-wide sync on newer device architectures (Pascal or newer); it is not usable on a cc3.5 device. Also, the CG device-wide sync is not usable together with CDP.

Dynamic parallelism launches do not eliminate launch overhead. In typical usage, the launch overhead of a device-side launch is equivalent to the launch overhead of a host-side launch.

The only way launch overhead completely dominates execution time is if the work done per launch is quite small. You should definitely seek to refactor your code away from multiple kernel launches in that case.

I think that is possible with the code you have shown, but I can already anticipate the objection “but this isn’t my real code”. So I’ll skip that exercise.

If you move forward to a Pascal or newer device, I’m fairly confident you can refactor the code you have shown to use a CG device-wide sync, with no CDP.
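
To illustrate the shape of such a refactor, here is a minimal sketch (not your code, and only a skeleton) assuming a Pascal-or-newer device that supports cooperative launch and a grid small enough that all blocks are resident on the device at once; the kernel name, sizes, and launch configuration are placeholders:

/*  nvcc -arch=sm_60 CGSketch.cu -o CGSketch.x  */
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void iterate(float *v, float *w, int n, int maxiter){
    cg::grid_group grid = cg::this_grid();
    for (int iter = 0; iter < maxiter; iter++){
        // ... per-block work that writes w[blockIdx.x], as in your parent kernel ...
        grid.sync();   // device-wide barrier: every block sees the finished w
        // ... copy w back into v, as your child kernel did ...
        grid.sync();   // every block sees the updated v before the next pass
    }
}

int main(){
    int n = 1024, maxiter = 10;
    float *v, *w;
    cudaMalloc((void**)&v, sizeof(float) * n);
    cudaMalloc((void**)&w, sizeof(float) * n);
    // cooperative kernels must be launched via cudaLaunchCooperativeKernel,
    // and the whole grid must fit on the device simultaneously
    void *args[] = { &v, &w, &n, &maxiter };
    cudaLaunchCooperativeKernel((void*)iterate, dim3(n/256), dim3(256), args, 0, 0);
    cudaDeviceSynchronize();
    return 0;
}

You would also want to query cudaDevAttrCooperativeLaunch and the occupancy of the kernel to pick a grid size that is guaranteed to be co-resident.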


Thanks so much for the rapid response! I was hoping that there might be some work-around, such as using atomics on the counter in global memory (“iter”) to halt all threads in all blocks: each thread sets a register “halt” to *iter at the beginning of the loop and is then trapped in an empty loop “while(halt==*iter){};” until the one operating thread updates *iter atomically. This almost works when iter is declared volatile, but it seems very fickle.
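
For reference, the fragile spin-wait described above would look roughly like the sketch below (my paraphrase of the idea, not working code from the application). CUDA makes no forward-progress guarantee across blocks, so a spin like this can hang whenever not all blocks of the grid are resident on the device at once, which is presumably why it appears fickle:

__global__ void parent_spin(float *v, float *w, volatile int *iter, int maxiterations, int n){
    do {
        int halt = *iter;                        // snapshot the global counter
        // ... per-block work that writes w[blockIdx.x] ...
        __syncthreads();
        if ((threadIdx.x == 0) && (blockIdx.x == 0)){
            // child<<<mingrid,minblock>>>(v,w,n);  // single-thread launch, as before
            atomicAdd((int *)iter, 1);           // release the other blocks
        }
        while (*iter == halt) { }                // busy-wait until the counter moves
        __syncthreads();
    } while (*iter < maxiterations);
}

Note that nothing here makes block 0 wait for the other blocks to finish writing w before the launch, so this only addresses the post-launch half of the synchronization problem.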

However, my main concern was less about CDP per se (so I doubt that CG would solve my problem). I used CDP only because I want to put a huge number of iterations inside a parent kernel (thus requiring child kernels to be called via CDP) instead of calling that parent kernel iteratively (which would allow the child to be refactored without CDP). In light of your discouraging answer, I measured the actual difference in computational cost of a loop-of-kernels versus a kernel-of-loops when I eliminate the child kernel entirely and merely iterate the NxN matrix-vector operation N^2 times (up to N=2048). To my surprise, there was only a factor of at most 2 difference! So, launching a kernel with N blocks and 1024 threads N^2 times, repeatedly loading an NxN matrix from global memory (albeit coalesced), costs only a little more than loading it once and then looping N^2 times within the kernel over the matrix-vector operation. I would have expected the former to scale as ~N^3, launching N blocks N^2 times, whereas the latter should stay at ~N^2, launching those N blocks only once. (The matrix-vector operations themselves are done in parallel at ~N^0 cost.) I’m stunned! What am I missing?

Not sure what you are missing.

If the launch overhead is small compared to the work being done per launch, then the cost difference between doing each launch individually and looping inside a kernel is going to be relatively small - in line with your observation (I guess). Furthermore, if the cost to load the data seems not to matter, then again one possibility is that it is small compared to the cost of the work being done. GPUs can sometimes hide the latency associated with loading data; a compute-bound kernel may not be impacted much by having to load data.

We could model the first (multiple launch) case as:

TC1 = (O + L + W)*N  (launch overhead O + load cost L + work cost W, paid on every one of the N launches)

We could model the single launch case as:

TC2 = O + L + W*N

If W is large compared to O and/or L, then TC1 might be approximately the same as TC2.
Just speculation based on your observations.
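
As a purely illustrative set of made-up numbers: with O = 5 us, L = 20 us, and W = 1000 us per pass, TC1 = 1025*N us while TC2 = 25 + 1000*N us, so for large N the multi-launch version is only about 2.5% slower even though it pays the launch and load cost on every one of the N launches. Only when W shrinks toward the size of O + L does the gap approach the factor of N you were expecting.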