Scheduling a kernel asynchronously from inside another kernel

In my application I need three kernels (let us call them first, second and third) which operate one after the other on different grids:

__global__ void first(first_data* data)
{
        ...
}

__global__ void second(second_data* data)
{
        ...
}

__global__ void third(third_data* data, output* device_out)
{
        ...
}

At the moment, after that I have to copy the output of the last kernel (which is only a few bytes) to the host:

cudaMemcpyAsync(&host_out, device_out, sizeof(output), cudaMemcpyDeviceToHost, some_stream);

After that, on the CPU, depending on host_out I have to decide whether to reiterate the procedure (first - second - third kernels) or to stop (in which case the host_out data will be used elsewhere).

Now, even though I use pinned memory for my output structure, I observe a lot of latency in the timeline between the third kernel, the actual cudaMemcpy, and the next kernel (as you can see in the picture: the blue boxes on the top line are the kernels, while the fuchsia peak on the bottom line is the cudaMemcpy).
[profiler timeline screenshot]

What I would like is for the third kernel itself to decide whether to reiterate the kernel sequence or not:

  • if so, I would like the third kernel to asynchronously schedule the first one (keep in mind that the kernels operate on different grids);
  • if not, I would like the third kernel to just trigger the cudaMemcpy to download the (final) output structure.

Is it possible to do something like that? I read that it is possible to call a kernel from inside another kernel, but I think that is not quite what I need here.

Pinned host memory can be accessed by the device. You could simply have the kernel write its result to that memory directly; then you no longer need the memcpy call.
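A minimal sketch of that idea (the `output` struct and kernel name are illustrative, not from your code): the final kernel writes its few bytes of result straight into memory allocated with cudaMallocHost, so no explicit copy is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct output { int value; };

// The final kernel writes its result directly into pinned host memory.
__global__ void writeResult(output* host_out){
    host_out->value = 42;
}

int main(){
    output* host_out;
    cudaMallocHost(&host_out, sizeof(output)); // pinned, device-accessible
    host_out->value = 0;

    writeResult<<<1,1>>>(host_out);
    cudaDeviceSynchronize(); // make the device write visible to the host

    std::printf("value = %d\n", host_out->value); // prints "value = 42"
    cudaFreeHost(host_out);
}
```

This relies on the pinned allocation being mapped into the device address space, which is the case on all modern 64-bit CUDA platforms.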

For CUDA Dynamic Parallelism (launching kernels from kernels), all you really need is to compile with -rdc=true and schedule the work from within a kernel. Keep in mind that CUDA 12 reworked CDP, which is now CDP2.
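For example (the file name and architecture flag below are placeholders; use the sm_XX value that matches your GPU):

```shell
# Compile with relocatable device code so device-side launches link correctly.
nvcc -rdc=true -arch=sm_70 my_kernels.cu -o my_app
```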

Here is an example of how to repeat a sequence of three kernels based on a condition.

#include <iostream>
#include <cstdio>
#include <cassert>

__global__
void kernelA(){
    printf("A\n");
}

__global__
void kernelB(){
    printf("B\n");
}

__global__
void kernelC(int* output){
    printf("C\n");
    output[0]++;
}

#if __CUDACC_VER_MAJOR__ < 12

__global__
void cdp1_launcherkernel(int* output){
    while(*output < 4){
        kernelA<<<1,1>>>();
        kernelB<<<1,1>>>();
        kernelC<<<1,1>>>(output);
        cudaDeviceSynchronize();
    }
}

#endif

#if __CUDACC_VER_MAJOR__ >= 12

__global__
void relauncherKernel(int* output);


__global__
void cdp2_launcherkernel(int* output){
    kernelA<<<1,1,0,cudaStreamTailLaunch>>>();
    kernelB<<<1,1,0,cudaStreamTailLaunch>>>();
    kernelC<<<1,1,0,cudaStreamTailLaunch>>>(output);
    relauncherKernel<<<1,1,0,cudaStreamTailLaunch>>>(output);
}

__global__
void relauncherKernel(int* output){
    if(*output < 4){
        cdp2_launcherkernel<<<1,1,0,cudaStreamTailLaunch>>>(output);
    }
}

#endif



int main(){
    int* output;
    cudaMallocHost(&output, sizeof(int));
    *output = 0;

    std::cout << "host approach\n";
    while(*output < 4){
        kernelA<<<1,1>>>();
        kernelB<<<1,1>>>();
        kernelC<<<1,1>>>(output);
        cudaDeviceSynchronize();
    }
    assert(*output == 4);

    #if __CUDACC_VER_MAJOR__ < 12
    *output = 0;
    std::cout << "CDP1 approach\n";
    cdp1_launcherkernel<<<1,1>>>(output);
    cudaDeviceSynchronize();
    assert(*output == 4);
    #endif

    #if __CUDACC_VER_MAJOR__ >= 12
    *output = 0;
    std::cout << "CDP2 approach\n";
    cdp2_launcherkernel<<<1,1>>>(output);
    cudaDeviceSynchronize();
    assert(*output == 4);
    #endif
}

Thank you for the answer.
My only doubt is what happens if the thread and block arguments of the launching kernel are greater than 1. Is this allowed?
And will the A, B and C kernels then be launched in different blocks and threads with respect to the launching kernel?

The launcher kernel is just another kernel. If it is launched with more than one thread, each thread will execute the launch statements and launch its own copy of the kernels.
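If you only want the sequence launched once, guard the device-side launches so that a single thread issues them. A sketch (assuming CUDA 12 / CDP2 and compilation with -rdc=true; the kernel names are illustrative):

```cuda
#include <cstdio>

__global__ void work(){
    printf("work\n");
}

// The launcher runs with 32 threads, but only thread 0 of block 0 issues the
// device-side launch, so "work" is printed exactly once.
__global__ void launcher(){
    if(blockIdx.x == 0 && threadIdx.x == 0){
        work<<<1,1,0,cudaStreamTailLaunch>>>();
    }
}

int main(){
    launcher<<<1,32>>>();
    cudaDeviceSynchronize();
}
```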

I suggest reading the CDP section (Section 9) in the programming guide: CUDA C++ Programming Guide