Scheduling a kernel asynchronously from inside another kernel

In my application I need three kernels (let us call them first, second and third) which operate one after the other on different grids:

__global__ void first(first_data* data)
{
        ...
}

__global__ void second(second_data* data)
{
        ...
}

__global__ void third(third_data* data, output* device_out)
{
        ...
}

At the moment, after that I have to copy the output of the last kernel (which is only a few bytes) to the host:

cudaMemcpyAsync(&host_out, device_out, sizeof(output), cudaMemcpyDeviceToHost, some_stream);

After that, on the CPU, depending on host_out I have to decide whether to reiterate the procedure (first - second - third kernels) or to stop (in which case the host_out data will be used elsewhere).

Now, even though I use pinned memory for my output structure, I observe a lot of latency in the timeline between the third kernel, the actual cudaMemcpy, and the next kernel (as you can see in the picture: the blue boxes on the top line are the kernels, while the fuchsia peak on the bottom line is the cudaMemcpy).
[profiler timeline screenshot]

What I would like is for the third kernel itself to decide whether to reiterate the kernel sequence or not:

  • if so, I would like the third kernel to asynchronously schedule the first one (keep in mind that the kernels operate on different grids);
  • if not, I would like the third kernel to just trigger the cudaMemcpy to download the (final) output structure.

Is it possible to do something like that? I read that it is possible to call a kernel from inside another kernel, but I think that is not quite what I need here.

Pinned host memory can be accessed by the device. You could simply have the kernel write its result to that memory directly; then you no longer need the memcpy call.
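A minimal sketch of that idea (the `output` struct and kernel name are illustrative, not from your code): the final kernel writes its few bytes of result straight into memory allocated with cudaMallocHost, so no explicit copy is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct output { int value; };

// The final kernel writes its result directly into pinned host memory.
__global__ void writeResult(output* host_out){
    host_out->value = 42;
}

int main(){
    output* host_out;
    cudaMallocHost(&host_out, sizeof(output)); // pinned, device-accessible
    host_out->value = 0;

    writeResult<<<1,1>>>(host_out);
    cudaDeviceSynchronize(); // make the device write visible to the host

    std::printf("value = %d\n", host_out->value); // prints "value = 42"
    cudaFreeHost(host_out);
}
```

This relies on the pinned allocation being mapped into the device address space, which is the case on all modern 64-bit CUDA platforms.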

For CUDA Dynamic Parallelism (launching kernels from kernels), all you really need is to compile with -rdc=true and schedule the work from within a kernel. Keep in mind that CUDA 12 reworked CDP, which is now CDP2.
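For example (the file name and architecture flag below are placeholders; use the sm_XX value that matches your GPU):

```shell
# Compile with relocatable device code so device-side launches link correctly.
nvcc -rdc=true -arch=sm_70 my_kernels.cu -o my_app
```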

Here is an example of how to repeat a sequence of three kernels based on a condition.

#include <iostream>
#include <cstdio>
#include <cassert>

__global__
void kernelA(){
    printf("A\n");
}

__global__
void kernelB(){
    printf("B\n");
}

__global__
void kernelC(int* output){
    printf("C\n");
    output[0]++;
}

#if __CUDACC_VER_MAJOR__ < 12

__global__
void cdp1_launcherkernel(int* output){
    while(*output < 4){
        kernelA<<<1,1>>>();
        kernelB<<<1,1>>>();
        kernelC<<<1,1>>>(output);
        cudaDeviceSynchronize();
    }
}

#endif

#if __CUDACC_VER_MAJOR__ >= 12

__global__
void relauncherKernel(int* output);


__global__
void cdp2_launcherkernel(int* output){
    kernelA<<<1,1,0,cudaStreamTailLaunch>>>();
    kernelB<<<1,1,0,cudaStreamTailLaunch>>>();
    kernelC<<<1,1,0,cudaStreamTailLaunch>>>(output);
    relauncherKernel<<<1,1,0,cudaStreamTailLaunch>>>(output);
}

__global__
void relauncherKernel(int* output){
    if(*output < 4){
        cdp2_launcherkernel<<<1,1,0,cudaStreamTailLaunch>>>(output);
    }
}

#endif



int main(){
    int* output;
    cudaMallocHost(&output, sizeof(int));
    *output = 0;

    std::cout << "host approach\n";
    while(*output < 4){
        kernelA<<<1,1>>>();
        kernelB<<<1,1>>>();
        kernelC<<<1,1>>>(output);
        cudaDeviceSynchronize();
    }
    assert(*output == 4);

    #if __CUDACC_VER_MAJOR__ < 12
    *output = 0;
    std::cout << "CDP1 approach\n";
    cdp1_launcherkernel<<<1,1>>>(output);
    cudaDeviceSynchronize();
    assert(*output == 4);
    #endif

    #if __CUDACC_VER_MAJOR__ >= 12
    *output = 0;
    std::cout << "CDP2 approach\n";
    cdp2_launcherkernel<<<1,1>>>(output);
    cudaDeviceSynchronize();
    assert(*output == 4);
    #endif
}

Thank you for the answer.
My only doubt is what happens if the thread and block arguments of the launching kernel are greater than 1. Is this allowed?
And will the A, B and C kernels then be launched in different blocks and threads with respect to the launching kernel?

The launcher kernel is just another kernel. If it is launched with more than one thread, each thread will execute the launch statements and launch its own copy of the kernels.
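If you only want the sequence launched once, guard the device-side launches so that a single thread issues them. A sketch (assuming CUDA 12 / CDP2 and compilation with -rdc=true; the kernel names are illustrative):

```cuda
#include <cstdio>

__global__ void work(){
    printf("work\n");
}

// The launcher runs with 32 threads, but only thread 0 of block 0 issues the
// device-side launch, so "work" is printed exactly once.
__global__ void launcher(){
    if(blockIdx.x == 0 && threadIdx.x == 0){
        work<<<1,1,0,cudaStreamTailLaunch>>>();
    }
}

int main(){
    launcher<<<1,32>>>();
    cudaDeviceSynchronize();
}
```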

I suggest reading the CDP section (Section 9) in the programming guide: CUDA C++ Programming Guide