How to launch a graph from the device?

This post shows how to launch a graph from a device. The conclusion of the post also mentions: "To try device graph launch, download CUDA Toolkit 12.0."

Looking at the Runtime API documentation for CUDA 12.0.0, I noticed two things.

  1. __host__ cudaError_t cudaGraphLaunch (cudaGraphExec_t graphExec, cudaStream_t stream).
    I notice that cudaGraphLaunch is declared with only the __host__ qualifier, so in theory this function is callable only from the CPU. How can the post invoke it inside a kernel?

  2. I can’t find anything about cudaStreamGraphFireAndForget or cudaStreamGraphTailLaunch, both of which are used in the post.

After that, I checked the Runtime API documentation for other CUDA versions (12.0.1 through 12.3.1). In 12.3.0, the declaration of cudaGraphLaunch changes to:
__host__ __device__ cudaError_t cudaGraphLaunch(cudaGraphExec_t graphExec, cudaStream_t stream).
We can see that cudaGraphLaunch has gained the __device__ qualifier. But I still can’t find cudaStreamGraphFireAndForget or cudaStreamGraphTailLaunch anywhere from CUDA 12.0.1 to 12.3.1 (the latest version at the time of writing).

My question is: does CUDA support launching graphs from the device, and if so, how is it done?

Any help will be appreciated.

Jack

In CUDA 12.2 (the version I happen to have handy) grep shows these definitions:

# grep -R Fire /usr/local/cuda/include/*
/usr/local/cuda/include/cuda_device_runtime_api.h:#define cudaStreamGraphFireAndForget          (cudaStream_t)0x0200000000000000
/usr/local/cuda/include/cuda_device_runtime_api.h:#define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000
/usr/local/cuda/include/cuda_device_runtime_api.h:#define cudaStreamFireAndForget             ((cudaStream_t)0x4) /**< Per-grid stream with a tail launch semantics. Only applicable when used with CUDA Dynamic Parallelism. */
#

Likewise, grepping for cudaGraphLaunch yields relevant results, showing the __device__ decoration:

# grep -R cudaGraphLaunch /usr/local/cuda/include/*
/usr/local/cuda/include/cuda_device_runtime_api.h:extern  __device__ __cudart_builtin__ cudaError_t CUDARTAPI cudaGraphLaunch(cudaGraphExec_t graphExec, cudaStream_t stream);
/usr/local/cuda/include/cuda_device_runtime_api.h:static inline  __device__ __cudart_builtin__ cudaError_t CUDARTAPI cudaGraphLaunch_ptsz(cudaGraphExec_t graphExec, cudaStream_t stream)
/usr/local/cuda/include/cuda_device_runtime_api.h:    return  cudaGraphLaunch(graphExec, stream);
/usr/local/cuda/include/cuda.h: * If \p hGraph contains kernels which call device-side cudaGraphLaunch() from multiple
/usr/local/cuda/include/cuda.h: * If \p hGraph contains kernels which call device-side cudaGraphLaunch() from multiple
/usr/local/cuda/include/cuda.h: *   cudaGraphLaunch() from multiple contexts. \p hErrNode_out will be set to this node.
/usr/local/cuda/include/cuda.h: *     did not use device-side cudaGraphLaunch() cannot be updated to a function which uses
/usr/local/cuda/include/cuda.h: *     device-side cudaGraphLaunch() unless the node resides on the same context as nodes which
/usr/local/cuda/include/cuda.h: *     did not use device-side cudaGraphLaunch() cannot be updated to a function which uses
/usr/local/cuda/include/cuda.h: *     device-side cudaGraphLaunch() unless the node resides on the same context as nodes which
/usr/local/cuda/include/cuda_runtime_api.h:    #define cudaGraphLaunch                __CUDART_API_PTSZ(cudaGraphLaunch)
/usr/local/cuda/include/cuda_runtime_api.h: * call device-side cudaGraphLaunch() from multiple devices, this will result in an error.
/usr/local/cuda/include/cuda_runtime_api.h: * ::cudaGraphLaunch,
/usr/local/cuda/include/cuda_runtime_api.h: * If \p graph contains kernels which call device-side cudaGraphLaunch() from multiple
/usr/local/cuda/include/cuda_runtime_api.h: * ::cudaGraphLaunch,
/usr/local/cuda/include/cuda_runtime_api.h: * If \p graph contains kernels which call device-side cudaGraphLaunch() from multiple
/usr/local/cuda/include/cuda_runtime_api.h: *   cudaGraphLaunch() from multiple devices. \p errNode_out will be set to this node.
/usr/local/cuda/include/cuda_runtime_api.h: *     did not use device-side cudaGraphLaunch() cannot be updated to a function which uses
/usr/local/cuda/include/cuda_runtime_api.h: *     device-side cudaGraphLaunch() unless the node resides on the same device as nodes which
/usr/local/cuda/include/cuda_runtime_api.h: * ::cudaGraphLaunch
/usr/local/cuda/include/cuda_runtime_api.h: * ::cudaGraphLaunch
/usr/local/cuda/include/cuda_runtime_api.h: *     did not use device-side cudaGraphLaunch() cannot be updated to a function which uses
/usr/local/cuda/include/cuda_runtime_api.h: *     device-side cudaGraphLaunch() unless the node resides on the same device as nodes which
/usr/local/cuda/include/cuda_runtime_api.h: * ::cudaGraphLaunch,
/usr/local/cuda/include/cuda_runtime_api.h:extern __host__ cudaError_t CUDARTAPI cudaGraphLaunch(cudaGraphExec_t graphExec, cudaStream_t stream);
/usr/local/cuda/include/cuda_runtime_api.h: * ::cudaGraphLaunch
/usr/local/cuda/include/cuda_runtime_api.h:    #undef cudaGraphLaunch
/usr/local/cuda/include/cuda_runtime_api.h:    extern __host__ cudaError_t CUDARTAPI cudaGraphLaunch(cudaGraphExec_t graphExec, cudaStream_t stream);
/usr/local/cuda/include/cuda_runtime.h: * ::cudaGraphLaunch,
#

I don’t see any evidence supporting the claims you have made, and my guess would be that the proper avenue to pursue is to follow the instructions in the blog you linked.


Here is a simple “hello world” style device graph launch example, using CUDA 12.2 on an L4:

# cat t116.cu
#include <iostream>

__global__ void parentKernel(
    cudaGraphExec_t deviceGraph)
{
    cudaGraphLaunch(deviceGraph, cudaStreamGraphFireAndForget);
}

__global__ void childKernel(int *d, int n) {for (int i = 0; i < n; i++) d[i] = i;}

void create_graph(cudaGraph_t *dg, int *d, int n){
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    childKernel<<<1, 1, 0, s>>>(d, n);
    cudaStreamEndCapture(s, dg);
    cudaStreamDestroy(s);
}

void setupAndLaunch(int *data, int n) {
    cudaGraph_t deviceGraph, parentGraph;
    cudaGraphExec_t deviceExec, parentExec;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Create the source graph for device graph operation we want to perform
    create_graph(&deviceGraph, data, n);

    // Instantiate the graph for this operation and explicitly upload
    cudaGraphInstantiate(&deviceExec, deviceGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(deviceExec, stream);

    // Create and instantiate the parent graph
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    parentKernel<<<1, 1, 0, stream>>>(deviceExec);
    cudaStreamEndCapture(stream, &parentGraph);
    cudaGraphInstantiate(&parentExec, parentGraph, cudaGraphInstantiateFlagDeviceLaunch);

    // Launch the parent graph - this will perform an implicit upload
    cudaGraphLaunch(parentExec, stream);
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
    cudaGraphDestroy(deviceGraph);
    cudaGraphDestroy(parentGraph);
}

int main(){
        int *d;
        int n = 10;
        cudaMalloc(&d, n*sizeof(d[0]));
        setupAndLaunch(d, n);
        int *hd = new int[n];
        cudaMemcpy(hd, d, n*sizeof(d[0]), cudaMemcpyDeviceToHost);
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess)
          for (int i = 0; i < n; i++) std::cout << hd[i] << " ";
        else std::cout << "Error: " << cudaGetErrorString(err);
        std::cout << std::endl;
}
# nvcc -o t116 t116.cu
# compute-sanitizer ./t116
========= COMPUTE-SANITIZER
0 1 2 3 4 5 6 7 8 9
========= ERROR SUMMARY: 0 errors
#

I haven’t done careful error checking, nor careful resource clean-up; the code is just to illustrate the general API flow for creating and launching a device graph.
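As a hypothetical variation (not part of the example above): the blog also describes tail launch, where a kernel schedules a graph to run after all work in the currently executing device graph completes. Assuming the cudaStreamGraphTailLaunch named stream and the device-side helper cudaGetCurrentGraphExec() from cuda_device_runtime_api.h, a self-relaunch might be sketched as:

```cuda
// Sketch only (assumes CUDA 12.x device graph launch support):
// a kernel inside a device-launched graph relaunches that same graph
// as a tail launch, i.e. it runs again after the current graph finishes.
__global__ void relaunchSelf()
{
    // Handle to the currently executing device graph
    cudaGraphExec_t self = cudaGetCurrentGraphExec();
    // Tail launch: enqueued to run after the current graph completes
    cudaGraphLaunch(self, cudaStreamGraphTailLaunch);
}
```

Unconditional self-relaunch would of course loop forever; real code would gate the relaunch on some device-side condition.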


Thanks for your detailed response and experiments.

Everything I described above is based on the online documentation. Maybe the documentation is not updated in time? I am confused.

I want to determine from which version CUDA supports graph launches from the device. Is the only way to do this to download each CUDA toolkit and check the header files (e.g. cuda_runtime_api.h)? Could you suggest any alternatives?

I need this because I am trying to use the CUDART_VERSION macro to conditionally compile different sections of code, so that this feature (launching graphs from the device) is used only when it is supported.

The CUDA docs are not perfect. The usual advice I offer if you find an issue or want to see an improvement is to file a bug.

You could use the Godbolt Compiler Explorer; I use it sometimes. For example, if I put the code I posted above into Godbolt and select CUDA 12.0.0 (the version suggested in the blog), it compiles correctly. (If you switch the compiler version to 11.8, for example, compilation fails.)


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.