Is there any way to launch a graph from the HOST node?

I hope to find a way to launch a graph from a HOST node.
According to the Runtime API documentation, I know I can’t directly call CUDA functions in the callback of a HOST node, so I can’t call cudaGraphLaunch() to launch an executable graph from the HOST node either.
After some consideration, I thought I could create a child thread in the HOST node callback and call cudaGraphLaunch() from that thread. I assumed the child thread has nothing to do with the graph, so this approach should work. I wrote the code, but the program halts at cudaStreamSynchronize(stream) in the child thread invoked from the HOST node, without reporting any errors.

I know that creating a child thread may not be a good solution, but it is the only one I figured out after several hours.
I am using CUDA 11.4, which doesn’t support the conditional node type (supported since CUDA 12.3.1).
I want to add some control flow, like an if statement, in the HOST node to decide whether to launch the graph, so the CHILD node type doesn’t meet my needs.

I hope to figure out why the program halts at cudaStreamSynchronize(stream) and how to launch a graph from a HOST node. My code is below:

#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>
#define CUDA_CHECK_ERROR(f, msg)\
{\
    cudaError_t error = f;\
    if (error != cudaSuccess) {\
        printf("File %s Line %d occurs error, error msg : %s, error code : %d, error name : %s, user msg : %s\n", __FILE__, __LINE__, cudaGetErrorString(error), error, cudaGetErrorName(error), msg);\
    }\
}
// called by child thread.
void* launchGraph(void* userData) {
    cudaGraphExec_t graphExec = *(cudaGraphExec_t*)userData;
    cudaStream_t stream;
    CUDA_CHECK_ERROR(cudaStreamCreate(&stream), "call cudaStreamCreate failed.");
    printf("after cudaStreamCreate.\n");
    CUDA_CHECK_ERROR(cudaGraphLaunch(graphExec, stream), "call cudaGraphLaunch failed.");
    printf("after cudaGraphLaunch.\n");
  //************The program halts here and can't proceed************
    CUDA_CHECK_ERROR(cudaStreamSynchronize(stream), "call cudaStreamSynchronize failed.");
    printf("after cudaStreamSynchronize.\n");
    cudaStreamDestroy(stream);
    return nullptr;
}
// the callback function of the HOST node
void hostNodeCallBack(void* userData) {
    // create a thread to launch the child executable graph
    pthread_t thread;
    pthread_create(&thread, NULL, launchGraph, userData);
    pthread_join(thread, nullptr);
}
int main() {
    // 1. create a child graph
    cudaGraph_t childGraph;
    cudaGraphExec_t childGraphExec;
    cudaGraphCreate(&childGraph, 0);
    cudaGraphNode_t node;
    cudaGraphAddEmptyNode(&node, childGraph, NULL, 0);
    cudaGraphInstantiate(&childGraphExec, childGraph, nullptr, nullptr, 0);
    // 2. create a parent graph containing a host node
    cudaGraph_t parentGraph;
    cudaGraphExec_t parentGraphExec;
    cudaGraphCreate(&parentGraph, 0);
    // 2.1 add a host node to parent graph
    cudaHostNodeParams params;
    params.fn = hostNodeCallBack;
    params.userData = &childGraphExec; // pass the child executable graph as argument
    cudaGraphAddHostNode(&node, parentGraph, NULL, 0, &params);
    // 2.2 launch parent graph
    cudaGraphInstantiate(&parentGraphExec, parentGraph, nullptr, nullptr, 0);
    cudaGraphLaunch(parentGraphExec, cudaStreamDefault);
    cudaStreamSynchronize(cudaStreamDefault);

    // release resources
    return 0;
}

Any help will be appreciated.

Jack

My guess would be that you are creating a deadlock by executing the parent graph in the default stream.

The host callback waits until the child thread returns. The child thread waits until the child graph completes. The child graph waits for the parent graph to complete, because the parent graph was launched into the default stream, which synchronizes with other blocking streams. And the parent graph cannot complete because its host callback is still waiting.

(There is another thing I would like to comment on: I don’t think it is correct to use cudaStreamDefault as a stream argument. cudaStreamDefault is one of the two flags that can be passed to cudaStreamCreateWithFlags, and just happens to be defined as 0. A valid named stream with “default stream semantics” would be cudaStreamLegacy.)


You can launch a child graph from a graph node (the CHILD node type). Also see here.

Thanks a lot, you are right. I had indeed created a deadlock and misunderstood the usage of cudaStreamDefault.
When I launch the parent graph in a regular stream, the program proceeds and finishes successfully.

But there is another problem. When I change the node type in the child executable graph (the one launched by the child thread) from an EMPTY node to a HOST node, the same thing happens: the program halts at cudaStreamSynchronize(stream) in the child thread invoked from the HOST node, without showing any errors. The code is below; I changed only the two places marked CHANGE 1 and CHANGE 2.

// .........
// other code is the same as before
void printSomething(void* userData) {
    printf("printSomething from child graph.\n");
}
int main() {
    // 1. create a child graph
    cudaGraph_t childGraph;
    cudaGraphExec_t childGraphExec;
    cudaGraphCreate(&childGraph, 0);
    cudaGraphNode_t node;
    // cudaGraphAddEmptyNode(&node, childGraph, NULL, 0);
    // *****CHANGE 1: add a host node rather than an empty node to the child graph*****
    cudaHostNodeParams hostNodeParams;
    hostNodeParams.fn = printSomething;
    hostNodeParams.userData = nullptr;
    cudaGraphAddHostNode(&node, childGraph, NULL, 0, &hostNodeParams);
    cudaGraphInstantiate(&childGraphExec, childGraph, nullptr, nullptr, 0);
    // 2. create a parent graph containing a host node
    cudaGraph_t parentGraph;
    cudaGraphExec_t parentGraphExec;
    cudaGraphCreate(&parentGraph, 0);
    // 2.1 add a host node to parent graph
    cudaHostNodeParams params;
    params.fn = hostNodeCallBack;
    params.userData = &childGraphExec; // pass the child executable graph as argument
    cudaGraphAddHostNode(&node, parentGraph, NULL, 0, &params);
    // 2.2 launch parent graph
    cudaGraphInstantiate(&parentGraphExec, parentGraph, nullptr, nullptr, 0);

    // *****CHANGE 2: launch the graph in a regular stream rather than the default stream*****
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphLaunch(parentGraphExec, stream);
    cudaStreamSynchronize(stream);
    // cudaGraphLaunch(parentGraphExec, cudaStreamDefault);
    // cudaStreamSynchronize(cudaStreamDefault);

    // release resources
    return 0;
}

Thanks again.

Jack

Thanks for your suggestion.

But what I really want is to execute the graph conditionally; in other words, whether the graph executes depends on particular conditions. Achieving this requires being able to launch a graph from the HOST node, where I can then add an if statement to control whether or not to launch it.
The graph in a CHILD node is launched unconditionally, so it doesn’t meet my needs.

The conditional node type, supported since CUDA 12.3.1, is exactly what I want. But I am using CUDA 11.4, so I have to implement it on my own.

Jack

The number of internal threads for host functions is unspecified. If there is only one thread for host functions, you have another deadlock, where each of the two host functions waits for the completion of the other.

You would need your own callback threads, and only use the CUDA host function to communicate with those threads. Don’t perform CUDA-related operations in the host function, whether directly or indirectly.

However, the simplest solution would be to just synchronize after the first graph and then launch the second graph conditionally. Once you have the full program implemented, you can profile it to see whether it is worth spending time on refactoring this.

cudaGraphLaunch(graph1, stream);
cudaStreamSynchronize(stream); // wait for graph1 to complete
if (condition) {
    cudaGraphLaunch(graph2, stream);
}

I am sorry, I don’t understand your idea very well.
Let me describe my code to make it clearer; I also put the complete code at the end. In the code, I create a graph (call it the parent graph) containing a HOST node. Then I create another graph (call it the child graph) containing a different HOST node. The HOST node of the parent graph creates a child thread (via pthread_create), and the child thread uses cudaGraphLaunch() to launch the child graph. The program halts at cudaStreamSynchronize() in the child thread.

Yes, there is only one thread in the HOST node of the parent graph. But I don’t understand why each of the two host functions waits for the completion of the other.

This is the complete code. It compiles on CUDA 11.4 but halts at cudaStreamSynchronize() in the child thread at runtime.

#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>
#define CUDA_CHECK_ERROR(f, msg)\
{\
    cudaError_t error = f;\
    if (error != cudaSuccess) {\
        printf("File %s Line %d occurs error, error msg : %s, error code : %d, error name : %s, user msg : %s\n", __FILE__, __LINE__, cudaGetErrorString(error), error, cudaGetErrorName(error), msg);\
    }\
}
// called by child thread.
void* launchGraph(void* userData) {
    cudaGraphExec_t graphExec = *(cudaGraphExec_t*)userData;
    cudaStream_t stream;
    CUDA_CHECK_ERROR(cudaStreamCreate(&stream), "call cudaStreamCreate failed.");
    printf("after cudaStreamCreate.\n");
    CUDA_CHECK_ERROR(cudaGraphLaunch(graphExec, stream), "call cudaGraphLaunch failed.");
    printf("after cudaGraphLaunch.\n");
  //************The program halts here and can't proceed************
    CUDA_CHECK_ERROR(cudaStreamSynchronize(stream), "call cudaStreamSynchronize failed.");
    printf("after cudaStreamSynchronize.\n");
    cudaStreamDestroy(stream);
    return nullptr;
}
// the callback function of the HOST node
void hostNodeCallBack(void* userData) {
    // create a thread to launch the child executable graph
    pthread_t thread;
    pthread_create(&thread, NULL, launchGraph, userData);
    pthread_join(thread, nullptr);
}
void printSomething(void* userData) {
    printf("printSomething from child graph.\n");
}
int main() {
    // 1. create a child graph
    cudaGraph_t childGraph;
    cudaGraphExec_t childGraphExec;
    cudaGraphCreate(&childGraph, 0);
    cudaGraphNode_t node;
    // cudaGraphAddEmptyNode(&node, childGraph, NULL, 0);
    // *****CHANGE 1: add a host node rather than an empty node*****
    cudaHostNodeParams hostNodeParams;
    hostNodeParams.fn = printSomething;
    hostNodeParams.userData = nullptr;
    cudaGraphAddHostNode(&node, childGraph, NULL, 0, &hostNodeParams);
    cudaGraphInstantiate(&childGraphExec, childGraph, nullptr, nullptr, 0);
    // 2. create a parent graph containing a host node
    cudaGraph_t parentGraph;
    cudaGraphExec_t parentGraphExec;
    cudaGraphCreate(&parentGraph, 0);
    // 2.1 add a host node to parent graph
    cudaHostNodeParams params;
    params.fn = hostNodeCallBack;
    params.userData = &childGraphExec; // pass the child executable graph as argument
    cudaGraphAddHostNode(&node, parentGraph, NULL, 0, &params);
    // 2.2 launch parent graph
    cudaGraphInstantiate(&parentGraphExec, parentGraph, nullptr, nullptr, 0);

    // *****CHANGE 2: launch the graph in a regular stream rather than the default stream*****
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphLaunch(parentGraphExec, stream);
    cudaStreamSynchronize(stream);
    // cudaGraphLaunch(parentGraphExec, cudaStreamDefault);
    // cudaStreamSynchronize(cudaStreamDefault);

    // release resources
    return 0;
}

Jack

Try to draw the dependency graph. Assume there is one internal CUDA thread T1 that executes all the host callback functions, and a thread T2, which you create. You then have a circular dependency between T1 and T2. T1 waits until T2 is done (pthread_join). T2 submits the callback printSomething and waits until it has executed (cudaStreamSynchronize). But printSomething can only be executed by T1, and T1 is already busy waiting for T2. So printSomething will never execute, cudaStreamSynchronize will never return, and pthread_join will never return.


Thanks! It hadn’t occurred to me that there might be only one internal CUDA thread executing all the host callback functions.

If that is true, I can understand how the code causes the deadlock.
I changed the HOST node in the child graph to a KERNEL node and kept the rest of the code unchanged, and the program exits correctly. So the assumption may be true.
I noticed you used the word “assume”; why?
Could you tell me where you got this information, or what led you to the idea that there is one internal CUDA thread that executes all the host callback functions?

It is not specified how host callbacks are implemented; it is an implementation detail. But you can set up simple experiments with a profiler to get a rough idea. Last time I checked, one separate thread per device was used.

However, we don’t need to know the concrete implementation. The API description of cudaLaunchHostFunc in the CUDA Runtime API documentation

states that:

The host function must not perform any synchronization that may depend on outstanding CUDA work not mandated to run earlier. Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized.

Not following this restriction can lead to issues like a deadlock.

Thanks.
I still have three questions.
Q1:
Can I understand it this way: executing the callback function of a HOST node in a stream is similar (or equivalent?) to launching a host function into the stream with cudaLaunchHostFunc? If the two are related, it seems reasonable to obey the requirements of cudaLaunchHostFunc when writing the callback function of a HOST node.

Q2:

I don’t understand “You would need your own callback threads.” Aren’t the callback threads created by the CUDA runtime itself? I am confused.

Q3:

Yes, cudaStreamSynchronize and streams can accomplish this kind of work.
But I still hope to integrate control-flow statements (like if) into a graph node, because that avoids the if statements splitting the topology into too many graphs. I drew two pictures to make it clearer why I need this. To construct a topology with CUDA graphs, I think there are two ways.

In figure 1, if I can integrate an if statement into a graph node (what I want to achieve in this post), I can construct one graph (named graph1) containing the whole topology and pass graph2 and graph3 as arguments to two nodes, which execute the corresponding graph depending on the particular conditions.
[Figure 1: With an if-node, we only need three cudaGraph_t objects]
In figure 2, relying only on cudaStreamSynchronize, I need to divide the topology into different graphs; I may need six cudaGraph_t objects in this case.
[Figure 2: Without an if-node, we may need six cudaGraph_t objects]

Thanks in advance.
Jack

Q1: Yes, host nodes are the graph equivalent of cudaLaunchHostFunc.

Q2: Your own, manually managed threads are allowed to make CUDA calls and to have dependencies on outstanding CUDA work. The CUDA runtime callback threads are not allowed to do this. You would use the CUDA thread to tell your own thread what to do next, without waiting in the CUDA thread for the completion of your manual thread.

Q3: In my opinion, Figure 2 is still simpler and less error-prone than implementing a substitute “if-node”.

Thanks for your clear answers.
As for Q2:
I think your idea is similar to what I did in the last code example, in which the CUDA host thread launches a child thread to execute the child graph.
The only difference is that you require the CUDA host thread not to wait for the completion of the child thread, because that may cause a deadlock. I hope my understanding is correct.
Actually, at the beginning, I wanted the CUDA host thread to proceed only after the completion of the child graph, so the CUDA host thread had to wait. But that seems impossible to implement now.
I thought I could restrict the child graph to not contain any host node; that way, maybe there would be no deadlock.
But that would still violate the requirements of cudaLaunchHostFunc, which forbid any synchronization in the host function.

In my opinion, an “if-node” would still make development convenient.
Could you please give me some advice: should I give up on this idea?

Thanks
Jack

With CUDA 11, you need to either launch the graph outside of a CUDA host function, or use dynamic parallelism to launch kernels (without a graph) from the device based on some condition.

It is up to you if you want to spend more time on this. I don’t have more suggestions.

Thanks, you have already helped me a lot.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.