Multiple independent streams in a graph

Hi There,

I have 10 streams that work in parallel without any dependencies on each other. The simpleCudaGraph example shows how to add multiple streams to a graph by recording an event on stream1 and waiting on it to trigger execution of stream2. Since I do not want any dependencies, how can I add multiple streams to a single graph? Or should I create 10 separate graphs, one per stream?

Could you please let me know which approach is better?


Using graph capture via the API, you can express the independent work being done by your 10 streams without actually using 10 streams.

The graph system will internally create the equivalent of streams to handle all work that you indicate can run concurrently.

Without knowing anything else about your code, that is the approach I would suggest.
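Equivalently, you can build the graph explicitly and add each kernel as a root node with no dependencies; nodes with no dependency edges between them are free to run concurrently. A minimal sketch, assuming a hypothetical per-stream kernel `work` and one device buffer per logical stream of work:

```cuda
#include <cuda_runtime.h>

__global__ void work(int *data, int n)   // hypothetical kernel, one per branch
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int nStreams = 10;
    int n = 1 << 20;

    // One device buffer per independent branch (assumed data layout)
    int *bufs[nStreams];
    for (int s = 0; s < nStreams; s++)
        cudaMalloc(&bufs[s], n * sizeof(int));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    for (int s = 0; s < nStreams; s++) {
        void *args[] = { &bufs[s], &n };

        cudaKernelNodeParams p = {};
        p.func = (void *)work;
        p.gridDim = dim3((n + 255) / 256);
        p.blockDim = dim3(256);
        p.sharedMemBytes = 0;
        p.kernelParams = args;   // argument values are copied when the node is added

        // NULL dependency list: every node is a root, so all 10 may run concurrently
        cudaGraphNode_t node;
        cudaGraphAddKernelNode(&node, graph, NULL, 0, &p);
    }

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);
    cudaGraphLaunch(exec, 0);    // a single launch submits all 10 branches
    cudaDeviceSynchronize();
    return 0;
}
```

The whole batch of independent work is then submitted with one cudaGraphLaunch() instead of 10 separate stream submissions.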

Thank you for your response.

Actually, I want the graph to make use of my existing 10 streams, on which I do job-queue management. If the graph manages streams internally, it will be difficult for me to distribute the workload between them, since I am working with pointer offsets into a big data structure.

I observed one more issue with graphs. I would like to pass a pointer to a data structure (an int * in the example below) to my kernel and capture it via cudaStreamBeginCapture(). Then, for the remaining LOOP_COUNT iterations, I try to advance the pointer by some offset before each cudaGraphLaunch(). But my kernel always receives the first pointer, the one passed between cudaStreamBeginCapture() and cudaStreamEndCapture(). Is there any way to send updated parameters to the kernel when calling cudaGraphLaunch()?

Thanks in advance!

    int *inputs, *deviceInputs;
    long *outputs, *deviceOutput;

    cudaStream_t stream1;
    cudaGraph_t graph;

    cudaStreamCreate(&stream1);
    cudaMallocHost(&inputs, (inputSize + LOOP_COUNT) * sizeof(int));
    cudaMallocHost(&outputs, inputSize * sizeof(long));
    cudaMalloc(&deviceInputs, (inputSize + LOOP_COUNT) * sizeof(int));
    cudaMalloc(&deviceOutput, inputSize * sizeof(long));

    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);

    cudaMemcpyAsync(deviceInputs, inputs, inputSize * sizeof(int), cudaMemcpyDefault, stream1);
    cudaMemcpyAsync(deviceOutput, outputs, inputSize * sizeof(long), cudaMemcpyDefault, stream1);

    int *temp = deviceInputs + 0;

    addition<<<(inputSize + 255) / 256, 256, 0, stream1>>>(inputSize, temp, deviceOutput);

    cudaMemcpyAsync(outputs, deviceOutput, inputSize * sizeof(long), cudaMemcpyDefault, stream1);

    cudaStreamEndCapture(stream1, &graph);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

    for (int i = 0; i < LOOP_COUNT; i++) {
        temp = deviceInputs + i;   // has no effect: the kernel argument was fixed at capture time
        printf("\nPassed pointer = %p, i=%d", temp, i);

        cudaGraphLaunch(graphExec, stream1);
    }
    cudaStreamSynchronize(stream1);
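Stream capture snapshots kernel arguments by value, so reassigning temp on the host cannot affect the already-captured graph. However, the runtime API does let you patch the kernel node of an instantiated graph between launches via cudaGraphExecKernelNodeSetParams() (available since CUDA 10.1). A hedged sketch, reusing the names from the snippet above and assuming the captured graph contains exactly one kernel node:

```cuda
// Locate the kernel node inside the captured graph.
size_t numNodes = 0;
cudaGraphGetNodes(graph, NULL, &numNodes);
std::vector<cudaGraphNode_t> nodes(numNodes);
cudaGraphGetNodes(graph, nodes.data(), &numNodes);

cudaGraphNode_t kernelNode = NULL;
for (size_t k = 0; k < numNodes; k++) {
    cudaGraphNodeType type;
    cudaGraphNodeGetType(nodes[k], &type);
    if (type == cudaGraphNodeTypeKernel)
        kernelNode = nodes[k];
}

for (int i = 0; i < LOOP_COUNT; i++) {
    int *temp = deviceInputs + i;
    void *args[] = { &inputSize, &temp, &deviceOutput };

    cudaKernelNodeParams p = {};
    p.func = (void *)addition;
    p.gridDim = dim3((inputSize + 255) / 256);
    p.blockDim = dim3(256);
    p.kernelParams = args;

    // Patch the instantiated graph's kernel arguments before each launch.
    cudaGraphExecKernelNodeSetParams(graphExec, kernelNode, &p);
    cudaGraphLaunch(graphExec, stream1);
}
cudaStreamSynchronize(stream1);
```

This updates only the executable graph (graphExec); the original cudaGraph_t is left untouched, and no re-instantiation is needed between iterations.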