Dynamic Parallelism with graphs

Hello.

In CUDA 11.6, device-side cudaDeviceSynchronize() was deprecated. Among the CUDA 12 new features, graphs were presented as an option for dynamic parallelism. I am not sure how graphs work, as I am new to the concept.

I was hoping to parallelize a depth-first search algorithm using dynamic parallelism, like so:

// This is non-working code; I just wanted to get the idea across.

#include &lt;iostream&gt;
#include &lt;cuda_runtime.h&gt;

__global__ void processData(int *data, int *results, int itemtoprocess)
{
    // ...
}

__global__ void rundfs(int *itemstosearch, int *data, int **placetostore, int *size_of_place_to_store, int tid)
{
    int itemtosearch = tid;
    // Process data...
    int *newArr = new int[100];
    processData<<<blocks, threads>>>(data, newArr, itemtosearch);

    if (/*conditions met from processed data*/)
    {
        int newtid = tid + 1;

        // Allocate memory for the new placetostore and size_of_place_to_store
        int **newplacetostore = new int *[1000];
        //cudaMallocManaged(&newplacetostore, sizeof(int *));
        int *newsize_of_place_to_store = new int[100];
        //cudaMallocManaged(&newsize_of_place_to_store, sizeof(int));

        rundfs<<<blocks, threads>>>(itemstosearch, data, newplacetostore, newsize_of_place_to_store, newtid);
        cudaDeviceSynchronize(); // <-- this is what I can no longer do in device code

        *size_of_place_to_store += *newsize_of_place_to_store;

        // Free the allocated memory (allocated with new, so freed with delete, not cudaFree)
        delete [] newplacetostore;
        delete [] newsize_of_place_to_store;
    }

    // Store the results
    placetostore[tid] = /*store here*/;
}

int main()
{
    int *itemstosearch;
    int *data;
    int **placetostore;
    int *size_of_place_to_store;

    // Allocate and initialize memory for itemstosearch, data, placetostore, and size_of_place_to_store
    // ...

    rundfs<<<blocks, threads>>>(itemstosearch, data, placetostore, size_of_place_to_store, 0);
    cudaDeviceSynchronize();

    // Process the results and clean up the memory
    // ...

    // Copy results back and print them
    // ...

    return 0;
}

How would I go about doing this, since I can no longer call cudaDeviceSynchronize() in __global__ code?

When posting code on this forum, please format it correctly. As a simple method: edit your post above by clicking the pencil icon below it. In the edit window, select all the code, then click the </> button at the top of the window, then save your changes.


cudaMallocManaged() is not available in device code.
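What is available in device code is in-kernel new/malloc (paired with delete/free, also in device code), which draw from the device heap. A minimal sketch, assuming you may need to raise the default heap size with cudaDeviceSetLimit():

#include &lt;cstdio&gt;
#include &lt;cuda_runtime.h&gt;

__global__ void device_alloc_demo()
{
    // In-kernel new draws from the device heap, not the cudaMalloc pool
    int *buf = new int[100];
    if (buf != nullptr) {
        buf[0] = 42;
        printf("buf[0] = %d\n", buf[0]);
        delete [] buf; // must be freed in device code; cudaFree is not valid for this pointer
    }
}

int main()
{
    // The device heap defaults to 8 MB; enlarge it before launch if needed
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    device_alloc_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}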

There are going to be various ways to address this. I will mention one possibility (CDP). Other possibilities include CDP2 and CUDA Graphs. If time permits I will add some comments around those other options, but it will likely be some time from now.
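In the meantime, to give a rough flavor of the CDP2 direction: CUDA 12 provides the named device-side streams cudaStreamFireAndForget and cudaStreamTailLaunch. A tail-launched kernel does not begin until the launching grid (and all of its fire-and-forget work) has completed, so the work you would previously have placed after a device-side cudaDeviceSynchronize() moves into a tail-launched kernel instead. A minimal sketch of the general shape (compile with -rdc=true; this is not a drop-in for your DFS):

#include &lt;cstdio&gt;
#include &lt;cuda_runtime.h&gt;

__global__ void step2(int *data)
{
    // Guaranteed to run only after step1's entire grid has finished,
    // so data[0] already holds step1's result here
    printf("step2 sees %d\n", data[0]);
}

__global__ void step1(int *data)
{
    data[0] = 41;
    // Tail launch replaces "launch child, then cudaDeviceSynchronize()":
    // step2 will not begin until this grid completes
    step2<<<1, 1, 0, cudaStreamTailLaunch>>>(data);
}

int main()
{
    int *data;
    cudaMalloc(&data, sizeof(int));
    step1<<<1, 1>>>(data);
    cudaDeviceSynchronize(); // host-side synchronization is still allowed
    cudaFree(data);
    return 0;
}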

I’d also like to say that I’m generally skeptical when I see CDP solution methods, and likewise when I see kernel launches with <<<1,1>>>, but I’m setting all that aside.

With respect to CDP, we can possibly address the need here using stream semantics. Since you are already launching kernels from device code, we can use that same methodology (here with <<<1,1>>> launches) to enforce ordering. I’m mostly focusing on your rundfs kernel at this point. The basic methodology would be:


__global__ void more_process(...) // this does everything within the if statement from the rundfs kernel;
                                  // tid, itemstosearch, data, newplacetostore, etc. arrive via the arguments
{
    if (/*conditions met from processed data*/)
    {
        int *newsize_of_place_to_store = new int[100];
        int newtid = tid + 1;

        rundfs<<<1, 1>>>(itemstosearch, data, newplacetostore, newsize_of_place_to_store, newtid);
        store_any_result_data<<<1, 1>>>(...); // e.g. placetostore[tid] = ...; runs after rundfs completes
        free_anything_needed<<<1, 1>>>(...);  // free the allocations, after the results are stored
    }
}

__global__ void rundfs(int *itemstosearch, int *data, int **placetostore, int *size_of_place_to_store, int tid)
{
    // Process data...
    int *newArr = new int[100];
    int **newplacetostore = new int *[1];
    processData<<<blocks, threads>>>(data, newArr, tid);
    more_process<<<1, 1>>>(...); // the if-statement work from your rundfs, ordered after processData
}

Just like your code isn’t complete, neither is mine. The basic idea of launching this work into the null stream like this is that kernel2, launched after kernel1, will not begin until kernel1 is complete. Stream semantics guarantee that, without the need for an explicit cudaDeviceSynchronize().

An important point to keep in mind is that a pointer to local memory in the parent kernel cannot be safely dereferenced in a child kernel. To avoid this, allocate with in-kernel new and pass the allocated pointer to the child by value. Likewise, when you need to free such a pointer, pass it by value to a kernel that performs the free.
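To make that concrete, here is a minimal sketch of the allocate/pass-by-value/free-kernel pattern (the kernel names are illustrative), relying on the null-stream ordering described above:

#include &lt;cuda_runtime.h&gt;

__global__ void consume(int *buf)
{
    // Safe: buf points to the device heap (allocated with new in the
    // parent), not to the parent's local memory
    buf[0] += 1;
}

__global__ void free_buffer(int *buf)
{
    delete [] buf; // freeing happens in its own kernel, after consume
}

__global__ void producer()
{
    // int local[100];       // a local array like this must NOT be passed to a child
    int *buf = new int[100]; // heap allocation: safe to pass by value
    buf[0] = 0;
    consume<<<1, 1>>>(buf);     // launched into the null stream
    free_buffer<<<1, 1>>>(buf); // will not begin until consume is complete
}

int main()
{
    producer<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}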
