__global__ void childKernel(int subDomainId) {
    SubdomainData sub = subdomains[subDomainId];
    // ... do computation ...
    // Commit results to global memory (one thread)
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        sub.time += dt;
        subdomains[subDomainId] = sub;
    }
}
__global__ void parentKernel(SubdomainData* subdomains) {
    int subDomainId = threadIdx.x + blockIdx.x * blockDim.x;
    if (subDomainId >= NUM_SUBDOMAINS) return;
    SubdomainData sub = subdomains[subDomainId];
    while (sub.time < sub.targetTime) {
        // Launch multiple child kernels on the subdomain stream
        childKernel<<<childGridDim, childBlockDim>>>(subDomainId);
        ...
        childKernelN<<<..., cudaStreamTailLaunch>>>(subDomainId);
    }
}
This code is part of a larger adaptive mesh refinement (AMR) simulation, where I aim to avoid any synchronization with the host.
The parent kernel launches several child kernels in a loop until the subdomain’s target time is reached.
Each child updates a global value sub.time that the parent reads to decide whether to continue.
The programming guide states:
To access modifications made by child_launch, a tail_launch kernel is launched into the cudaStreamTailLaunch stream.
Does launching into cudaStreamTailLaunch by itself guarantee that the parent kernel will see up-to-date data written by the child kernels? If so, is it possible to create separate streams for each parent thread to improve kernel concurrency?
What is the best way to ensure the parent thread always reads up-to-date time data written by the child kernels?
The parent kernel cannot reliably observe modifications made by the child kernels.
All global memory operations in the parent thread prior to the child grid’s invocation are visible to the child grid. With the removal of cudaDeviceSynchronize(), it is no longer possible to access the modifications made by the threads in the child grid from the parent grid.
As the programming guide passage you quoted explains, you have to launch a different kernel into cudaStreamTailLaunch; that kernel can then observe the modifications made by the child kernels.
Maybe in your case, something like the following might work.
__global__ void childLauncher(int subDomainId) {
    SubdomainData sub = subdomains[subDomainId];
    if (sub.time < sub.targetTime) {
        // launch child kernels
        ...
        // launch childLauncher into cudaStreamTailLaunch (to observe updated times)
    }
}
Unfortunately that doesn’t resolve my problem, as the loop is in the parent kernel. Is there any other strategy that is better than launching the child kernels directly from the host and synchronizing? Active polling, perhaps?
It’s not reliable to read global data in a parent kernel and expect that it will show updates from a child kernel. One of the requirements for “reliability” would be some sort of synchronization guarantee. The only synchronization guarantees available in CDP 2.0 are stream based, with all that that implies.
“stream based” - “Items issued into a stream execute in issue order. Item B, issued into a stream, will not begin executing until item A, issued previously into that stream, has finished executing”
The parent kernel code is not part of any stream into which the parent can issue child work. Under CDP 2.0, one possible outcome is that none of the child work begins until the parent kernel code has completed.
You might want to revisit the suggestion given by striker159. It will require some refactoring, but a recursive launch process driven from the tail-launch stream is one possible method to ensure memory “consistency” for device code that tests data written by previously launched device kernels.
Yes, it will require refactoring: you would change the realization from a loop in the parent kernel to a recursive/nested launch, where each launch accomplishes one loop iteration, proceeding until the target time is reached.
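For concreteness, the recursive tail-launch pattern might look like the sketch below. This is only an illustration, not a drop-in implementation: the SubdomainData layout, dt, the array size, and the launch shapes are assumptions rather than details from the thread, and it needs a CUDA 12+ toolchain with the CDP2 device runtime, compiled with relocatable device code (-rdc=true).

```cuda
#include <cuda_runtime.h>

struct SubdomainData {
    float time;
    float targetTime;
};

constexpr int NUM_SUBDOMAINS = 64;  // illustrative size
constexpr float dt = 0.01f;         // illustrative time step

__device__ SubdomainData subdomains[NUM_SUBDOMAINS];

__global__ void childKernel(int subDomainId) {
    // ... do one time step of computation on the subdomain ...
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        subdomains[subDomainId].time += dt;  // commit from a single thread
    }
}

// Each invocation performs exactly one iteration of the original loop.
// The tail-launched grid does not start until this grid *and* all work
// it launched (including the fire-and-forget children) have finished,
// so the re-read of subdomains[] sees the children's updates.
__global__ void childLauncher(int subDomainId) {
    SubdomainData sub = subdomains[subDomainId];
    if (sub.time < sub.targetTime) {
        childKernel<<<4, 128, 0, cudaStreamFireAndForget>>>(subDomainId);
        // ... further child kernels for this iteration ...
        childLauncher<<<1, 1, 0, cudaStreamTailLaunch>>>(subDomainId);
    }
}
```

The host (or a parent grid) would launch childLauncher<<<1, 1>>>(id) once per subdomain; each recursion level then replaces one iteration of the original while loop, with no host synchronization in between.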
@Robert_Crovella you’re right! @striker159 thanks, this is actually a solution I can definitely work with.
If I understand correctly, all child kernels from earlier recursion levels are completed before the next child-launching kernel runs (including the parent itself), which means the total number of active kernels is always bounded.
Does the maximum recursion depth still apply in this situation?
I can create a hybrid version of the two approaches, since there’s a minimum number of time steps - determined by the resolution difference from the parent domain - that I already know in advance.
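That hybrid idea could be sketched roughly as follows (again with illustrative names and launch shapes, and assuming a single-thread launcher grid): the known minimum number of steps is enqueued into one device-created stream, where launches execute in issue order with no time checks in between, and only afterwards does the data-dependent recursive check take over via the tail-launch stream.

```cuda
// Hypothetical hybrid launcher: 'knownSteps' is the minimum step count
// known in advance from the resolution difference with the parent domain.
__global__ void hybridLauncher(int subDomainId, int knownSteps) {
    // Intended to be launched as <<<1, 1>>>.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    // Launches into the same stream execute in issue order, so no
    // time check is needed between these guaranteed iterations.
    for (int i = 0; i < knownSteps; ++i) {
        childKernel<<<4, 128, 0, s>>>(subDomainId);
    }
    cudaStreamDestroy(s);  // already-enqueued work still runs to completion
    // The tail launch waits for everything launched above, then the
    // recursive childLauncher handles the remaining, data-dependent steps.
    childLauncher<<<1, 1, 0, cudaStreamTailLaunch>>>(subDomainId);
}
```

This front-loads the launches that must happen anyway and only pays the one-launch-per-check cost of the recursive scheme for the final, unknown portion of the time stepping.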