__global__ void childKernel(int subDomainId) {
    SubdomainData sub = subdomains[subDomainId];
    // ... do computation ...
    // Commit results to global memory (one thread)
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        sub.time += dt;
        subdomains[subDomainId] = sub;
    }
}
__global__ void parentKernel(SubdomainData* subdomains) {
    int subDomainId = threadIdx.x + blockIdx.x * blockDim.x;
    if (subDomainId >= NUM_SUBDOMAINS) return;
    SubdomainData sub = subdomains[subDomainId];
    while (sub.time < sub.targetTime) {
        // Launch multiple child kernels on the subdomain stream
        childKernel<<<childGridDim, childBlockDim>>>(subDomainId);
        ...
        childKernelN<<<..., cudaStreamTailLaunch>>>(subDomainId);
    }
}
This code is part of a larger adaptive mesh refinement (AMR) simulation, where I aim to avoid any synchronization with the host.
The parent kernel launches several child kernels in a loop until the subdomain’s target time is reached.
Each child updates a global value sub.time that the parent reads to decide whether to continue.
The programming guide states:
To access modifications made by child_launch, a tail_launch kernel is launched into the cudaStreamTailLaunch stream.
Does launching into cudaStreamTailLaunch by itself guarantee that the parent kernel will see up-to-date data written by the child kernels? If so, is it possible to create separate streams for each parent thread to improve kernel concurrency?
What is the best way to ensure the parent thread always reads up-to-date time data written by the child kernels?
The parent kernel cannot reliably observe modifications made by the child kernels.
All global memory operations in the parent thread prior to the child grid’s invocation are visible to the child grid. With the removal of cudaDeviceSynchronize(), it is no longer possible to access the modifications made by the threads in the child grid from the parent grid.
As the programming guide passage you quoted explains, you have to launch a different kernel into cudaStreamTailLaunch; that kernel can then observe the modifications made by the child kernels.
Maybe in your case, something like the following might work.
__global__ void childLauncher(int subDomainId) {
    SubdomainData sub = subdomains[subDomainId];
    if (sub.time < sub.targetTime) {
        // launch child kernels
        ...
        // launch childLauncher into cudaStreamTailLaunch (to observe updated times)
    }
}
Unfortunately that doesn’t resolve my problem, as the loop is in the parent kernel. Is there any other strategy that is better than launching the child kernels directly from the host and synchronizing? Active polling, perhaps?
It’s not reliable to read global data in a parent kernel and expect that it will show updates from a child kernel. One of the requirements for “reliability” would be some sort of synchronization guarantee. The only synchronization guarantees available in CDP 2.0 are stream based, with all that that implies.
“stream based” - “Items issued into a stream execute in issue order. Item B, issued into a stream, will not begin executing until item A, issued previously into that stream, has finished executing”
The parent kernel code is not part of any stream into which the parent can issue child work. Under CDP 2.0, one possible outcome is that none of the child work begins until the parent kernel code has completed.
You might want to revisit the suggestion given by striker159. It will require some refactoring, but a recursive launch process driven from the tail-launch stream is one possible method to ensure memory “consistency” for device code that tests data written by previously launched device kernels.
Yes, it will require refactoring: you would change the realization from a loop in the parent kernel to a recursive/nested launch, where each launch accomplishes one loop iteration, proceeding until the target time is reached.
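For concreteness, the recursive tail-launch pattern might look like the sketch below. This is only an illustration, not a drop-in implementation: the SubdomainData layout, dt, the array size, and the launch shapes are assumptions rather than details from the thread, and it needs a CUDA 12+ toolchain with the CDP2 device runtime, compiled with relocatable device code (-rdc=true).

```cuda
#include <cuda_runtime.h>

struct SubdomainData {
    float time;
    float targetTime;
};

constexpr int NUM_SUBDOMAINS = 64;  // illustrative size
constexpr float dt = 0.01f;         // illustrative time step

__device__ SubdomainData subdomains[NUM_SUBDOMAINS];

__global__ void childKernel(int subDomainId) {
    // ... do one time step of computation on the subdomain ...
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        subdomains[subDomainId].time += dt;  // commit from a single thread
    }
}

// Each invocation performs exactly one iteration of the original loop.
// The tail-launched grid does not start until this grid *and* all work
// it launched (including the fire-and-forget children) have finished,
// so the re-read of subdomains[] sees the children's updates.
__global__ void childLauncher(int subDomainId) {
    SubdomainData sub = subdomains[subDomainId];
    if (sub.time < sub.targetTime) {
        childKernel<<<4, 128, 0, cudaStreamFireAndForget>>>(subDomainId);
        // ... further child kernels for this iteration ...
        childLauncher<<<1, 1, 0, cudaStreamTailLaunch>>>(subDomainId);
    }
}
```

The host (or a parent grid) would launch childLauncher<<<1, 1>>>(id) once per subdomain; each recursion level then replaces one iteration of the original while loop, with no host synchronization in between.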
@Robert_Crovella you’re right! @striker159 thanks, this is actually a solution I can definitely work with.
If I understand correctly, all child kernels from earlier recursion levels are completed before the next child-launching kernel runs (including the parent itself), which means the total number of active kernels is always bounded.
Does the maximum recursion depth still apply in this situation?
I can create a hybrid version of the two approaches, since there’s a minimum number of time steps - determined by the resolution difference from the parent domain - that I already know in advance.
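That hybrid idea could be sketched roughly as follows (again with illustrative names and launch shapes, and assuming a single-thread launcher grid): the known minimum number of steps is enqueued into one device-created stream, where launches execute in issue order with no time checks in between, and only afterwards does the data-dependent recursive check take over via the tail-launch stream.

```cuda
// Hypothetical hybrid launcher: 'knownSteps' is the minimum step count
// known in advance from the resolution difference with the parent domain.
__global__ void hybridLauncher(int subDomainId, int knownSteps) {
    // Intended to be launched as <<<1, 1>>>.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    // Launches into the same stream execute in issue order, so no
    // time check is needed between these guaranteed iterations.
    for (int i = 0; i < knownSteps; ++i) {
        childKernel<<<4, 128, 0, s>>>(subDomainId);
    }
    cudaStreamDestroy(s);  // already-enqueued work still runs to completion
    // The tail launch waits for everything launched above, then the
    // recursive childLauncher handles the remaining, data-dependent steps.
    childLauncher<<<1, 1, 0, cudaStreamTailLaunch>>>(subDomainId);
}
```

This front-loads the launches that must happen anyway and only pays the one-launch-per-check cost of the recursive scheme for the final, unknown portion of the time stepping.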