Shared memory with CUDA Dynamic Parallelism (CDP) and CDP2

I am a little unclear on the documentation for allocating shared memory within child kernels with CDP 1 and 2.

  • Do child kernels always inherit the dynamic shared memory size of the parent? (Note: I am asking about size only; shared memory isn't actually shared between child and parent, and this is clear from the documentation.) I have seen issues in CDP1 where I had to explicitly set the size, e.g. <<<1, 256, shared_size_child>>>, or it didn't work correctly on CUDA 12.0 with the sm_70 arch (see the sketch after this list).

  • What are the limits on the thread block size when launching child kernels?

  • With CDP2, a follow-up question on this sentence from the documentation: "The only way to access the modifications made by the threads in the child grid before the parent grid exits is via a kernel launched into the cudaStreamTailLaunch stream." Does this mean that if I launch 2 child kernels from the parent kernel, and the second child kernel is launched into cudaStreamTailLaunch, then the parent kernel can also see the modifications to global memory made by the first kernel? Or can only the second kernel launch, the one into cudaStreamTailLaunch, see the modifications of the first child kernel? I thought it was the latter, but the following sentence from the documentation is a little unclear: "It should be possible to use a tail launch to achieve the same functionality as a cudaDeviceSynchronize() in most cases."
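For reference, here is a minimal sketch of what I mean by setting the size explicitly (hypothetical kernel names; CDP code compiled with -rdc=true):

```
// Hypothetical child kernel using dynamically sized shared memory.
__global__ void child(float *buf) {
    extern __shared__ float s[];  // sized by the launch configuration
    s[threadIdx.x] = buf[threadIdx.x];
    __syncthreads();
    buf[threadIdx.x] = s[(threadIdx.x + 1) % blockDim.x];
}

__global__ void parent(float *buf) {
    if (threadIdx.x == 0) {
        // Passing the size explicitly here made the child work correctly;
        // relying on "inheritance" did not.
        size_t shared_size_child = 256 * sizeof(float);
        child<<<1, 256, shared_size_child>>>(buf);
    }
}
```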

I wasn't aware of any differences in allocating shared memory using CDP vs. an ordinary host launch.

I don't see that in the documentation anywhere. Certainly for sizes of 48KB or less, I'm not aware of any issues with the shared memory size of a child kernel.

They are the same as when launching kernels from the host.

As for the third point: I don't see that in the documentation anywhere.

It's mentioned right here: CUDA Dynamic Parallelism API and Principles | NVIDIA Technical Blog

A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size.

So have you answered your own question then?

Refer to my first point. The documentation I pointed to says it will inherit, but in CDP 1 and 2 with CUDA 12.0 I did not see this behaviour: it did not inherit the size and I had to set it explicitly. So either the documentation was wrong from the start, or the behaviour changed in some version of CUDA?
Also, the third point from my first post is still unanswered.

That blog predates the introduction of CDP2 by a long way. In the 2014 timeframe, "the L1 cache / shared memory configuration" referred to a split that I don't think most folks are using anymore, and AFAIK it is not relevant for most modern GPUs. That term does not refer to the size of shared memory that you ask for via the kernel launch.

For shared memory usage of 48KB or less in a child kernel, you have to set the size explicitly, just as you would with any other kernel launch. You are interpreting something that was not intended by that excerpt from that blog.

For usage beyond 48KB, I would have to refresh my memory and do some checking w.r.t. CDP. In a host-side launch there is some configuration that has to be enabled to make it possible. Even in that case, the expected size must be specified explicitly, at kernel launch, for every launch.
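The host-side configuration alluded to above is, I believe, the cudaFuncAttributeMaxDynamicSharedMemorySize opt-in. A minimal sketch (hypothetical kernel name; assumes a GPU such as sm_70 that supports more than 48KB of dynamic shared memory):

```
#include <cuda_runtime.h>

// Hypothetical kernel that wants 64KB of dynamic shared memory.
__global__ void bigSmemKernel(float *out) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
    int bytes = 64 * 1024;  // beyond the 48KB default limit

    // Host-side opt-in: raise this kernel's dynamic shared memory ceiling.
    cudaFuncSetAttribute(bigSmemKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         bytes);

    float *out = nullptr;
    cudaMalloc(&out, 256 * sizeof(float));
    // The size must still be specified explicitly at every launch.
    bigSmemKernel<<<1, 256, bytes>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```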

My recommendation would be to cite links for all excerpts you are quoting.

The 3rd question's highlighted sentences are entirely from the CUDA 12.0 documentation:
https://docs.nvidia.com/cuda/archive/12.0.0/cuda-c-programming-guide/index.html#execution-configuration

No, it does not mean that. It means that since cudaStreamTailLaunch is always scheduled after "completion" of the parent kernel, but before the implicit sync point signified by parent kernel completion, modifications from any "fire and forget" or ordinary created-stream child kernel launched via the parent kernel will be visible in the kernel that was launched into the tail stream.

Since the Tail Stream has these particular semantics:

  1. Does not begin until all other work associated with the parent kernel is complete.
  2. Begins and executes prior to the implicit sync point associated with parent kernel completion.

then it can be used semantically "as if" there were a cudaDeviceSynchronize() between the work associated with the parent kernel (including the child kernels identified above, in ordinary and "fire and forget" streams) and the work associated with the child kernel launched into the tail stream.
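A minimal sketch of these semantics (hypothetical kernel names; CDP2 on CUDA 12.x, compiled with -rdc=true):

```
#include <cstdio>

__global__ void worker(int *data) {
    data[threadIdx.x] += 1;  // modifies global memory
}

__global__ void tailChild(int *data) {
    // Begins only after all other work of the parent grid (including the
    // fire-and-forget child) has completed, so worker's writes are visible.
    printf("data[%d] = %d\n", threadIdx.x, data[threadIdx.x]);
}

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        worker<<<1, 32, 0, cudaStreamFireAndForget>>>(data);
        tailChild<<<1, 32, 0, cudaStreamTailLaunch>>>(data);
        // The parent itself must not read data here expecting to see
        // worker's modifications.
    }
}
```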


Thanks, that makes sense. Even though both kernels are launched into the NULL stream, the ordering is not guaranteed (it follows the first pic attached). Whereas if I do want strict ordering, I should be using a named stream. According to the doc:

Note that while named streams are shared by all threads within a grid, the implicit NULL stream is only shared by all threads within a thread block. If multiple threads in a thread block launch into the implicit stream, then these launches will be executed in-order. If multiple threads in different thread blocks launch into the implicit stream, then these launches may be executed concurrently. If concurrency is desired for launches by multiple threads within a thread block, explicit named streams should be used.
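So, for example, if several threads of one block each want their child launches to run concurrently, each can create its own device-side stream rather than using the implicit NULL stream. A minimal sketch, assuming the device runtime's cudaStreamCreateWithFlags (device-created streams must use cudaStreamNonBlocking):

```
__global__ void child(int id) { /* ... */ }

__global__ void parent() {
    // NULL-stream launches from threads of the same block are serialized;
    // a per-thread named stream allows them to run concurrently.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    child<<<1, 1, 0, s>>>(threadIdx.x);
    cudaStreamDestroy(s);  // queued work still runs to completion
}
```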
