Shared memory with CUDA Dynamic Parallelism (CDP) and CDP2

I am a little unclear on the documentation for allocating shared memory within child kernels with CDP 1 and 2.

  • Do child kernels always inherit the dynamic shared memory size of the parent? (Note: I am asking about size only; shared memory isn't actually shared between child and parent, and this is clear from the documentation.) I have seen issues in CDP1 where I had to explicitly set the size, e.g. <<<1, 256, shared_size_child>>>, or it didn't work correctly on CUDA 12.0 with the sm_70 arch (see the sketch after this list).

  • What are the limits on the thread block size when launching child kernels?

  • With CDP2, a follow-up question on this sentence from the documentation: "The only way to access the modifications made by the threads in the child grid before the parent grid exits is via a kernel launched into the cudaStreamTailLaunch stream." Does this mean that if I launch 2 child kernels from the parent kernel, and the second child kernel is launched into cudaStreamTailLaunch, then the parent kernel can also see the modifications to global memory made by the first kernel? Or can only the second kernel launch, the one into cudaStreamTailLaunch, see the modifications of the first child kernel? I thought it was the latter, but the following sentence from the documentation is a little unclear: "It should be possible to use a tail launch to achieve the same functionality as a cudaDeviceSynchronize() in most cases."
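For reference, here is a minimal sketch of what I mean by setting the size explicitly (hypothetical kernel names; CDP code compiled with -rdc=true):

```
// Hypothetical child kernel using dynamically sized shared memory.
__global__ void child(float *buf) {
    extern __shared__ float s[];  // sized by the launch configuration
    s[threadIdx.x] = buf[threadIdx.x];
    __syncthreads();
    buf[threadIdx.x] = s[(threadIdx.x + 1) % blockDim.x];
}

__global__ void parent(float *buf) {
    if (threadIdx.x == 0) {
        // Passing the size explicitly here made the child work correctly;
        // relying on "inheritance" did not.
        size_t shared_size_child = 256 * sizeof(float);
        child<<<1, 256, shared_size_child>>>(buf);
    }
}
```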

I wasn't aware of any differences in allocating shared memory using CDP vs. an ordinary host launch.

I don't see that in the documentation anywhere. Certainly for sizes of 48KB or less, I'm not aware of any issues with the shared memory size of a child kernel.

They are the same as when launching kernels from the host.

As for the third point: I don't see that in the documentation anywhere.

It's mentioned right here: CUDA Dynamic Parallelism API and Principles | NVIDIA Technical Blog

A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size.

So have you answered your own question then?

Refer to my first point. The documentation I pointed to says it will inherit, but in CDP 1 and 2 with CUDA 12.0 I did not see this behaviour: it did not inherit the size and I had to set it explicitly. So either the documentation was wrong from the start, or the behaviour changed in some version of CUDA?
Also, the third point from my first post is still unanswered.

That blog predates the introduction of CDP2 by a long way. In the 2014 timeframe, "the L1 cache / shared memory configuration" referred to a split that I don't think most folks are using anymore, and AFAIK it is not relevant for most modern GPUs. That term does not refer to the size of shared memory that you ask for via the kernel launch.

For shared memory usage of 48KB or less in a child kernel, you have to set the size explicitly, just as you would with any other kernel launch. You are interpreting something that was not intended by that excerpt from that blog.

For usage beyond 48KB, I would have to refresh my memory and do some checking w.r.t. CDP. In a host-side launch there is some configuration that has to be enabled to make it possible. Even in that case, the expected size must be specified explicitly, at kernel launch, for every launch.
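The host-side configuration alluded to above is, I believe, the cudaFuncAttributeMaxDynamicSharedMemorySize opt-in. A minimal sketch (hypothetical kernel name; assumes a GPU such as sm_70 that supports more than 48KB of dynamic shared memory):

```
#include <cuda_runtime.h>

// Hypothetical kernel that wants 64KB of dynamic shared memory.
__global__ void bigSmemKernel(float *out) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
    int bytes = 64 * 1024;  // beyond the 48KB default limit

    // Host-side opt-in: raise this kernel's dynamic shared memory ceiling.
    cudaFuncSetAttribute(bigSmemKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         bytes);

    float *out = nullptr;
    cudaMalloc(&out, 256 * sizeof(float));
    // The size must still be specified explicitly at every launch.
    bigSmemKernel<<<1, 256, bytes>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```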

My recommendation would be to cite links for all excerpts you are quoting.

The 3rd question's highlighted sentences are entirely from the CUDA 12.0 documentation:
https://docs.nvidia.com/cuda/archive/12.0.0/cuda-c-programming-guide/index.html#execution-configuration

No, it does not mean that. It means that since cudaStreamTailLaunch is always scheduled after "completion" of the parent kernel, but before the implicit sync point signified by parent kernel completion, modifications from any "fire and forget" or ordinary created-stream child kernel launched via the parent kernel will be visible in the kernel that was launched into the tail stream.

Since the Tail Stream has these particular semantics:

  1. Does not begin until all other work associated with the parent kernel is complete.
  2. Begins and executes prior to the implicit sync point associated with parent kernel completion.

then it can be used semantically "as if" there were a cudaDeviceSynchronize() between the work associated with the parent kernel (including the child kernels identified above, in ordinary and "fire and forget" streams) and the work associated with the child kernel launched into the tail stream.
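A minimal sketch of these semantics (hypothetical kernel names; CDP2 on CUDA 12.x, compiled with -rdc=true):

```
#include <cstdio>

__global__ void worker(int *data) {
    data[threadIdx.x] += 1;  // modifies global memory
}

__global__ void tailChild(int *data) {
    // Begins only after all other work of the parent grid (including the
    // fire-and-forget child) has completed, so worker's writes are visible.
    printf("data[%d] = %d\n", threadIdx.x, data[threadIdx.x]);
}

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        worker<<<1, 32, 0, cudaStreamFireAndForget>>>(data);
        tailChild<<<1, 32, 0, cudaStreamTailLaunch>>>(data);
        // The parent itself must not read data here expecting to see
        // worker's modifications.
    }
}
```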


Thanks, that makes sense. Even though both kernels are launched into the NULL stream, the ordering is not guaranteed (it follows the first pic attached). Whereas if I do want strict ordering, I should be using a named stream. According to the doc:

Note that while named streams are shared by all threads within a grid, the implicit NULL stream is only shared by all threads within a thread block. If multiple threads in a thread block launch into the implicit stream, then these launches will be executed in-order. If multiple threads in different thread blocks launch into the implicit stream, then these launches may be executed concurrently. If concurrency is desired for launches by multiple threads within a thread block, explicit named streams should be used.
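So, for example, if several threads of one block each want their child launches to run concurrently, each can create its own device-side stream rather than using the implicit NULL stream. A minimal sketch, assuming the device runtime's cudaStreamCreateWithFlags (device-created streams must use cudaStreamNonBlocking):

```
__global__ void child(int id) { /* ... */ }

__global__ void parent() {
    // NULL-stream launches from threads of the same block are serialized;
    // a per-thread named stream allows them to run concurrently.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    child<<<1, 1, 0, s>>>(threadIdx.x);
    cudaStreamDestroy(s);  // queued work still runs to completion
}
```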
