Dynamic Parallelism extreme slowdown

Are there any strategies for mitigating the slowdown caused by dynamic parallelism overhead? I'm currently seeing slowdowns of 20x to over 100x from launching just a single empty, do-nothing thread from my host-launched kernels.
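A minimal sketch of how this kind of overhead could be measured, for anyone wanting to reproduce it. This is not the original code: the kernel names (`parent`, `child`), the helper `timeParent`, the problem size, the repetition count, and the build line are all assumptions chosen for illustration. It times the same host-launched kernel with and without a single empty child launch.

```cuda
// Assumed build line (device-side launches need relocatable device code):
//   nvcc -arch=sm_70 -rdc=true repro.cu -o repro -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Empty child kernel: a single do-nothing thread.
__global__ void child() {}

// Host-launched parent kernel. It does a trivial amount of work per thread;
// if launchChild is set, exactly one thread launches one empty child grid.
__global__ void parent(float *data, int n, bool launchChild) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
    if (launchChild && i == 0) {
        child<<<1, 1>>>();
    }
}

// Returns the average time in milliseconds of one parent launch (hypothetical helper).
static float timeParent(float *d, int n, bool launchChild, int reps) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time setup cost is not included in the timing.
    parent<<<blocks, threads>>>(d, n, launchChild);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        parent<<<blocks, threads>>>(d, n, launchChild);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main() {
    const int n = 1 << 20;   // assumed problem size
    const int reps = 100;    // assumed repetition count

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    float without = timeParent(d, n, false, reps);
    float with    = timeParent(d, n, true, reps);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    printf("parent only:          %f ms per launch\n", without);
    printf("parent + empty child: %f ms per launch\n", with);
    printf("ratio:                %fx\n", with / without);

    cudaFree(d);
    return 0;
}
```

The ratio printed at the end is the figure comparable to the 20x to 100x above. It will depend on the GPU, the CUDA version, and how much work the parent kernel does, so treat any single number as specific to that setup.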

Sorry, meant to post this in the CUDA programming section. Uh, delete post feature request?