Dynamic Parallelism extreme slowdown

Are there any strategies for mitigating the slowdown caused by dynamic parallelism overhead? I’m currently seeing slowdowns of 20x to over 100x from launching just a single empty, do-nothing thread from my host-launched kernels.
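
For anyone who wants to reproduce this, a minimal sketch of the pattern described above might look like the following. The kernel names (`childKernel`, `parentKernel`), the launch dimensions, and the timing harness are all hypothetical and not taken from the original code. It times the same parent kernel once without and once with a single empty child launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty child kernel: one thread that does nothing.
__global__ void childKernel() {}

// Hypothetical parent kernel. When launchChild is true, a single thread
// launches one child grid containing one thread (dynamic parallelism).
__global__ void parentKernel(bool launchChild) {
    if (launchChild && blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<1, 1>>>();
    }
}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int pass = 0; pass < 2; ++pass) {
        bool launchChild = (pass == 1);

        // Warm-up launch so one-time initialization is not timed.
        parentKernel<<<1, 32>>>(launchChild);
        cudaDeviceSynchronize();

        // Timed launch. A parent grid is not considered complete until
        // its child grids have completed, so the child is included.
        cudaEventRecord(start);
        parentKernel<<<1, 32>>>(launchChild);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%s child launch: %.4f ms\n",
               launchChild ? "with" : "without", ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Dynamic parallelism needs relocatable device code and the device runtime library, so a build line along the lines of `nvcc -rdc=true repro.cu -o repro -lcudadevrt` should work on a GPU of compute capability 3.5 or higher (the file name is an assumption).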