Starting a nested kernel is slower than the alternative

I'm working on a raycasting renderer and am currently implementing antialiasing. I have a big render_kernel that handles all of my rendering, and my naive approach was to put an antialiasing loop inside render_kernel (essentially running the kernel's work a few times in a row). Once I got this working, I moved the antialiasing code to another kernel, antialiasing_kernel, which is launched from within render_kernel. The goal was to speed up antialiasing by doing it in parallel, but this turned out to be several times slower than my naive approach. Is launching nested kernels a slow operation?

Calling a kernel from device code has roughly the same overhead (~5-50 µs per call) as calling a kernel from host code.
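If you want a rough number for your own system, a minimal sketch like the following times back-to-back launches of an empty kernel from the host. The kernel name empty_kernel and the launch count are made up for illustration, and what it reports is average launch throughput cost rather than the latency of a single launch, so treat the result as a ballpark only.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does no work, so the time measured is launch cost.
__global__ void empty_kernel() {}

int main()
{
    const int launches = 10000;  // arbitrary count chosen for illustration

    // Warm-up launch so one-time initialization is not included in the timing.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i)
        empty_kernel<<<1, 1>>>();
    // Wait until every queued launch has finished before stopping the clock.
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(stop - start).count();
    printf("average cost per empty launch: %.2f us\n", us / launches);
    return 0;
}
```

Multiplying the reported figure by the number of nested launches your render_kernel issues per frame gives a first estimate of how much time goes to launching alone.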

There may be other factors affecting performance that are impossible to judge without profiling. Stopping or displacing a large, efficiently running kernel in order to run one or several small, inefficiently running kernels may be unwise; the GPU does not have infinite capacity.
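For comparison, here is a minimal sketch of the in-kernel approach from the question, where each thread loops over its own antialiasing samples and no extra kernels are launched. It is not your renderer: shade_sample, the regular sub-pixel grid, and the image layout are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-sample work; the real ray casting
// would go here.
__device__ float shade_sample(float u, float v)
{
    return 0.5f * (u + v);
}

// One thread per pixel. The antialiasing samples are taken in a plain loop
// inside the kernel, so no child kernels are launched.
__global__ void render_kernel(float *image, int width, int height,
                              int samples_per_axis)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int sy = 0; sy < samples_per_axis; ++sy) {
        for (int sx = 0; sx < samples_per_axis; ++sx) {
            // Sub-pixel offsets on a regular grid inside the pixel.
            float u = (x + (sx + 0.5f) / samples_per_axis) / width;
            float v = (y + (sy + 0.5f) / samples_per_axis) / height;
            sum += shade_sample(u, v);
        }
    }
    // Average the samples to get the antialiased pixel value.
    image[y * width + x] = sum / (samples_per_axis * samples_per_axis);
}

int main()
{
    const int width = 256, height = 256, samples_per_axis = 2;
    float *d_image = nullptr;
    cudaMalloc(&d_image, width * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    render_kernel<<<grid, block>>>(d_image, width, height, samples_per_axis);

    cudaError_t err = cudaDeviceSynchronize();
    printf("render_kernel: %s\n", cudaGetErrorString(err));

    cudaFree(d_image);
    return err == cudaSuccess ? 0 : 1;
}
```

With one thread per pixel the grid already exposes plenty of parallelism, so a short per-thread sample loop keeps the large kernel running without paying any per-launch cost. Whether that holds for your renderer is something only a profiler run can confirm.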