I have an issue where launching an individual kernel can take 1-20 ms when the CPU is under heavy load.
I issue a series of about 100 asynchronous CUDA calls, and the total time to issue them varies between 70 and 500 ms, depending on the CPU load. This covers only the asynchronous launch calls themselves; no other CPU computation is performed between the individual launch commands.
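To make the measurement concrete, below is a minimal host-side sketch of how the time spent inside each launch call could be captured. It is only an illustration under stated assumptions: `time_launches`, `summarize`, and `LaunchStats` are hypothetical names, and the `launch` callable stands in for whatever asynchronous CUDA call is being issued (a kernel launch or an async copy on a stream). It uses only the C++ standard library, so it measures how long the CPU thread stays inside each call, not how long the GPU work takes.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: records the host-side duration, in milliseconds,
// of each call to `launch(i)`. In a real program `launch` would wrap an
// asynchronous CUDA call; here it is just an arbitrary callable.
template <typename Launch>
std::vector<double> time_launches(std::size_t count, Launch&& launch) {
    using clock = std::chrono::steady_clock;  // monotonic, safe for intervals
    std::vector<double> ms;
    ms.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const auto t0 = clock::now();
        launch(i);  // the asynchronous request being measured
        const auto t1 = clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Hypothetical summary type: fastest call, slowest call, and the sum.
struct LaunchStats {
    double min_ms;
    double max_ms;
    double total_ms;
};

// Reduces the per-call samples to min / max / total.
LaunchStats summarize(const std::vector<double>& ms) {
    assert(!ms.empty());
    LaunchStats s{ms[0], ms[0], 0.0};
    for (double v : ms) {
        s.min_ms = std::min(s.min_ms, v);
        s.max_ms = std::max(s.max_ms, v);
        s.total_ms += v;
    }
    return s;
}
```

Calling `summarize(time_launches(100, issue_one))`, where `issue_one` is a placeholder for the real launch wrapper, would give the per-call spread and the total for the whole series. A large gap between `min_ms` and `max_ms` under load would suggest the launching thread is being delayed on the CPU side.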
I’m using the very old Tegra K1 platform, so in-kernel launches (dynamic parallelism) are not available.
Is there anything I can do to reduce this launch overhead?