Kernel launches under heavy CPU load are very slow

Launching an individual kernel under heavy CPU load can take 1-20 ms.

I issue a series of about 100 asynchronous CUDA requests, and the total time to issue them varies between 70 and 500 ms, depending on the CPU load. This covers only the asynchronous launches; no other CPU computation is performed between the individual launch commands.
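For reference, here is a minimal sketch of how this kind of measurement could be reproduced. It is not the original code: `dummy_kernel` and `time_launches` are hypothetical names, the kernel is empty, and the launch configuration of one block with one thread is an assumption chosen so that only the host-side cost of the launch call is timed.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical empty kernel; it stands in for the real work.
__global__ void dummy_kernel() {}

// Returns the host-side duration of each launch call in milliseconds.
// Only the time spent inside the launch call itself is measured.
std::vector<double> time_launches(int count) {
    using clock = std::chrono::steady_clock;
    std::vector<double> ms(count);
    for (int i = 0; i < count; ++i) {
        auto t0 = clock::now();
        dummy_kernel<<<1, 1>>>();  // asynchronous: normally returns before the kernel finishes
        auto t1 = clock::now();
        ms[i] = std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    cudaDeviceSynchronize();  // wait for the GPU only after all launches are timed
    return ms;
}

int main() {
    cudaFree(0);       // force context creation so it is not charged to the first launch
    time_launches(1);  // warm-up launch

    std::vector<double> ms = time_launches(100);
    double total = 0.0, worst = 0.0;
    for (double v : ms) {
        total += v;
        if (v > worst) worst = v;
    }
    std::printf("total %.3f ms, worst single launch %.3f ms\n", total, worst);
    return 0;
}
```

Running this once on an idle system and once while the CPU cores are saturated by other processes should show whether the spread comes from the launch calls themselves.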

I’m using the very old Tegra K1 platform, so in-kernel launches (dynamic parallelism) are not available.

Is there anything I can do to speed it up?

The Jetson TK1 has a variety of settings that can be adjusted to maximize CPU performance. Have you applied those?

Yes, I’ve played with those. They are already tuned for maximum performance.

I think every kernel launch causes a context switch. I see an ioctl system call that probably triggers it. Is there a way to avoid that?

I’m not aware of any way.

You might want to ask any further questions on one of the Jetson forums.