cudaLaunchHostFunc blocking work on Linux

I’m trying to do some runtime instrumentation, and have experimented a bit with events and cudaLaunchHostFunc. One issue with events is that when I run a batch of very fast operations, I seem to hit the granularity of the event timers: if I have 1000 operations that take 400 ns each, I get a report of essentially 0 ns, whereas I would expect a total of about 400 us.
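To make the granularity issue concrete, here is a minimal sketch of timing the whole batch with one pair of events instead of one pair per operation, then dividing by the count. The documentation for cudaEventElapsedTime states a resolution of around 0.5 microseconds, so a single 400 ns operation falls below it. The names tinyKernel and perItemNs are hypothetical stand-ins, not part of my real code, and the sketch assumes all the work runs back to back on a single stream. The measured total also includes launch gaps between kernels, so the result is an upper bound on per-operation time rather than an exact figure.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical stand-in for a very short piece of GPU work.
__global__ void tinyKernel(int *counter) { atomicAdd(counter, 1); }

// Convert a batch total in milliseconds to an average per item in nanoseconds.
static double perItemNs(float totalMs, int count) {
    return static_cast<double>(totalMs) * 1.0e6 / count;
}

int main() {
    const int kLaunches = 1000;  // assumed batch size, matching the example above

    int *dCounter = nullptr;
    CHECK(cudaMalloc(&dCounter, sizeof(int)));
    CHECK(cudaMemset(dCounter, 0, sizeof(int)));

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One event pair brackets the entire batch, not each launch.
    CHECK(cudaEventRecord(start, stream));
    for (int i = 0; i < kLaunches; ++i) {
        tinyKernel<<<1, 1, 0, stream>>>(dCounter);
    }
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));

    float totalMs = 0.0f;
    CHECK(cudaEventElapsedTime(&totalMs, start, stop));
    printf("batch: %.3f us total, about %.1f ns per launch\n",
           totalMs * 1000.0f, perItemNs(totalMs, kLaunches));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dCounter));
    return 0;
}
```

This only helps when an aggregate number is acceptable; it does not give a separate timing for each individual operation.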

Because of that, I’m trying the less performant cudaLaunchHostFunc, but I’m finding that it seems to synchronize streams, which is really not ideal (perhaps the callbacks are all made from a single CPU thread?). I found this post:

It seems to indicate that hardware-accelerated GPU scheduling could fix this. Unfortunately, I can’t find how to turn this feature on in Linux. Suggestions?

Hardware-accelerated GPU scheduling pertains to Windows only. As already indicated in that thread, the issue reported there was observed on Windows; when tested on Linux, it did not manifest (see the 2nd post in that thread). There is no corresponding switch or control in Linux.

Interesting. I’m definitely observing serialization. Bummer that this can’t be controlled from Linux. Thanks for the response.
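For anyone who wants to check this on their own setup, here is a minimal sketch of the kind of test that shows the behavior. It enqueues one sleeping host function on each of several non-blocking streams and compares the wall-clock time with the sum of the sleeps. The names slowCallback and looksSerialized, the 0.75 threshold, and the stream count and sleep length are all hypothetical choices for illustration, and the result is only a heuristic, not a statement about how the runtime dispatches callbacks.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Host function: sleeps to stand in for slow instrumentation work.
// It makes no CUDA API calls, as required for cudaLaunchHostFunc callbacks.
static void CUDART_CB slowCallback(void *userData) {
    int ms = *static_cast<int *>(userData);
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}

// Heuristic (assumed threshold): callbacks look serialized if the wall time
// is close to the sum of all sleeps rather than to a single sleep.
static bool looksSerialized(double wallMs, int sleepMs, int numStreams) {
    return wallMs > 0.75 * sleepMs * numStreams;
}

int main() {
    const int kStreams = 4;  // assumed number of independent streams
    int sleepMs = 100;       // assumed per-callback duration

    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kStreams; ++i) {
        // One host function per stream; the streams have no dependencies.
        cudaLaunchHostFunc(streams[i], slowCallback, &sleepMs);
    }
    cudaDeviceSynchronize();  // wait for every callback to finish
    auto t1 = std::chrono::steady_clock::now();

    double wallMs =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("wall time %.1f ms for %d callbacks of %d ms: %s\n", wallMs,
           kStreams, sleepMs,
           looksSerialized(wallMs, sleepMs, kStreams)
               ? "looks serialized"
               : "looks concurrent");

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

A wall time near 400 ms with these settings would match the serialization I am seeing, while a wall time near 100 ms would suggest the callbacks ran concurrently.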