PyCUDA: how to enable SCHED_YIELD to reduce CPU usage

I am running inference with TensorRT. While the CPU waits for the asynchronous TensorRT inference to finish, CPU usage stays high. My goal is to free the CPU so that other threads can use it.

My understanding is that I need the SCHED_YIELD flag. However, I must not be setting it correctly, since it doesn't seem to have the desired effect.

My code:

import pycuda.driver as cuda
cuda.init()
cuda.Device(0).make_context(flags=cuda.ctx_flags.SCHED_YIELD)

If I change the flag to cuda.ctx_flags.SCHED_SPIN I get the same performance.
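One possible reason the two flags look alike: a yielding wait still polls, so when no other thread is ready to run it can use about as much CPU as a spin. The sketch below is only a pure-Python analogy and not the CUDA driver's actual code. The helper names (`cpu_time_while_waiting`, `yield_wait`, `blocking_wait`) are made up for illustration, and the timer thread stands in for the GPU finishing its work.

```python
import threading
import time


def cpu_time_while_waiting(wait_fn, delay=0.2):
    """Return the CPU seconds the calling thread uses inside wait_fn.

    A timer thread sets the event after `delay` seconds; it plays the
    role of the GPU signalling that the asynchronous work is done.
    """
    done = threading.Event()
    timer = threading.Timer(delay, done.set)
    start = time.thread_time()  # CPU time of this thread only
    timer.start()
    wait_fn(done)
    used = time.thread_time() - start
    timer.join()
    return used


def yield_wait(done):
    # Poll the flag and give up the time slice on every miss.
    # This is the analogy for a yield-style wait: polite, but still a loop.
    while not done.is_set():
        time.sleep(0)


def blocking_wait(done):
    # Sleep on a synchronization primitive until signalled.
    # This is the analogy for a blocking-sync wait.
    done.wait()


yield_cpu = cpu_time_while_waiting(yield_wait)
block_cpu = cpu_time_while_waiting(blocking_wait)
print(f"yield-style wait used {yield_cpu:.4f} s of CPU")
print(f"blocking wait used    {block_cpu:.4f} s of CPU")
```

On an otherwise idle machine the yield-style loop should report far more CPU time than the blocking wait, which is the behaviour a CPU monitor would show as "high usage" in both the spin and yield cases.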

I also tried to pass the flag when initializing CUDA:
cuda.init(flags=cuda.ctx_flags.SCHED_YIELD)
but this raises the error:
pycuda._driver.LogicError: cuInit failed: invalid argument

How can I enable the desired behavior? Is SCHED_YIELD really what I need?

I am running inference with this code: https://stackoverflow.com/a/56668893/1315621, adapted so that multiple threads (and therefore multiple contexts) each call TensorRT inference.
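For comparison, one alternative that may be worth trying is the blocking-sync scheduling flag, which asks the driver to put the waiting CPU thread to sleep on a synchronization primitive instead of polling. The sketch below assumes PyCUDA exposes it as cuda.ctx_flags.SCHED_BLOCKING_SYNC (the wrapper for CU_CTX_SCHED_BLOCKING_SYNC). The helper name make_blocking_context is hypothetical, and whether this actually lowers CPU usage with TensorRT has not been verified here.

```python
def make_blocking_context(cuda=None, device_index=0):
    """Create a context whose waits block instead of spinning or yielding.

    `cuda` is the pycuda.driver module. It is imported lazily so that this
    file can still be loaded on a machine without PyCUDA installed.
    """
    if cuda is None:
        import pycuda.driver as cuda  # requires PyCUDA and a CUDA-capable GPU

    # cuInit takes no scheduling flags; they belong to the context.
    cuda.init()

    # Assumption: SCHED_BLOCKING_SYNC is available in the installed PyCUDA.
    flags = cuda.ctx_flags.SCHED_BLOCKING_SYNC

    # make_context also makes the new context current on the calling thread,
    # so each worker thread would call this helper itself.
    return cuda.Device(device_index).make_context(flags=flags)


# Hypothetical usage inside one worker thread:
#   ctx = make_blocking_context()
#   try:
#       ...  # enqueue the asynchronous inference, then stream.synchronize()
#   finally:
#       ctx.pop()
```

The flag only changes how the host thread waits in calls such as stream.synchronize(); if the waiting happens somewhere else (for example in a polling loop in the application), the CPU usage would not change.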

Hi @mfoglio,

This looks like a CUDA issue, so we recommend raising it in the relevant forum.

Thank you.