High CPU usage doing CUDA, cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) ineffective

I’m using JetPack 6.0 on a Jetson Orin Nano.

I have an application with a thread dedicated to a computer vision task on the GPU, and I noticed that “top” reports very high CPU usage for that thread. The work should run almost entirely on the GPU, so the CPU load for that thread should be close to zero.

I ran the Nsight profiler and found that this thread spends almost all of its time in the poll system call, so poll must be getting called in a tight loop.

Investigating, I found that CUDA synchronization polls by default, and the canonical solution is to call cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) before doing anything else CUDA-related. Unfortunately, doing this has had no effect.
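For reference, here is a minimal sketch of the setup described above. It is not the actual application: spinKernel, the buffer size, and the iteration count are hypothetical stand-ins for the vision workload. It sets the flag before any other runtime call, checks the return value, and reads the flags back with cudaGetDeviceFlags, so it should at least show whether the request was accepted. One assumption worth checking in the real application is whether another thread or library initializes CUDA before the flag is set.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a failed runtime call and bail out of main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real vision workload.
__global__ void spinKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main() {
    // Request blocking synchronization before any other runtime call.
    CHECK(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync));

    const int n = 1 << 20;
    float *d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(float)));  // the device is initialized by now
    CHECK(cudaMemset(d, 0, n * sizeof(float)));

    // Read the flags back to see which scheduling mode is recorded.
    unsigned int flags = 0;
    CHECK(cudaGetDeviceFlags(&flags));
    const bool blocking =
        (flags & cudaDeviceScheduleMask) == cudaDeviceScheduleBlockingSync;
    std::printf("blocking sync flag set: %s\n", blocking ? "yes" : "no");

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Launch repeatedly so the wait is long enough to observe in top.
    for (int rep = 0; rep < 50; ++rep) {
        spinKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 20000);
        CHECK(cudaGetLastError());
        CHECK(cudaStreamSynchronize(stream));  // CPU usage here is what matters
    }

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    return 0;
}
```

Built with nvcc and watched in top, this should make it possible to tell whether the high CPU usage reproduces outside the full application.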

Googling this, I’ve seen complaints about this not working on ARM systems. Has anyone figured out how to force blocking on CUDA calls (particularly cudaStreamSynchronize)?
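One alternative that may be worth trying is to wait on an event created with the cudaEventBlockingSync flag instead of calling cudaStreamSynchronize directly. The sketch below is untested on this Jetson and makes no claim about the resulting CPU load; dummyKernel and the sizes are hypothetical placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a failed runtime call and bail out of main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical placeholder for the real GPU work.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaMemset(d, 0, n * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Event that asks the host thread to block, not spin, while waiting.
    // Timing is disabled because the event is only used for synchronization.
    cudaEvent_t done;
    CHECK(cudaEventCreateWithFlags(&done,
                                   cudaEventBlockingSync | cudaEventDisableTiming));

    dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    CHECK(cudaGetLastError());

    // Mark the point in the stream to wait for, then wait on the event
    // in place of cudaStreamSynchronize(stream).
    CHECK(cudaEventRecord(done, stream));
    CHECK(cudaEventSynchronize(done));

    std::printf("work on stream completed\n");

    CHECK(cudaEventDestroy(done));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    return 0;
}
```

The event records a point in the stream, so waiting on it covers the same work that cudaStreamSynchronize would have waited for up to that point.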

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

cudaDeviceScheduleBlockingSync can work on Jetson.
Are you able to share the source code so we can check it further?

Thanks.