100% CPU Usage - Linux

Hi there,

I created a kernel that takes 10 ms to execute in order to see how the CPU behaves while a kernel is running on the GPU.
Thanks to some topics on this forum, I understood that by default the CPU spins while waiting for the GPU work to finish, so it uses 100% of the CPU.
I learned that when you call cudaSetDeviceFlags with the cudaDeviceScheduleBlockingSync flag, the waiting thread blocks instead, so the CPU can be used for something else. That means CPU usage should be lower than in spinning mode.

But when I try this, CPU usage does not change at all; I still see 100%.

Someone guessed in the link below that those flags might not be implemented on Linux.

The question is: is this feature still not implemented in the Linux API? Or what am I doing wrong?

Information about my test:

  • Nvidia GTX 950
  • RHEL 6.0
  • CUDA 7.5
  • cudaSetDeviceFlags is called before any context is created

It’s definitely implemented. If I put


at the start of a compute-heavy CUDA program, CPU usage drops from 100% to less than 1% (as reported by top).
This is on CUDA 8.0, driver 367.44 on openSUSE tumbleweed.

Thank you for your reply.

I have CUDA 7.5 and use driver 352.39.
Do you think it doesn’t work because of my setup? Maybe there is a bug in CUDA 7.5 or in my driver, so that the flag’s behavior isn’t implemented.

When I call cudaGetDeviceFlags, it returns 8.
After calling cudaSetDeviceFlags with cudaDeviceScheduleBlockingSync, it returns 12.
As far as I can see, the flag is set correctly, but CPU usage is no lower during my kernel execution.
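
For anyone puzzling over the 8 and the 12: below is a small host-side sketch that decodes such a value into flag names. The constants are copied by hand from the CUDA runtime header (driver_types.h) so the snippet compiles without the toolkit; the names kScheduleSpin etc. and the helper describeDeviceFlags are made up for illustration and are not part of the CUDA API.

```cpp
#include <string>

// Values mirrored from CUDA's driver_types.h (copied by hand; the real
// names are cudaDeviceScheduleSpin, cudaDeviceScheduleYield, and so on).
constexpr unsigned int kScheduleSpin         = 0x01;
constexpr unsigned int kScheduleYield        = 0x02;
constexpr unsigned int kScheduleBlockingSync = 0x04;
constexpr unsigned int kScheduleMask         = 0x07;
constexpr unsigned int kMapHost              = 0x08;
constexpr unsigned int kLmemResizeToMax      = 0x10;

// Hypothetical helper: turn a value returned by cudaGetDeviceFlags
// into a human-readable list of flag names.
std::string describeDeviceFlags(unsigned int flags) {
    std::string out;
    // The low three bits select the scheduling mode.
    switch (flags & kScheduleMask) {
        case 0:                     out = "cudaDeviceScheduleAuto"; break;
        case kScheduleSpin:         out = "cudaDeviceScheduleSpin"; break;
        case kScheduleYield:        out = "cudaDeviceScheduleYield"; break;
        case kScheduleBlockingSync: out = "cudaDeviceScheduleBlockingSync"; break;
        default:                    out = "unknown schedule bits"; break;
    }
    // The remaining bits are independent options.
    if (flags & kMapHost)         out += " | cudaDeviceMapHost";
    if (flags & kLmemResizeToMax) out += " | cudaDeviceLmemResizeToMax";
    return out;
}
```

With these values, 8 decodes to automatic scheduling plus cudaDeviceMapHost, and 12 adds the blocking-sync bit (0x04) on top, which matches the observation that the flag itself was stored.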

I remember having had similar frustration while working on a background application for crypto mining (cudaminer).

What I ended up doing was to insert sleep commands into my code that would put the thread to sleep for about 95% of the kernel execution time. Some kind of feedback control loop was used to adjust the sleep time to match this target.

I ended up getting very low CPU usage and no notable performance hit.

Of course this strategy only works for kernels with a predictable execution time per iteration.
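
The sleep-then-poll idea can be sketched roughly as follows. This is not the cudaminer code; it is a minimal illustration that assumes a simple proportional update, and the names nextSleepMicros, sleepThenPoll, targetFraction and gain are invented for the example. The isDone callback stands in for whatever completion check the application uses (for instance a wrapper around cudaEventQuery).

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Move the sleep time part of the way toward targetFraction of the most
// recently measured kernel time (0.95 mirrors the "about 95%" above).
// gain controls how quickly the estimate follows new measurements.
double nextSleepMicros(double currentSleepUs, double measuredKernelUs,
                       double targetFraction = 0.95, double gain = 0.5) {
    const double targetUs = targetFraction * measuredKernelUs;
    const double updated = currentSleepUs + gain * (targetUs - currentSleepUs);
    return std::max(0.0, updated);  // never return a negative sleep time
}

// Sleep for most of the expected kernel time, then poll until the work
// is reported as finished, yielding the CPU between checks.
template <typename IsDone>
void sleepThenPoll(double sleepUs, IsDone isDone) {
    std::this_thread::sleep_for(
        std::chrono::duration<double, std::micro>(sleepUs));
    while (!isDone()) {
        std::this_thread::yield();
    }
}
```

Per iteration, the host would launch the kernel, call sleepThenPoll with the current estimate, measure how long the kernel actually took, and feed that measurement into nextSleepMicros for the next round.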


I found you have to call


before calling cudaSetDeviceFlags(…), otherwise it has no effect whatsoever.

See the documentation (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#ixzz56qIPnXa2):
“If no device has been made current to the calling thread, then flags will be applied to the initialization of any device initialized by the calling host thread, unless that device has had its initialization flags set explicitly by this or any host thread.”

Which essentially means “If no device has been made current to the calling thread [before calling cudaSetDeviceFlags()] undefined behavior results in case of a multi-threaded application”.
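
A minimal sketch of that call order, assuming the device is made current with cudaSetDevice and that device 0 is the GPU in question (both are assumptions for illustration). It needs the CUDA toolkit to build and a GPU to run; the check helper is made up for the example, and whether this lowers CPU usage on a given driver still has to be verified with top.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    const int device = 0;  // assumption: the first GPU is the one in use

    // Make the device current first, then request blocking synchronization,
    // all before any call that would initialize the device.
    check(cudaSetDevice(device), "cudaSetDevice");
    check(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync),
          "cudaSetDeviceFlags");

    // Read the flags back to confirm that the request was stored.
    unsigned int flags = 0;
    check(cudaGetDeviceFlags(&flags), "cudaGetDeviceFlags");
    std::printf("device flags: 0x%x\n", flags);

    // ... launch kernels here ...

    // With blocking sync requested, this wait should block the thread
    // instead of spinning.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    return 0;
}
```

Checking every return code matters here: on some CUDA versions cudaSetDeviceFlags reports an error if the device is already active, and ignoring that error makes it look as if the flag simply has no effect.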