I’ve noticed that host CPU usage sits at 100% for the whole duration of any long-running, compute-bound kernel. Is there a way to eliminate this wasteful spinning?
I’ve tried the following:
- calling cudaSetDeviceFlags with the cudaDeviceScheduleBlockingSync flag
- calling cudaSetDeviceFlags with the cudaDeviceScheduleYield flag
- adding a #pragma acc wait directive
- calling acc_async_wait_all
Nothing worked.
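To make clear what kind of behaviour I’m after, here is a rough sketch of a fallback I could imagine: launch the work asynchronously and have the host sleep between completion checks instead of spinning. This is only an illustration under assumptions, not something I have verified with PGI. The names done_fn, wait_sleeping and fake_done are hypothetical, and fake_done is a stand-in for a real completion query (in an OpenACC program that would presumably be a small wrapper around acc_async_test_all() from openacc.h).

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical predicate type: returns nonzero once the outstanding
 * work has finished. In real code this would wrap a device query. */
typedef int (*done_fn)(void *ctx);

/* Poll is_done and sleep interval_us microseconds between checks,
 * so the host thread is idle rather than spinning.
 * Returns the number of times we slept. */
static long wait_sleeping(done_fn is_done, void *ctx, long interval_us)
{
    struct timespec ts;
    long polls = 0;

    ts.tv_sec = interval_us / 1000000L;
    ts.tv_nsec = (interval_us % 1000000L) * 1000L;

    while (!is_done(ctx)) {
        nanosleep(&ts, NULL);   /* give the CPU back to the OS */
        ++polls;
    }
    return polls;
}

/* Stand-in for a device query (assumption, for illustration only):
 * reports "not done" until the counter pointed to by ctx reaches zero. */
static int fake_done(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0)
        return 1;
    --*remaining;
    return 0;
}
```

The polling interval trades latency for CPU time: a longer sleep keeps the host idle but delays noticing that the kernel has finished. I would still prefer a proper blocking wait from the runtime if one exists.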
I’m using PGI 16.10 on a Tesla K20m.
Any advice appreciated, -Ondrej