How to reduce CPU usage while waiting for kernels to finish in OpenCL on an NVIDIA GPU?

evantandersen · January 16, 2020, 12:23am

I’m working on a cryptocurrency mining implementation in OpenCL and having trouble getting it to play nicely with the NVIDIA OpenCL driver. The problem is that the NVIDIA driver busy-waits (spins) while waiting for kernels to finish. When you have 6-20 GPUs hanging off a single dual-core CPU, this causes the entire system to grind to a halt.

I’ve tried several solutions, but they all seem to break on Windows while working nicely on Linux:

a) Time the kernels and sleep the thread for ~95% of the time they take to execute, reducing the busy-wait to 5% of total kernel time. This fails because putting the thread to sleep immediately after enqueuing the kernel means the kernel does not start executing until after the sleep call returns. I then tried adding a clFlush() call to ensure the kernel would start, but the clFlush() call seems to block until the kernel completes, defeating the purpose.

b) Use a non-blocking clEnqueueReadBuffer() call, then poll the event status in a loop, sleeping between polls. This fails because clEnqueueReadBuffer() seems to completely ignore the blocking_read flag, always waiting for the kernel to finish before returning.

Ideas?

Robert_Crovella · January 16, 2020, 1:20am

clEnqueueNDRangeKernel allows you to specify an event that will indicate completion of the kernel being enqueued. Have you tried polling that event?
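A minimal sketch of what polling with sleeps could look like, written in Python with a stand-in for the GPU so it runs anywhere. The function `poll_until_complete` and its parameters are hypothetical names, not part of OpenCL. In real host code the `is_complete` callable would presumably wrap clGetEventInfo() with CL_EVENT_COMMAND_EXECUTION_STATUS and compare the result to CL_COMPLETE, and the queue would generally need a clFlush() first so the command actually gets submitted.

```python
import threading
import time


def poll_until_complete(is_complete, poll_interval=0.001, timeout=5.0):
    """Check a completion flag, sleeping between checks instead of spinning.

    is_complete is a hypothetical zero-argument callable that returns True
    once the kernel's event reports completion.
    Returns the number of polls made; raises TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        polls += 1
        if is_complete():
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError("kernel event did not complete in time")
        time.sleep(poll_interval)  # give the CPU back between checks


# Stand-in for a GPU kernel: a timer sets the flag after about 50 ms.
done = threading.Event()
threading.Timer(0.05, done.set).start()

polls = poll_until_complete(done.is_set)
print(f"completed after {polls} polls")
```

The poll interval trades latency for CPU time: a longer interval burns fewer cycles but notices completion later.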

evantandersen · January 16, 2020, 1:38pm

I had not tried that. Unfortunately, it doesn’t seem to help. The status stays at “Queued” until clEnqueueReadBuffer() is called, which then waits for the kernel to complete (with or without the blocking flag set). After the clEnqueueReadBuffer() call, the kernel has status “Complete” (which is true, but not particularly useful).

evantandersen · January 16, 2020, 3:24pm

OK, I have solved the problem. The OpenCL bindings I was using had a bug where both flushing and finishing a queue called clFinish() under the hood! I fixed that, then implemented method a), and it’s working well!
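For readers who want to try method a), here is a rough sketch of the idea in Python, not the poster's actual code. `KernelWaiter` and `FakeQueue` are hypothetical names; `FakeQueue` only simulates a kernel so the example runs without a GPU. In real code, `flush` and `finish` would be the bindings' wrappers around clFlush() and clFinish() (or a blocking read). Updating the time estimate with a moving average is an assumption of this sketch.

```python
import time


class KernelWaiter:
    """Sleep through most of a kernel's expected runtime, then block."""

    def __init__(self, initial_estimate, sleep_fraction=0.95, smoothing=0.2):
        self.estimate = initial_estimate      # expected kernel time, seconds
        self.sleep_fraction = sleep_fraction  # share of the estimate to sleep
        self.smoothing = smoothing            # weight given to new timings

    def wait(self, flush, finish):
        start = time.monotonic()
        flush()   # submit the work first, or the kernel may not start
        time.sleep(self.estimate * self.sleep_fraction)
        finish()  # the driver may still spin here, but only briefly
        elapsed = time.monotonic() - start
        # Assumption: track kernel time with an exponential moving average.
        self.estimate += self.smoothing * (elapsed - self.estimate)
        return elapsed


class FakeQueue:
    """Stand-in for a command queue; the 'kernel' starts on flush()."""

    def __init__(self, kernel_seconds):
        self.kernel_seconds = kernel_seconds
        self.started = None
        self.calls = []

    def flush(self):
        self.started = time.monotonic()
        self.calls.append("flush")

    def finish(self):
        remaining = self.started + self.kernel_seconds - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self.calls.append("finish")


queue = FakeQueue(kernel_seconds=0.05)
waiter = KernelWaiter(initial_estimate=0.05)
elapsed = waiter.wait(queue.flush, queue.finish)
print(f"waited {elapsed * 1000:.1f} ms, new estimate {waiter.estimate * 1000:.1f} ms")
```

The measured time here includes any oversleep, so the estimate only drifts back down slowly if it starts too high. Event profiling via clGetEventProfilingInfo() would likely give a more accurate kernel time.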

petriraa · November 15, 2023, 10:23pm

I used to have a Linux .so library that replaced sched_yield() with nanosleep(0). That helped a lot with the NVIDIA OpenCL libraries.

sched_yield() lets processes with lower priority run. nanosleep() lets any process that needs to run!

Petri33
