Dear all,
I have successfully implemented a set of image processing algorithms that are meant to run in real time, on a high-end test platform with a GTX 680.
But the targeted application is meant to offload image processing from the CPU (an Intel Core 2 on the current platform) to the GPU, so that the CPU can lower its power consumption or perform other tasks in parallel.
Unfortunately, when porting my executable to a lighter target configuration, it turned out that the two application threads pushing compute tasks to their own GPU streams had very high CPU utilization.
Using the NVIDIA Visual Profiler, I noticed that on low-end GPUs (GT 610, GT 240, etc.) the CPU threads spent most of their time inside cudaMemcpyAsync, waiting for the GPU to finish its compute tasks.
Using the Intel VTune Amplifier profiler, the following call hierarchy appeared:
MyApplication::CopyToHost
->cudaMemcpyAsync
->->libcudart.so.5.5.11
->->->clock_gettime
The libcudart.so.5.5.11 and clock_gettime calls consume almost all of the application’s CPU time, which made me suspect polling inside the API call, repeatedly querying something from the NVIDIA driver.
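To make the pattern concrete, this is roughly what each worker thread does per frame (a simplified sketch; the kernel and identifiers are illustrative, not my actual code, and the host buffer is assumed to be allocated with cudaMallocHost):

#include <cuda_runtime.h>

// Placeholder for the real image processing kernel (illustrative only).
__global__ void myImageKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Rough shape of the per-frame path in each worker thread. According to the
// profile, nearly all CPU time is spent inside cudaMemcpyAsync
// (libcudart -> clock_gettime), i.e. the runtime appears to spin-wait there.
void ProcessFrame(const float* devIn, float* devOut, float* hostOut,
                  int n, cudaStream_t stream)
{
    myImageKernel<<<(n + 255) / 256, 256, 0, stream>>>(devIn, devOut, n);
    // This corresponds to MyApplication::CopyToHost in the call hierarchy above.
    cudaMemcpyAsync(hostOut, devOut, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
}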
So, naturally, I tried to use cudaSetDeviceFlags with the different flags, especially cudaDeviceScheduleYield and cudaDeviceScheduleBlockingSync, which should have solved all my problems:
“cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.”
“cudaDeviceScheduleBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.”
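For reference, this is a minimal sketch of how I set the flag in the initializing thread (error handling omitted; as far as I understand the documentation, the flag has to be recorded before the runtime creates the context on the device):

#include <cuda_runtime.h>

void InitDevice(int deviceId)
{
    // Record the scheduling flag before the context is created on the device.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);  // also tried cudaDeviceScheduleYield
    cudaSetDevice(deviceId);
    cudaFree(0);  // first real runtime call: forces context creation with the flags above
}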
I tested all flags, and each time I got the same CPU utilization.
My question now is: am I using cudaSetDeviceFlags properly?
Currently, I have a pool of two threads, and the first thread to be ready initializes the device, creates the streams, allocates the buffers, and stores them as contexts: structures containing the cudaStreams and cudaBuffers.
The pool of contexts can be accessed through a thread-safe context distributor.
I don’t know whether the second thread, which does not set the cudaDevice itself, executes with the same behaviour specified by the device flags.
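To make the setup concrete, here is a simplified paraphrase of the initialization and the distributor (the Context structure and the class names are illustrative stand-ins for my actual code):

#include <cuda_runtime.h>
#include <mutex>
#include <vector>

// Illustrative stand-in for my per-stream context: one stream plus its buffers.
struct Context {
    cudaStream_t stream;
    float*       devBuffer;
    float*       hostBuffer;  // pinned, so the copies can be truly asynchronous
};

class ContextDistributor {
public:
    // Only the first thread to arrive initializes the device, sets the flags
    // and builds the contexts; later threads just reuse the result.
    void InitOnce(int deviceId, size_t bytes, int nContexts) {
        std::call_once(initFlag_, [&] {
            cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
            cudaSetDevice(deviceId);
            for (int i = 0; i < nContexts; ++i) {
                Context c{};
                cudaStreamCreate(&c.stream);
                cudaMalloc(reinterpret_cast<void**>(&c.devBuffer), bytes);
                cudaMallocHost(reinterpret_cast<void**>(&c.hostBuffer), bytes);
                contexts_.push_back(c);
            }
        });
    }

    // Thread-safe hand-out of a pre-built context; the second thread calls this
    // without ever touching cudaSetDeviceFlags or cudaSetDevice itself.
    Context Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        Context c = contexts_.back();
        contexts_.pop_back();
        return c;
    }

private:
    std::once_flag       initFlag_;
    std::mutex           mutex_;
    std::vector<Context> contexts_;
};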
The other possibility would be that the handling of these flags is not implemented in the CUDA library.
Has anyone here experienced an impact of these device flags on the behaviour of CPU threads performing asynchronous kernel launches or asynchronous cudaMemcpy calls?
Thank you in advance