cudaSetDeviceFlags and cudaDeviceScheduleYield in embedded envirronment

Dear all,

I have successfully implemented a set of image processing algorithm that aim to be performed in real time, on a high end test plateform, with a gtx680.

But the targeted application aimed to take out the burden of image processing from the CPU (intel core2 on a current platform) to the GPU in order the CPU to be able to lower its consumption, or perform other tasks in parrallel.

Unfortunately, when porting my executable on a light target configuration, it appeared that the two application thread that were pushing computing tasks to their own GPU streams had a very high CPU utilisation.

Using Nvidia visual Profiler, I noticed that, while using low end GPU (gt610, gt240, etc…) the cpu threads were most of the time inside the cudaMemcpyAsync, waiting for the GPU to finish its computing tasks.
Using Intel profiler Vtune Amplifier, the following call hierarchy appeared:

->->->clock_gettime and clock_gettime call consume almost all the CPU time of the application, this made me think of polling inside the API calls, asking for something from nvidia driver.

So, naturally, I tried to use cudaSetDeviceFlags with the different flag, and especially cudaDeviceScheduleYield and cudaDeviceScheduleBlockingSync, that should have solved all my problems:

“cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.”

“cudaDeviceScheduleBlockingSync: cudaDeviceScheduleBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.”

I tested all flags, and each time I got the same CPU utilization.

My question now, is, am I using cudaSetDeviceFlags properly ?
Currently, I have a pool of two threads, and the first thread to be ready initialize the device, create streams, allocate buffers, and stores them as context: structure containing cudaStreams and cudaBuffers.
The pool of context could be accessed through threadsafe context distributor.

I don’t know if the second thread, that do not set the cudaDevice itself, executes with the same behaviour, specified in the DeviceFlags.

The other alternative would be that the management of these flag is not implemented in cuda library.

Did someone here have experienced the impact of the cudaFlags on the behaviour of cpu threads executing Asynchronous kernel launching, or asynchronous cudaMemCpy ?

Thank you in advance

Sorry this post belong to Cuda Programming and Performance