cudaSetDeviceFlags and cudaDeviceScheduleYield in embedded envirronment

Dear all,

I have successfully implemented a set of image processing algorithm that aim to be performed in real time, on a high end test plateform, with a gtx680.

But the targeted application aimed to take out the burden of image processing from the CPU (intel core2 on a current platform) to the GPU in order the CPU to be able to lower its consumption, or perform other tasks in parrallel.

Unfortunately, when porting my executable on a light target configuration, it appeared that the two application thread that were pushing computing tasks to their own GPU streams had a very high CPU utilisation.

Using Nvidia visual Profiler, I noticed that, while using low end GPU (gt610, gt240, etc…) the cpu threads were most of the time inside the cudaMemcpyAsync, waiting for the GPU to finish its computing tasks.
Using Intel profiler Vtune Amplifier, the following call hierarchy appeared:

MyApplication::CopyToHost
->cudaMemcpyAsync
->->libcudart.so.5.5.11
->->->clock_gettime

libcudart.so.5.5.11 and clock_gettime call consume almost all the CPU time of the application, this made me think of polling inside the API calls, asking for something from nvidia driver.

So, naturally, I tried to use cudaSetDeviceFlags with the different flag, and especially cudaDeviceScheduleYield and cudaDeviceScheduleBlockingSync, that should have solved all my problems:

“cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.”

“cudaDeviceScheduleBlockingSync: cudaDeviceScheduleBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.”

I tested all flags, and each time I got the same CPU utilization.

My question now, is, am I using cudaSetDeviceFlags properly ?
Currently, I have a pool of two threads, and the first thread to be ready initialize the device, create streams, allocate buffers, and stores them as context: structure containing cudaStreams and cudaBuffers.
The pool of context could be accessed through threadsafe context distributor.

I don’t know if the second thread, that do not set the cudaDevice itself, executes with the same behaviour, specified in the DeviceFlags.

The other alternative would be that the management of these flag is not implemented in cuda library.

Did someone here have experienced the impact of the cudaFlags on the behaviour of cpu threads executing Asynchronous kernel launching, or asynchronous cudaMemCpy ?

Thank you in advance

PS: Here is a dummy code that exhibit the problem: just compile it with various cudaflags and check with your OS monitor the CPU load:

#include <stdio.h>
#include <stdlib.h>

#include <cuda_runtime_api.h>

void __checkCudaErrors( cudaError err, const char *file, const int line )
{
if( err != cudaSuccess )
{
printf("%s(%ld) : CUDA Runtime API error %ld : %s \n",file ,line, (int)err, cudaGetErrorString(err));
}
}

global void kernelDummy(float* buf, int j)
{
int addr = (blockIdx.yblockDim.y+threadIdx.y)(blockDim.xgridDim.x)+(blockIdx.xblockDim.x+threadIdx.x);
buf[addr]=((float)addr)*cos((float)addr)*pow(buf[addr%(addr/2)],3);
}

void __checkCudaErrors( cudaError err, const char *file, const int line );
#define checkCudaErrors(err) __checkCudaErrors (err, FILE, LINE)

int main(int argc, char argv)
{
float
d_buf;
float
h_buf;

long bufSize = 1024*1024*1024;


//checkCudaErrors(cudaSetDeviceFlags(cudaDeviceScheduleAuto));  // GB to change if necessary
//checkCudaErrors(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync));//Lower CPU Utilization
checkCudaErrors(cudaSetDeviceFlags(cudaDeviceScheduleYield));//Lower CPU utilization
//checkCudaErrors(cudaSetDeviceFlags(cudaDeviceScheduleSpin));//High CPU Utilization

checkCudaErrors(cudaSetDevice( 0 ));

cudaStream_t myStream;

// creation du stream
checkCudaErrors(cudaStreamCreate(&myStream ));

checkCudaErrors(cudaMallocHost((void**)&h_buf,bufSize));
checkCudaErrors(cudaMalloc((void**)&d_buf,bufSize));
dim3 threads(16,16);
dim3 blocks(1024,1024);

int j = 0;
for(int i=0; i<20;i++)
{
	checkCudaErrors(cudaMemcpyAsync(d_buf,h_buf,bufSize,cudaMemcpyHostToDevice,myStream));
	kernelDummy<<<blocks,threads,0,myStream>>>(d_buf,j);
	checkCudaErrors(cudaMemcpyAsync(h_buf,d_buf,bufSize,cudaMemcpyDeviceToHost,myStream));
	checkCudaErrors(cudaStreamSynchronize(myStream));
}


checkCudaErrors(cudaFreeHost(h_buf));
checkCudaErrors(cudaFree(d_buf));
checkCudaErrors(cudaStreamDestroy(myStream));



return 0;

}

It appeared that under windows, the cudaDeviceScheduleBlockingSync flag does what it intends to do.
There is a real difference between:

-the cudaDeviceScheduleSpin that result in a 100% cpu utilization
-the cudaDeviceScheduleBlockingSync that result in a few % cpu utilization

But I suspect that there is no implementation of these flags in the linux version of the API, although this is not mentionned in the documentation.

I would really appreciate NVidia developers comment on this problem.

Should I fill a bug report, although it is not causing any error, and it is only about linux version of the API ?

I would suggest filing a bug, with a self-contained repro case attached. From your description it seems like you would want to particularly point out the difference between the behavior on Windows vs Linux. You can find the bug reporting form linked from the registered developer website. Thank you for your help.