OpenCL Bottleneck

Hi Everyone,

I am working with the OpenCL SDK and have a question about a bottleneck in my code. The host program runs a loop that enqueues a set of kernels, which must operate in order on a matrix of data. At each iteration, the host needs to check one value in this matrix to decide whether to keep enqueueing kernels. I read this value on the host inside the loop with a blocking read:

clEnqueueReadBuffer(command_queue, mem_obj, CL_TRUE, 0, sizeof(double), &value, 0, NULL, NULL);

Is there a faster way (and what is the absolute fastest way) to transfer a single double value from the GPU to the host? This line accounts for 80% of the loop's run time.

Thanks for your help.