memcpyDtoH in a while loop: idle time problem

Hi everyone,

My question concerns the speed of memory transfers between the device and the host.

In my program I have a while loop over a variable N. The while loop calls several kernels, and the last of them computes the new value N_NEW, which is a device variable.

The value of N is then updated from N_NEW via the cudaMemcpyFromSymbol function.





	Kernel_2...; // Compute N_NEW
	cudaMemcpyFromSymbol(&N, N_NEW, sizeof(uint), 0, cudaMemcpyDeviceToHost); // pass the symbol itself; string names are rejected since CUDA 5.0


Using the Compute Visual Profiler, I noticed that the GPU time of the memory transfer is around 1.7 us, but the CPU time is 1256 us. The idle time between loop iterations is significant, and it slows my algorithm down considerably.

Does anyone know how to handle this kind of situation?

Thank you

I recommend transferring N only every 10 or 100 iterations.

The CPU time is probably so large because the CPU has to wait for all asynchronously launched kernels to finish.
You can have the CPU do some other work before calling cudaMemcpyFromSymbol(), but you cannot completely avoid waiting for the GPU if you need the results of the kernel.
If your kernels are very short and you have many iterations, pasoleatis’ suggestion may reduce the overhead.
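To illustrate the point about the CPU waiting, here is a minimal sketch that separates the time spent waiting for the kernel from the time spent in the copy itself. The kernel `slowKernel`, the helper `timedReadback`, and the spin length are hypothetical stand-ins, not the code from the question; only `N_NEW` is taken from it.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int N_NEW;   // device-side counter, as in the question

// Hypothetical stand-in for the last kernel of the loop:
// spins for a while, then writes the new value of N_NEW.
__global__ void slowKernel(unsigned int value, long long spinCycles)
{
    long long start = clock64();
    while (clock64() - start < spinCycles) { }
    if (blockIdx.x == 0 && threadIdx.x == 0) N_NEW = value;
}

static double elapsedMs(std::chrono::steady_clock::time_point t0)
{
    auto now = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(now - t0).count();
}

// Launches the kernel, reads N_NEW back, and prints how long the host
// spent in each step. Returns the value that was read back.
unsigned int timedReadback(bool syncFirst)
{
    unsigned int N = 0;
    slowKernel<<<1, 32>>>(42u, 50000000LL);  // returns at once: launches are asynchronous

    auto t0 = std::chrono::steady_clock::now();
    if (syncFirst) {
        cudaDeviceSynchronize();             // the wait for the kernel happens here
        std::printf("wait for kernel: %8.3f ms\n", elapsedMs(t0));
        t0 = std::chrono::steady_clock::now();
    }
    // Without the explicit sync, this call absorbs the wait for the kernel.
    cudaMemcpyFromSymbol(&N, N_NEW, sizeof(N), 0, cudaMemcpyDeviceToHost);
    std::printf("copy call:       %8.3f ms\n", elapsedMs(t0));
    return N;
}
```

Calling `timedReadback(false)` and then `timedReadback(true)` from `main` should show the long wait moving from the copy call to the explicit synchronization, which suggests the transfer of the integer is not the expensive part.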


I have one other suggestion. Define a global device variable, something like __device__ unsigned N; (it could also be a boolean).
Then put a condition in the kernels: if (N > 0), do the work and update N. Now you can run the loop for some number of iterations, Nsteps, without checking. Every Nsteps iterations you copy the value to the host to check whether the program should leave the while loop. If the stop condition was met in between, you only lose the time for launching the kernel fewer than Nsteps extra times. By tuning Nsteps you can make the code faster, because N is not copied to the host so often.
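A minimal sketch of this batching idea, assuming a single hypothetical kernel `stepKernel` that decrements a device counter `d_N` by one per launch (the real kernels and the real update rule are of course different):

```cuda
#include <cuda_runtime.h>

__device__ unsigned int d_N;   // remaining work, kept on the device

// Hypothetical stand-in for one loop iteration.
// Meant to be launched with a single block in this sketch.
__global__ void stepKernel()
{
    unsigned int n = d_N;
    if (n == 0) return;          // job already finished: the launch is a cheap no-op
    // ... the real work of one iteration would go here ...
    __syncthreads();             // every thread of the block has read d_N
    if (threadIdx.x == 0) d_N = n - 1;
}

// Runs until the device reports d_N == 0, copying it back only once per
// batch of checkEvery launches. Returns the number of launches issued
// (including wasted no-op launches), or -1 if a CUDA call fails.
int runLoop(unsigned int initialN, int checkEvery)
{
    if (cudaMemcpyToSymbol(d_N, &initialN, sizeof(initialN), 0,
                           cudaMemcpyHostToDevice) != cudaSuccess) return -1;
    unsigned int N = initialN;
    int launches = 0;
    while (N > 0) {
        for (int i = 0; i < checkEvery; ++i) {
            stepKernel<<<1, 64>>>();   // queued back to back, no host wait in between
            ++launches;
        }
        // One blocking copy per batch instead of one per iteration.
        if (cudaMemcpyFromSymbol(&N, d_N, sizeof(N), 0,
                                 cudaMemcpyDeviceToHost) != cudaSuccess) return -1;
    }
    return launches;
}
```

With `checkEvery` set to 1 this degenerates to the original one-copy-per-iteration loop; larger values trade a few wasted launches at the end for fewer host stalls.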


Thanks a lot for all your answers. In fact I should have been more specific: I’m working with an image divided into blocks of 16x16 pixels. Only a small portion of the image is processed during one loop iteration. There is an array that contains the addresses of the specific blocks to be processed, and N is the number of those blocks. On every iteration, the addresses and N are updated. Therefore, N is also used to configure each kernel call. I fear I won’t be able to apply any of the above suggestions.

I thought that processing the image that way was more efficient, but the drawback is the time taken to copy one integer to the CPU. Is there any way to create a while loop that is controlled by the GPU (outside a kernel) and not by the CPU?

Also, a typical number of iterations is below 20.

Finally, I will try to see if I can apply one of the above suggestions. N should normally decrease on each iteration, but that is not always the case. I dunno how the algorithm will behave in such a situation.

If you have an image, is there a maximum number of blocks that can be processed? For example, if your image has 100 blocks, would that be the maximum number of iterations of the while loop? If the answer is yes, you can still use a global variable or pointer N that is updated every iteration; once the job is done it would be zero, and the remaining launches would just be kernels doing nothing.

You can also put the loop inside the kernel, but then you need synchronization across the whole grid. Note that __syncthreads() only synchronizes the threads of one block and __threadfence() only orders memory writes, so on their own they do not give you a grid-wide barrier.
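One way to get such a grid-wide barrier on newer hardware and toolkits is cooperative groups, which is a different mechanism from the two functions named above. The sketch below is only an illustration under assumptions: the kernel `persistentLoop`, the helper `runOnDevice`, the `state` layout, and the halving update are all hypothetical, and the device must support cooperative launches. Depending on the toolkit version, grid-wide sync may also need relocatable device code (`-rdc=true`).

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// state[0]: number of active blocks (the role N plays in the thread)
// state[1]: receives the number of iterations executed
__global__ void persistentLoop(unsigned int *state)
{
    cg::grid_group grid = cg::this_grid();
    volatile unsigned int *count = state;   // force a fresh read each iteration
    unsigned int iterations = 0;
    while (true) {
        unsigned int n = count[0];          // every thread reads the current value
        grid.sync();                        // all reads finish before anyone writes
        if (n == 0) break;                  // same decision in every thread
        // ... process the n active blocks here, e.g. with a block-stride loop ...
        if (grid.thread_rank() == 0) count[0] = n / 2;   // stand-in for the real update
        ++iterations;
        grid.sync();                        // update is visible before the next read
    }
    if (grid.thread_rank() == 0) state[1] = iterations;
}

// Runs the whole loop in one launch. Returns the iteration count,
// or -1 if cooperative launch is unsupported or a CUDA call fails.
int runOnDevice(unsigned int initialCount)
{
    int dev = 0, supported = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return -1;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    if (!supported) return -1;

    unsigned int host[2] = { initialCount, 0 };
    unsigned int *d_state = nullptr;
    if (cudaMalloc(&d_state, sizeof(host)) != cudaSuccess) return -1;
    cudaError_t err = cudaMemcpy(d_state, host, sizeof(host), cudaMemcpyHostToDevice);

    void *args[] = { &d_state };
    // The grid is kept small: all blocks must be resident at the same time.
    if (err == cudaSuccess)
        err = cudaLaunchCooperativeKernel((void *)persistentLoop,
                                          dim3(2), dim3(64), args, 0, 0);
    if (err == cudaSuccess)
        err = cudaMemcpy(host, d_state, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(d_state);
    return err == cudaSuccess ? (int)host[1] : -1;
}
```

The host then waits once for the whole loop instead of once per iteration. The price is a fixed grid size, so the per-iteration work has to be distributed inside the kernel rather than through the launch configuration.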

Hi pasoleatis,

I did apply your solution. You were right. In fact I had misinterpreted something. In my kernel I put the condition:

if(blockIdx.x > N_NEW) return;

N_NEW is a global variable on the device and is updated on each iteration. I assumed that the maximum number of blocks is defined before entering the loop and that N_NEW will never exceed this value. I just don’t want to run my kernel on all the blocks of the image, since that is what I wanted to avoid in the first place. Some kernels inside the loop are configured with N (a variable on the host), which is now updated only every x iterations:

Kernel1<<<N, ...>>>(...);

And in the case where N_NEW > N, some blocks are skipped but they are processed during a next iteration.
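For reference, a minimal sketch of a guard that does not skip blocks when N_NEW is larger than the launched grid. It assumes N_NEW holds a count of active blocks, so valid indices run from 0 to N_NEW - 1; under that assumption a test of the form `blockIdx.x > N_NEW` would let one extra block through, which is why the loop below uses `<`. The kernel `processActive`, the helper `countProcessed`, and the `processed` array are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

__device__ unsigned int N_NEW;   // number of active image blocks, updated on the device

// Hypothetical worker: marks every active entry as processed.
__global__ void processActive(int *processed)
{
    unsigned int n = N_NEW;
    // Block-stride loop: covers all n entries even when n > gridDim.x,
    // and blocks with blockIdx.x >= n simply fall through.
    for (unsigned int b = blockIdx.x; b < n; b += gridDim.x) {
        if (threadIdx.x == 0) processed[b] = 1;
    }
}

// Sets N_NEW to `active`, launches `gridBlocks` blocks, and returns how many
// of the `capacity` entries were marked, or -1 on bad input or a CUDA error.
int countProcessed(unsigned int active, unsigned int gridBlocks, unsigned int capacity)
{
    if (active > capacity || gridBlocks == 0 || capacity == 0) return -1;
    int *d_processed = nullptr;
    if (cudaMalloc(&d_processed, capacity * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_processed, 0, capacity * sizeof(int));
    cudaMemcpyToSymbol(N_NEW, &active, sizeof(active), 0, cudaMemcpyHostToDevice);

    processActive<<<gridBlocks, 256>>>(d_processed);

    std::vector<int> host(capacity, 0);
    cudaError_t err = cudaMemcpy(host.data(), d_processed, capacity * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_processed);
    if (err != cudaSuccess) return -1;

    int marked = 0;
    for (int v : host) marked += v;
    return marked;
}
```

With this shape the grid size taken from a stale host-side N only affects how the work is spread over blocks, not which entries get processed in a given iteration.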

Right now the memory transfer occurs less frequently. However, it seems that my algorithm has become slower. I’m looking for an explanation. I’ll try to fine-tune how often the transfer is done.

Thanks for your help.

I’m still trying to find a solution to my problem. I’ve attached an image showing the GPU Time Width Plot (from the Compute Visual Profiler) before and after applying the method explained earlier. I have two while loops in my program, but they work on the same principle. You can clearly see that the idle time between loop iterations has been reduced. However, the idle time after the end of the loop is much longer. Do you have any idea what could be the cause?

Thanks again.