cudaMemcpy too slow


Currently I’m working on a CNN-related project; the goal is to implement the YOLO convolutional neural network in real time on the GPU, and I’ve run into a problem. Overall, the computation of all the CNN layers on the GPU runs fast (~15 ms), but I haven’t found a way to copy the final results back to CPU memory quickly. cudaMemcpy takes about 55 seconds(!) even when copying a single float variable:

cudaMemcpy(&var1, var, sizeof(float), cudaMemcpyDeviceToHost);

I have also tried using page-locked host memory with an asynchronous copy:

float* var1;
cudaHostAlloc((void**)&var1, sizeof(float), 0);  // page-locked host memory

cudaStream_t mystream1;
cudaStreamCreate(&mystream1);  // the stream must be created before it is used

cudaMemcpyAsync(var1, var, sizeof(float), cudaMemcpyDeviceToHost, mystream1);

Now it runs fast, but it returns the wrong result. If I add:

cudaStreamSynchronize(mystream1);

the result is correct, but the copy takes about 55 seconds again. I’m new to GPU programming, and it would be great if you could provide some suggestions or examples of how to get results back from the device to the host quickly.

You’re being misled by the timing because of the asynchronous calls you are using. Kernel launches are asynchronous: they return control to the CPU immediately, so your ~15 ms figure measures only the launch overhead, not the actual kernel execution. The cudaMemcpy call doesn’t take 55 seconds itself. Instead, the previously launched asynchronous work takes that long to complete, and cudaMemcpy forces the CPU thread to wait for it to finish, so the copy appears to take all that time.

Focusing on “speeding up the copy operation” is the wrong idea here.
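To see where the time really goes, you can bracket the kernel and the copy separately with CUDA events. Here is a minimal sketch; `my_kernel` and the single-float device buffer are hypothetical stand-ins for your CNN layers and result:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the CNN layer computations.
__global__ void my_kernel(float *out) { out[0] = 42.0f; }

int main() {
    float *var, result;
    cudaMalloc(&var, sizeof(float));

    cudaEvent_t start, after_kernel, after_copy;
    cudaEventCreate(&start);
    cudaEventCreate(&after_kernel);
    cudaEventCreate(&after_copy);

    cudaEventRecord(start);
    my_kernel<<<1, 1>>>(var);          // returns immediately: launches are asynchronous
    cudaEventRecord(after_kernel);

    cudaMemcpy(&result, var, sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(after_copy);
    cudaEventSynchronize(after_copy);  // block until all recorded work has finished

    float kernel_ms, copy_ms;
    cudaEventElapsedTime(&kernel_ms, start, after_kernel);
    cudaEventElapsedTime(&copy_ms, after_kernel, after_copy);
    printf("kernel: %.3f ms, copy: %.3f ms\n", kernel_ms, copy_ms);

    cudaFree(var);
    return 0;
}
```

Timed this way, essentially all of the elapsed time should show up in the kernel interval, while the single-float copy itself takes only microseconds. Events are recorded into the GPU's command stream, so they measure when the device actually reaches each point, unlike a CPU timer wrapped around an asynchronous launch.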