Hello,
I’m currently working on a CNN project. The goal is to run the YOLO convolutional neural network in real time on a GPU, and I have run into a problem. All the CNN layer computations on the GPU run fast (~15 ms), but I have not found a way to copy the final results back to CPU memory quickly. cudaMemcpy takes about 55 seconds, even when copying a single float variable:
cudaMemcpy(var1, var, sizeof(float), cudaMemcpyDeviceToHost);
I have tried to use page-locked memory:
float* var1;
cudaHostAlloc((void**)&var1, sizeof(float), cudaHostAllocDefault);
cudaStream_t mystream1;
cudaStreamCreate(&mystream1);
cudaMemcpyAsync(var1, var, sizeof(float), cudaMemcpyDeviceToHost, mystream1);
Now the copy returns quickly, but var1 holds the wrong result. If I add:
cudaStreamSynchronize(mystream1);
the result is correct, but the copy takes about 55 seconds again. I’m new to GPU programming, so any suggestions or examples of how to get results from the device back to the host quickly would be greatly appreciated.
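
For reference, here is a minimal sketch of how the kernel time and the copy time could be measured separately with CUDA events. It is not my real code: busyKernel and its iteration count are hypothetical placeholders standing in for the CNN layers. The reason for measuring this way is that a kernel launch returns to the host before the GPU has finished, and cudaMemcpy (or cudaStreamSynchronize) waits for earlier work to complete, so a host-side timer around the copy may also include the kernel's run time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the CNN layers: a kernel that just burns time.
__global__ void busyKernel(float* out, int iters) {
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        acc += sinf(acc + (float)i);
    }
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *out = acc;  // single float result, as in the question
    }
}

int main() {
    float* d_var = nullptr;   // device result
    float h_var = 0.0f;       // host copy of the result
    cudaMalloc((void**)&d_var, sizeof(float));

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start);
    busyKernel<<<1, 32>>>(d_var, 1 << 20);  // launch returns immediately
    cudaEventRecord(afterKernel);           // completes when the kernel is done

    // Blocking copy: waits for the kernel above before transferring.
    cudaMemcpy(&h_var, d_var, sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);
    cudaEventSynchronize(afterCopy);        // make sure all events have completed

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    printf("kernel: %.3f ms, copy: %.3f ms, value: %f\n", kernelMs, copyMs, h_var);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(afterKernel);
    cudaEventDestroy(afterCopy);
    cudaFree(d_var);
    return err == cudaSuccess ? 0 : 1;
}
```

If the kernel figure turns out to dominate in a measurement like this, the slow part would be the computation itself and not the device-to-host transfer.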