I encountered something weird today. I am undistorting a 6-megapixel image.
I need to compute two maps of undistortion parameters, each around 24 MByte in size, allocated like this:
CUDA_SAFE_CALL(cudaMalloc((void**) &d_map_x, memSize_float));
CUDA_SAFE_CALL(cudaMalloc((void**) &d_map_y, memSize_float));
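For context, the 24 MByte figure follows from storing one float per pixel. A minimal sketch of that allocation, assuming a hypothetical 3000 x 2000 layout (the actual image dimensions are not stated here) and a stand-in CHECK_CUDA macro in place of CUDA_SAFE_CALL:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in error check (assumption; the post uses CUDA_SAFE_CALL from the SDK).
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Bytes needed for a map holding one float per pixel.
static size_t mapBytes(int width, int height) {
    return static_cast<size_t>(width) * height * sizeof(float);
}

int main() {
    // Hypothetical dimensions that give 6 megapixels.
    const int width = 3000;
    const int height = 2000;
    const size_t memSize_float = mapBytes(width, height);

    float* d_map_x = nullptr;
    float* d_map_y = nullptr;
    CHECK_CUDA(cudaMalloc((void**)&d_map_x, memSize_float));
    CHECK_CUDA(cudaMalloc((void**)&d_map_y, memSize_float));

    printf("Each map occupies %zu bytes\n", memSize_float);

    CHECK_CUDA(cudaFree(d_map_x));
    CHECK_CUDA(cudaFree(d_map_y));
    return 0;
}
```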
After the kernel has finished computing the maps, I need them to do the
undistortion in a different kernel.
The maps are only computed once, but the undistortion is performed 30 times per second (HD video).
Here comes my problem:
Transferring the image (~18 MByte) from device to host is much slower than from host to device.
I only get around 669 MB/sec hostToDevice and 133 MB/sec deviceToHost.
I played around with some settings to boost the slow transfer, but nothing has worked so far.
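For anyone who wants to reproduce such numbers, here is a minimal sketch of timing both copy directions with CUDA events. It is not the code from the application: the buffer size, the helper names timeCopy and megabytesPerSecond, and the CHECK_CUDA macro are assumptions made for illustration, and it uses ordinary pageable host memory from malloc.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Stand-in error check (assumption, not the SDK's CUDA_SAFE_CALL).
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Throughput in MB/sec (1 MB = 1e6 bytes) for a copy that took `ms`.
static double megabytesPerSecond(size_t bytes, float ms) {
    return (bytes / 1.0e6) / (ms / 1.0e3);
}

// Time a single synchronous cudaMemcpy with CUDA events; returns milliseconds.
static float timeCopy(void* dst, const void* src, size_t bytes,
                      cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    CHECK_CUDA(cudaEventRecord(start, 0));
    CHECK_CUDA(cudaMemcpy(dst, src, bytes, kind));
    CHECK_CUDA(cudaEventRecord(stop, 0));
    CHECK_CUDA(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 18u * 1000u * 1000u;  // ~18 MByte image
    unsigned char* h_img = static_cast<unsigned char*>(malloc(bytes));
    if (h_img == nullptr) return EXIT_FAILURE;
    memset(h_img, 0, bytes);

    unsigned char* d_img = nullptr;
    CHECK_CUDA(cudaMalloc((void**)&d_img, bytes));

    const float h2d = timeCopy(d_img, h_img, bytes, cudaMemcpyHostToDevice);
    const float d2h = timeCopy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);
    printf("hostToDevice: %.1f MB/sec\n", megabytesPerSecond(bytes, h2d));
    printf("deviceToHost: %.1f MB/sec\n", megabytesPerSecond(bytes, d2h));

    CHECK_CUDA(cudaFree(d_img));
    free(h_img);
    return 0;
}
```

A single timed copy is noisy; averaging over several repetitions would give steadier figures.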
Then I tried freeing the memory for the two maps (d_map_x and d_map_y).
Now I get a deviceToHost transfer speed of 1390 MB/sec!
The memory is freed right after the execution of the kernel.
I really need to boost memory transfer performance, but I can’t free the memory of the maps, because then I can’t undistort the image :)
Can anybody help me with that?
I need a very fast memory transfer, since the image undistortion itself only takes 0.22 ms for a 6-megapixel image, but uploading and downloading the image takes 120 ms!