cudaFree() boosts bandwidth performance: unwanted feature :)

Hi,

I encountered something weird today. I am undistorting a 6 megapixel image.

I need to compute two maps of undistortion parameters, each around 24 MByte in size, like this:

float* d_map_x;
float* d_map_y;

// memSize_float = map size in bytes (width * height * sizeof(float), ~24 MByte)
CUDA_SAFE_CALL(cudaMalloc((void**) &d_map_x, memSize_float));
CUDA_SAFE_CALL(cudaMalloc((void**) &d_map_y, memSize_float));

After the kernel has finished computing the maps, I need them in a different kernel to do the undistortion.

The maps are only computed once, but the undistortion is performed 30 times per second (HD video).
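
Per frame the flow looks roughly like this (just a sketch; h_src, d_src, d_dst, h_dst, imgBytes and undistortKernel are placeholder names, the maps are the ones allocated above):

// upload the current frame, undistort it, download the result
CUDA_SAFE_CALL(cudaMemcpy(d_src, h_src, imgBytes, cudaMemcpyHostToDevice));
undistortKernel<<<grid, block>>>(d_dst, d_src, d_map_x, d_map_y, width, height);
CUDA_SAFE_CALL(cudaMemcpy(h_dst, d_dst, imgBytes, cudaMemcpyDeviceToHost));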

Here comes my problem:
The memory transfer of the image (~18 MByte) device to host is much slower than host to device.
I only get around 669 MB/sec hostToDevice and 133 MB/sec deviceToHost.

I played around with some settings to boost the slow transfer, but nothing has worked so far.

Then I tried to free the memory for the maps:
CUDA_SAFE_CALL(cudaFree(d_map_x));
CUDA_SAFE_CALL(cudaFree(d_map_y));

Now I get a deviceToHost transfer speed of 1390 MB/sec!

The memory is freed right after the kernel execution.
I really need to boost memory transfer performance, but I can't free the maps' memory, or else I can't undistort the image :)

Can anybody help me with that?
I need a very fast memory transfer, since the image undistortion itself only takes 0.22 ms for a 6 megapixel image, but the image upload and download take 120 ms!!

Are you using DMA host memory for the transfer (a.k.a. “pinned” memory)? In that case, 2*24 MB for the buffers is a pretty big block for the DMA mapper.

Peter

What exactly is this “pinned” memory?
Can I explicitly specify DMA memory transfer?

This is what I do:

CUDA_SAFE_CALL(cudaMemcpy(dst_data, d_dst_data, memSize_char*nChannels, cudaMemcpyDeviceToHost));  // readback: device (d_dst_data) -> host (dst_data)

dst_data and d_dst_data are both unsigned char*

Peter,

I just took a quick look at the bandwidthTest code from the SDK, and now I understand what is meant by “pinned” memory.

Thanks for the hint! I'll try out the pinned malloc version and take a look at the results.
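
If I read the bandwidthTest code right, the change on my side is just allocating the host image buffer with cudaMallocHost instead of malloc. A minimal sketch (h_img and imgBytes are placeholder names, not my real variables):

unsigned char* h_img = NULL;
size_t imgBytes = 18 * 1024 * 1024;   // ~18 MByte image, as above

// page-locked ("pinned") host memory that the GPU can DMA to directly
CUDA_SAFE_CALL(cudaMallocHost((void**) &h_img, imgBytes));

// the same cudaMemcpy call then takes the fast DMA path
CUDA_SAFE_CALL(cudaMemcpy(h_img, d_dst_data, imgBytes, cudaMemcpyDeviceToHost));

CUDA_SAFE_CALL(cudaFreeHost(h_img));   // pinned memory is freed with cudaFreeHost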

To me 669 MB/sec sounds reasonable, but 133 MB/sec is definitely too low. The readback speed should be about the same as the download speed, although I usually see a slightly lower value for readback. You can't expect much more than about 1 GB/sec, so the whole problem might not be suitable for GPU processing, unless you intend to use some more complicated filters later on.
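
One way to double-check the numbers is to time each direction with CUDA events (a minimal sketch; h_dst, d_dst and imgBytes are placeholders):

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
CUDA_SAFE_CALL(cudaEventCreate(&start));
CUDA_SAFE_CALL(cudaEventCreate(&stop));

CUDA_SAFE_CALL(cudaEventRecord(start, 0));
CUDA_SAFE_CALL(cudaMemcpy(h_dst, d_dst, imgBytes, cudaMemcpyDeviceToHost));
CUDA_SAFE_CALL(cudaEventRecord(stop, 0));
CUDA_SAFE_CALL(cudaEventSynchronize(stop));
CUDA_SAFE_CALL(cudaEventElapsedTime(&elapsed_ms, start, stop));

// bytes / seconds -> MB/sec
float mb_per_sec = (imgBytes / (1024.0f * 1024.0f)) / (elapsed_ms / 1000.0f);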

I think so too; 1 GB/sec is insufficient for this.
I’d have to put some additional computations on the GPU to hide the memory transfer latency.

Thank you very much!

Yeah, try that. But be aware that locked memory pages are a scarce resource in the system, so I would not use 2*24 MB pinned buffers. This goes for all operating systems, btw. I would use one working buffer (a “staging buffer”) throughout the app to transfer everything to/from the GPU, along the lines of the sketch below. Sizes of up to 16 MB did the trick for me and seem to be acceptable for a range of machines.
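
I would do it roughly like this (a sketch; the function names are made up, CUDA_SAFE_CALL is from cutil.h as in your snippets):

#include <string.h>   // memcpy

#define STAGE_BYTES (16 * 1024 * 1024)   // one 16 MB pinned staging buffer

static unsigned char* h_stage = NULL;

void initStaging(void)
{
    // allocate the page-locked staging buffer once at startup
    CUDA_SAFE_CALL(cudaMallocHost((void**) &h_stage, STAGE_BYTES));
}

// copy a device buffer of any size into ordinary (pageable) host memory,
// going through the pinned staging buffer in chunks
void downloadViaStage(unsigned char* dst, const unsigned char* d_src, size_t nBytes)
{
    size_t done = 0;
    while (done < nBytes) {
        size_t chunk = nBytes - done;
        if (chunk > STAGE_BYTES) chunk = STAGE_BYTES;
        CUDA_SAFE_CALL(cudaMemcpy(h_stage, d_src + done, chunk,
                                  cudaMemcpyDeviceToHost));   // fast DMA copy
        memcpy(dst + done, h_stage, chunk);                   // plain CPU copy
        done += chunk;
    }
}

The upload direction works the same way with the roles reversed.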

I get 3.4 GB/sec up/down using DMA locked memory. However, it seems that you need an NVIDIA chipset on the mobo to get that far. See other posts on this forum about various bandwidth issues.

Peter