Large overhead on cudaMemcpy, isolated case


I have written a small CUDA program that performs Descrete Cosine Transform (DCT) of image frames of varying sizes, and I see an interesting result that I am unable to explain.

My testbed consists of two separate computers:

  • A Dell XPS with a GM8400M GPU.
  • An HP Compaq with a GTX280 and an NVS 295. The NVS card renders the display.

Please find attached my test results of DCT of varying matrix sizes, profile_detailed.pdf. The plot shows the individual steps taken to perform the transformation on different GPUs and CPUs. Notice that, for the GPUs that render the display (the GM8400M and the NVS295) the memory copy overhead is very large when compared to the card that isn’t used for display rendering (the GTX280).

A quick look at the raw test data shows that, for both cases, the problem is the memory copy from device to host (the data has been simplified to shorten the size of the post):

Running on GeForce 8400M GS
-> Performing DCT:width=512,height=512
cudaMemcpy(data_d, data, width * height * sizeof(float), cudaMemcpyHostToDevice), time=0.000708327978
cudaMemcpy(out, odata_d, width * height * sizeof(float), cudaMemcpyDeviceToHost), time=0.174465611577

Running on Quadro NVS 295
-> Performing DCT:width=512,height=512
cudaMemcpy(data_d, data, width * height * sizeof(float), cudaMemcpyHostToDevice), time=0.000250003010
cudaMemcpy(out, odata_d, width * height * sizeof(float), cudaMemcpyDeviceToHost), time=0.208029866219

For the transfer size in question (1 MB), this corresponds to a bandwidth of about 5 MB/s for device to host transfers. This is directly in contradiction with a simple bandwidth test from the SDK example:

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)

992000 2459.8
996000 2467.2
1000000 2479.7
1004000 2538.4
1008000 2543.8

So, the question is why bandwidths of this type is so much lower for the GPU that runs the display. Another observation I have made regarding this is that the GPU that renders the display has a limited memory pool at around 30 - 45 MB. The GPU DCT test looks like this:

RUNTEST(cudaMalloc((void**)&cosTable, sizeof(float) * 8 * 8 * 8 * 8))
RUNTEST(cudaMalloc((void**)&qntTable, sizeof(float) * 8 * 8))
RUNTEST(cudaMalloc((void**)&data_d,   sizeof(float) * width * height))
RUNTEST(cudaMalloc((void**)&odata_d,  sizeof(float) * width * height))

RUNTESTKERN(gpuCalcCosineTable<<<cos_blockDim, cos_threadDim>>>(cosTable))
RUNTESTKERN(gpuCalcNormTable<<<qnt_blockDim, qnt_threadDim>>>(qntTable))

RUNTEST(cudaMemcpy(data_d, data, width * height * sizeof(float), cudaMemcpyHostToDevice))

RUNTESTKERN(gpuSubtract<<<mNrm_blockDim, mNrm_threadDim>>>(data_d, width, height))
RUNTESTKERN(gpuInnerLoop<<<dctI_blockDim, dctI_threadDim>>>(cosTable, data_d, odata_d, width, height))

RUNTESTKERN(gpuNormalize<<<dctN_blockDim, dctN_threadDim>>>(odata_d, qntTable, width))

RUNTEST(cudaMemcpy(out, odata_d, width * height * sizeof(float), cudaMemcpyDeviceToHost))


profile_detailed.pdf (21.6 KB)

The device to host transfer time includes the time for waiting on the calculation to finish. If you want to measure transfer time alone, insert a cudaDeviceSynchronize() call before timing the device to host transfer.

Nice. I didn’t know that the execution configuration syntax returned immediately…