Very slow memory transfer problem
Simple program executes very slowly, bandwidth test shows normal r

Hi everyone,

I’m having a hard time understanding why it takes so long to transfer data from host to device. The SDK bandwidth test tells me that I should be able to get transfer rates of about 2200 MB/s from device to host, but in practice I’m getting speeds of about 150 MB/s. It’s almost as if there is some lag between setting up the transfer and the transfer actually occurring. At first I thought it was the issue described in this thread, where they discuss the GPU needing some setup time, but after trying that solution my problem still remains.

I was hoping that someone could either explain what I’m doing wrong, or help me understand the results that I’m getting.

Below is a snippet of very simple code that allocates and copies 256 MB from the host to the GPU, and uses an event timer to measure how long it took. Also attached are the results from gprof, the /usr/bin/time command, and the CUDA command-line profiler. I’d appreciate any help or comments that anyone can give me.

#include <stdio.h>
#include <stdlib.h> // rand()

#define T 67108864 // this many 32-bit words makes up 256 MB

int main( void )
{
    int a[T];
    int* d_a;

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0 );

    for (int i = 0; i < T; i++)
    {
        a[i] = rand();
    }

    cudaMalloc( (void**)&d_a, T*sizeof(int) );
    cudaMemcpy( d_a, a, T*sizeof(int), cudaMemcpyHostToDevice );

    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );

    float elapsedTime;
    cudaEventElapsedTime( &elapsedTime, start, stop );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    cudaFree( d_a );

    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError())); // report any errors
    return 0;
}

My host machine is my school’s Intel Core 2 Quad Q9650 @ 3 GHz, and I am running CentOS 4.8 (64-bit). My GPU is the GTX 480, which is using the latest drivers (devdriver 260.19.26, toolkit 3.2.16, SDK 3.2.16).

I have compiled this code with the following options:

nvcc --profile -arch compute_20 -code sm_20 -o a.out

Then I ran my code with the following command, with the CUDA_PROFILE environment variable set to 1:

/usr/bin/time a.out

And I obtain the following results:

Command line output:

Time to generate: 1529.4 ms
CUDA error: no error
1.38user 0.69system 0:02.19elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+66610minor)pagefaults 0swaps

Summarized results from gprof (full results are attached in gprof.txt):


Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
100.13      0.32     0.32                             main
  0.00      0.32     0.00        1     0.00     0.00  global constructors keyed to main
  0.00      0.32     0.00        1     0.00     0.00  __sti____cudaRegisterAll_37_tmpxft_00005d0f_00000000_4_bw_cpp1_ii_main()

CUDA text profiler results:


# CUDA_DEVICE 0 GeForce GTX 480
# TIMESTAMPFACTOR fffff702ba82dac8
method=[ memcpyHtoD ] gputime=[ 111919.234 ] cputime=[ 112464.000 ]

In other words, according to ‘time’ my simple program takes about 2.1 seconds to transfer 256 MB from host to device; an effective bandwidth of 122 MB/s. My GPU timer records 1.5s for the data transfer alone, and gprof says that my host-only code runs for an accumulated 0.32s. Finally, the CUDA profiler claims that the memory transfer itself took a mere 112ms.
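To make the arithmetic behind these figures explicit, here is a small sketch in C that converts each reported time into the effective bandwidth it implies for a 256 MB transfer. The helper names `bandwidth_mb_per_s` and `print_reported_bandwidths` are made up for illustration and are not part of CUDA. The sketch uses the 2.19 s elapsed time from the `time` output and assumes the profiler's gputime is in microseconds (111919.234 us, about 112 ms).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of CUDA): effective bandwidth in MB/s
   for `megabytes` moved in `elapsed_ms` milliseconds. */
double bandwidth_mb_per_s(double megabytes, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);  /* a zero or negative time is meaningless */
    return megabytes / (elapsed_ms / 1000.0);
}

/* Print the bandwidth implied by each timing reported in this post. */
void print_reported_bandwidths(void)
{
    const double mb = 256.0;  /* 67108864 ints * 4 bytes each */

    printf("time (wall clock): %.0f MB/s\n", bandwidth_mb_per_s(mb, 2190.0));
    printf("event timer:       %.0f MB/s\n", bandwidth_mb_per_s(mb, 1529.4));
    printf("CUDA profiler:     %.0f MB/s\n", bandwidth_mb_per_s(mb, 111.919234));
}
```

Only the profiler figure comes out in the range the SDK bandwidth test reports, which suggests the other two timers are measuring more than the copy.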

So my big question is: If my GPU timer measured 1.5s, and the CUDA profiler says that the transfer only took 112ms, where did the rest of this time go, and how do I get it back!?

I would greatly appreciate anyone’s input or advice!

Karl Leboeuf

I assume you did not mean to include the random number generation in the timed portion of your code. Initializing the array with random numbers probably represents the largest portion of the time measured, and the call to cudaMemcpy() only a smallish fraction. You might also want to try timing the cudaMemcpy() call in a loop and reporting the fastest time, similar to the timing methodology used by STREAM.
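A rough sketch of that methodology in plain C, under some loud assumptions: the names `copy_job`, `run_copy`, `best_time_ms`, and `time_copy_only` are hypothetical, and a host-side memcpy() stands in for the cudaMemcpy() call so that the sketch builds without the CUDA toolkit. The two points it illustrates are that the array is filled before any timing starts, and that the copy is repeated with only the fastest run reported.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical job description: one copy of `bytes` bytes from src to dst. */
struct copy_job {
    void *dst;
    const void *src;
    size_t bytes;
};

/* Stand-in for the transfer being measured. In the CUDA program this would
   be the cudaMemcpy() call; a host memcpy is used here only so the sketch
   does not depend on the CUDA runtime. */
void run_copy(void *ctx)
{
    struct copy_job *job = (struct copy_job *)ctx;
    memcpy(job->dst, job->src, job->bytes);
}

/* Run op(ctx) `repeats` times and return the fastest run in milliseconds.
   Only the call itself sits between the two clock readings. clock() reports
   processor time, which is adequate for this host-only stand-in; for a real
   GPU transfer the cudaEvent timers from the original program belong here. */
double best_time_ms(void (*op)(void *), void *ctx, int repeats)
{
    double best = -1.0;

    assert(repeats > 0);
    for (int r = 0; r < repeats; r++) {
        clock_t t0 = clock();
        op(ctx);
        clock_t t1 = clock();
        double ms = 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}

/* Fill the source buffer outside the timed region, then time only the copy.
   Returns the best time in ms, or -1.0 if an allocation fails. */
double time_copy_only(size_t count, int repeats)
{
    int *src = (int *)malloc(count * sizeof *src);
    int *dst = (int *)malloc(count * sizeof *dst);
    double best = -1.0;

    if (src && dst) {
        for (size_t i = 0; i < count; i++)  /* setup: deliberately not timed */
            src[i] = rand();
        struct copy_job job = { dst, src, count * sizeof *src };
        best = best_time_ms(run_copy, &job, repeats);
    }
    free(src);
    free(dst);
    return best;
}
```

Reporting the minimum rather than the mean filters out one-off costs such as first-touch page faults or driver initialization, which otherwise inflate a single measurement.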

Thank you so much for your remarks; I had been staring at this all afternoon without putting it together. I moved my timer initialization so that it would start after my array was initialized, and now I am getting the transfer rates that I was expecting.