Very slow memory transfer problem Simple program executes very slowly, bandwidth test shows normal r

kleboeuf · February 7, 2011, 9:55pm

Hi everyone,

I’m having a hard time understanding why it takes so long to transfer data from host to device. The SDK bandwidth test tells me that I should be able to get transfer rates of about 2200 MB/s from device to host, but in practice I’m getting speeds of about 150 MB/s. It’s almost as if there is some lag between setting up the transfer and the transfer actually occurring. At first I thought it was the issue described in this thread, where they discuss the GPU needing some setup time, but after trying that solution my problem remains.

I was hoping that someone could either explain what I’m doing wrong, or help me understand the results that I’m getting.

Below is a snippet of very simple code (attached as a.cu, 754 bytes) that allocates and copies 256 MB from the host to the GPU, and uses an event timer to measure how long it took. Also attached are the results from gprof, the /usr/bin/time command, and the CUDA command-line profiler. I’d appreciate any help or comments that anyone can give me.

#include <stdio.h>

#define T 67108864 //this many 32-bit words makes up 256 MB

int main( void )
{
    int a[T];
    int* d_a;

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0);

    for (int i = 0; i<T; i++)
    {
        a[i] = rand();
    }

    cudaMalloc( (void**)&d_a, T*sizeof(int));
    cudaMemcpy(d_a,a,T*sizeof(int),cudaMemcpyHostToDevice);

    cudaEventRecord( stop, 0);
    cudaEventSynchronize( stop );

    float elapsedTime;
    cudaEventElapsedTime( &elapsedTime, start, stop);

    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    cudaFree( d_a);

    printf("CUDA error: %s\n",cudaGetErrorString(cudaGetLastError())); //report any errors
    return 0;
}

My host machine is my school’s Intel Core 2 Quad Q9650 @ 3 GHz, and I am running CentOS 4.8 (64-bit). My GPU is the GTX 480, which is using the latest drivers (devdriver 260.19.26, toolkit 3.2.16, SDK 3.2.16).

I have compiled this code with the following options:

nvcc --profile -arch compute_20 -code sm_20 a.cu -o a.out

Then I ran my code with the following command, with the CUDA_PROFILE environment variable set to 1:

/usr/bin/time a.out

And I obtain the following results:

Command line output:

Time to generate: 1529.4 ms
CUDA error: no error
1.38user 0.69system 0:02.19elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+66610minor)pagefaults 0swaps

Summarized results from gprof (full results are attached as gprof.txt, 5.86 KB):

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.13      0.32     0.32                             main
  0.00      0.32     0.00        1     0.00     0.00  global constructors keyed to main
  0.00      0.32     0.00        1     0.00     0.00  __sti____cudaRegisterAll_37_tmpxft_00005d0f_00000000_4_bw_cpp1_ii_main()

CUDA text profiler results:

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 480
# TIMESTAMPFACTOR fffff702ba82dac8
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 111919.234 ] cputime=[ 112464.000 ]

In other words, according to ‘time’ my simple program takes about 2.1 seconds to transfer 256 MB from host to device; an effective bandwidth of 122 MB/s. My GPU timer records 1.5s for the data transfer alone, and gprof says that my host-only code runs for an accumulated 0.32s. Finally, the CUDA profiler claims that the memory transfer itself took a mere 112ms.

So my big question is: If my GPU timer measured 1.5s, and the CUDA profiler says that the transfer only took 112ms, where did the rest of this time go, and how do I get it back!?

I would greatly appreciate anyone’s input or advice!

Karl Leboeuf

njuffa · February 7, 2011, 10:30pm

I assume you did not mean to include the random number generation in the timed portion of your code. Initializing the array with random numbers probably represents the largest portion of the time measured, and the call to cudaMemcpy() only a smallish fraction. You might also want to try timing the cudaMemcpy() call in a loop and reporting the fastest time, similar to the timing methodology used by STREAM.
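The suggestion above could be sketched roughly as follows. This is only an illustration, not code posted in the thread: the repetition count `reps`, the helper `bandwidth_mb_per_s`, and the switch to a heap-allocated host buffer are all assumptions made for the example. The event timer brackets only the cudaMemcpy() call, and the fastest of several repetitions is reported.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): bandwidth in MB/s, with 1 MB = 2^20 bytes.
static double bandwidth_mb_per_s(size_t bytes, float ms)
{
    return (bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}

int main(void)
{
    const size_t n = 67108864;             // 32-bit words, 256 MB as in the original post
    const size_t bytes = n * sizeof(int);
    const int reps = 10;                   // assumed repetition count

    // Assumption: allocate on the heap, since 256 MB usually exceeds the default stack limit.
    int *a = (int *)malloc(bytes);
    if (a == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }

    // Initialization happens before any timing starts.
    for (size_t i = 0; i < n; i++) {
        a[i] = rand();
    }

    int *d_a = NULL;
    cudaError_t err = cudaMalloc((void **)&d_a, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        free(a);
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time only the copy, several times, and keep the fastest run.
    float best_ms = -1.0f;
    for (int r = 0; r < reps; r++) {
        cudaEventRecord(start, 0);
        cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (best_ms < 0.0f || ms < best_ms) {
            best_ms = ms;
        }
    }

    printf("Fastest copy: %.1f ms (%.0f MB/s)\n",
           best_ms, bandwidth_mb_per_s(bytes, best_ms));
    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    free(a);
    return 0;
}
```

Reporting the minimum rather than the mean tends to filter out one-off costs such as first-use setup, which is the reasoning behind the STREAM-style methodology mentioned above.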

kleboeuf · February 7, 2011, 11:37pm

Thank you so much for your remarks; I had been staring at this all afternoon without putting it together. I moved my timer initialization so that it would start after my array was initialized, and now I am getting the transfer rates that I was expecting.
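For reference, the figures posted earlier in the thread are consistent with this explanation. The following is a rough worked calculation, assuming the profiler's gputime is in microseconds (which matches the 112 ms quoted in the original post) and taking 1 MB as 2^20 bytes:

```latex
\begin{align*}
\text{copy alone (profiler)} &: \frac{256\ \text{MB}}{0.11192\ \text{s}} \approx 2287\ \text{MB/s} \\
\text{event timer, original placement} &: \frac{256\ \text{MB}}{1.5294\ \text{s}} \approx 167\ \text{MB/s} \\
\text{timed region outside the copy} &: 1529.4\ \text{ms} - 111.9\ \text{ms} = 1417.5\ \text{ms}
\end{align*}
```

The first line is close to the roughly 2200 MB/s reported by the SDK bandwidth test, and the last line is the time that was being charged to the transfer but spent elsewhere in the timed region, mostly, per the reply above, in filling the array with rand().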
