Hi everyone,
I’m having a hard time understanding why it takes so long to transfer data from host to device. The SDK bandwidth test tells me that I should be able to get transfer rates of about 2200 MB/s, but in practice I’m seeing about 150 MB/s. It’s almost as if there is some lag between setting up the transfer and the transfer actually occurring. At first I thought it was the issue described in this thread, where they discuss the GPU needing some setup time, but after trying that solution my problem remains.
I was hoping that someone could either explain what I’m doing wrong, or help me understand the results that I’m getting.
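For reference, what I tried based on that thread was roughly the following two lines at the top of main(), before anything is timed (I may be misremembering the exact call that was suggested, and since it made no noticeable difference I have left it out of the snippet below):
cudaSetDevice(0); // pick the GPU explicitly
cudaFree(0);      // dummy runtime call so context creation happens before any timing (my understanding of the suggestion)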
Below is a very simple piece of code (attached as a.cu, 754 bytes) that allocates 256 MB on the host, copies it to the GPU, and uses a CUDA event timer to measure how long that took. Also attached are the results from gprof, the /usr/bin/time command, and the CUDA command-line profiler. I’d appreciate any help or comments that anyone can give me.
#include <stdio.h>
#include <stdlib.h>   // for rand()

#define T 67108864    // this many 32-bit words makes up 256 MB

int main(void)
{
    int *a = (int*)malloc(T * sizeof(int));   // host buffer on the heap (256 MB is too large for the stack)
    int *d_a;
    cudaEvent_t start, stop;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // fill the host buffer with random data
    for (int i = 0; i < T; i++)
    {
        a[i] = rand();
    }

    // allocate device memory and copy the buffer across
    cudaMalloc((void**)&d_a, T * sizeof(int));
    cudaMemcpy(d_a, a, T * sizeof(int), cudaMemcpyHostToDevice);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Time to generate: %3.1f ms\n", elapsedTime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    free(a);
    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));   // report any errors
    return 0;
}
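For what it’s worth, here is a stripped-down variant I can also run, where the events bracket only the cudaMemcpy and the host buffer is pinned with cudaHostAlloc, in case pageable memory is part of the story. This is a quick sketch written for this post rather than the attached file, so the structure and names are my own:

#include <stdio.h>
#include <stdlib.h>

#define T 67108864    // 64M 32-bit words = 256 MB

int main(void)
{
    int *a;
    int *d_a;
    cudaEvent_t start, stop;

    cudaFree(0);   // warm-up: make sure context creation is not part of any timing

    // pinned (page-locked) host buffer instead of pageable memory
    cudaHostAlloc((void**)&a, T * sizeof(int), cudaHostAllocDefault);
    for (int i = 0; i < T; i++)
        a[i] = rand();

    cudaMalloc((void**)&d_a, T * sizeof(int));

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time only the host-to-device copy
    cudaEventRecord(start, 0);
    cudaMemcpy(d_a, a, T * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("memcpy only: %.1f ms (%.1f MB/s)\n", ms, 256.0f / (ms / 1000.0f));
    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFreeHost(a);
    return 0;
}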
My host machine is one of my school’s, an Intel Core 2 Quad Q9650 @ 3 GHz running CentOS 4.8 (64-bit). My GPU is a GTX 480 with the latest driver and toolkit (devdriver 260.19.26, toolkit 3.2.16, SDK 3.2.16).
I have compiled this code with the following options:
nvcc --profile -arch compute_20 -code sm_20 a.cu -o a.out
Then I ran my code with the following command, with the CUDA_PROFILE environment variable set to 1:
/usr/bin/time ./a.out
And I obtain the following results:
Command line output:
Time to generate: 1529.4 ms
CUDA error: no error
1.38user 0.69system 0:02.19elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+66610minor)pagefaults 0swaps
Summarized results from gprof (full results are attached as gprof.txt, 5.86 KB):
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
100.13 0.32 0.32 main
0.00 0.32 0.00 1 0.00 0.00 global constructors keyed to main
0.00 0.32 0.00 1 0.00 0.00 __sti____cudaRegisterAll_37_tmpxft_00005d0f_00000000_4_bw_cpp1_ii_main()
CUDA text profiler results:
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 480
# TIMESTAMPFACTOR fffff702ba82dac8
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 111919.234 ] cputime=[ 112464.000 ]
In other words, according to ‘time’ my simple program takes about 2.1 s in total; if all of that were the cost of moving 256 MB from host to device, it would be an effective bandwidth of about 122 MB/s. My CUDA event timer records 1.5 s for the data transfer, gprof says my host-only code accounts for an accumulated 0.32 s, and the CUDA profiler claims the memcpyHtoD itself took a mere 112 ms, which works out to roughly 2.3 GB/s and is right in line with the bandwidth test.
So my big question is: if my GPU timer measured 1.5 s and the CUDA profiler says the transfer itself only took 112 ms, where did the rest of that time go, and how do I get it back?!
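In case it helps narrow things down, I could also break the timing up per call with plain host-side timers, something like this rough sketch (untested as written):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define T 67108864    // 64M 32-bit words = 256 MB

// wall-clock time in milliseconds
static double wallMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    int *a = (int*)malloc(T * sizeof(int));
    int *d_a;

    double t0 = wallMs();
    for (int i = 0; i < T; i++)
        a[i] = rand();
    double t1 = wallMs();
    cudaMalloc((void**)&d_a, T * sizeof(int));   // first CUDA call here, so this also pays for context creation
    double t2 = wallMs();
    cudaMemcpy(d_a, a, T * sizeof(int), cudaMemcpyHostToDevice);   // blocking for pageable host memory
    double t3 = wallMs();

    printf("rand() fill: %.1f ms\n", t1 - t0);
    printf("cudaMalloc:  %.1f ms\n", t2 - t1);
    printf("cudaMemcpy:  %.1f ms\n", t3 - t2);

    cudaFree(d_a);
    free(a);
    return 0;
}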
I would greatly appreciate anyone’s input or advice!
Karl Leboeuf