How can I improve the memory allocation rate and the data transfer rate from host to device and device to host?

Hardware and software I use:
NVIDIA GeForce 8600 GT
cudatoolkit_2.3_win_32
cudasdk_2.3_win_32
cudadriver_2.3_winxp_32_190.38_general

The test results in my code are as follows:
cudaMalloc((void**)&d_date, 3400*3400) ----- consuming 43 ms
cudaMemcpy( h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost) ----- consuming 20 ms

How can I improve these speeds? Thanks for your reply.

As a starting point, run the bandwidthTest application from the CUDA SDK and post the results. That will provide a standard measure of the performance of your host/GPU.

Running on…
device 0: GeForce 8600 GT

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1571.7

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1312.2

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                14251.1

OK, so those pageable numbers are a little on the low side for a PCI-e 1.0 host/card combination, but not so low that anything is obviously wrong. That makes me think there could be a problem with the way you are timing those memory management functions in your code.
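One standard way to improve on those pageable numbers is to allocate the host buffer with cudaMallocHost() (page-locked, "pinned" memory) instead of malloc(); pinned transfers can be DMA'd directly, which is what bandwidthTest's pinned mode measures. A minimal sketch, reusing the buffer size from the post above (the variable names here are placeholders, not from the original code):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 3400 * 3400;    // same size as in the original post
    unsigned char *h_pinned, *d_buf;

    // Page-locked host allocation: the driver can DMA to/from this buffer
    // directly instead of staging through an intermediate pageable copy.
    cudaMallocHost((void**)&h_pinned, size);
    cudaMalloc((void**)&d_buf, size);

    cudaMemcpy(d_buf, h_pinned, size, cudaMemcpyHostToDevice);
    cudaMemcpy(h_pinned, d_buf, size, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);   // pinned memory must be freed with cudaFreeHost
    return 0;
}
```

Don't overuse pinned memory, though; page-locked allocations reduce the memory available to the rest of the system.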

1. The information about my computer:
PCI-e 2.0 host
PCI-e 1.0 card
If I use a PCI-e 2.0 card, could the transfer rate be improved?

2. The method I use to test the speed:
unsigned int timer = 0;
cutilCheckError( cutCreateTimer( &timer));
cutilCheckError( cutStartTimer( timer));
cudaMemcpy( h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
cutilCheckError( cutStopTimer( timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
printf("%.2f Mpixels/sec\n", (Width*Height / (cutGetTimerValue( timer) / 1000.0f)) / 1e6);
cutilCheckError( cutDeleteTimer( timer));

A PCI-e 2.0 card would be about twice as fast. Your posted cudaMemcpy() timing suggests you should be able to transfer about 26MB in the 20ms you measure, at the 1300MB/s peak device-to-host bandwidth. But I am guessing the amount of data you are actually transferring is much less than that, which is why I asked about the timing. Is there a kernel execution before the cudaMemcpy() call in your code?

You are right, cudaMemcpy() is called after a kernel executes in my code. What's wrong with that?
I also have another question: what affects the speed of memory allocation?
cudaMalloc((void**)&d_date, 3400*3400) ----- consuming 43 ms

thanks!

Nothing, except that it probably means your measurement isn't the time for the memcpy call alone, but the time for both the kernel execution and the memcpy. CUDA kernel launches are non-blocking, but copies are blocking. Try this code for timing your memcpy call instead:

cutilSafeCall( cudaThreadSynchronize() );  // wait for any preceding kernel to finish

unsigned int timer = 0;
cutilCheckError( cutCreateTimer( &timer));
cutilCheckError( cutStartTimer( timer));
cudaMemcpy( h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
cutilCheckError( cutStopTimer( timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
printf("%.2f Mpixels/sec\n", (Width*Height / (cutGetTimerValue( timer) / 1000.0f)) / 1e6);
cutilCheckError( cutDeleteTimer( timer));

The added call to cudaThreadSynchronize() makes the host block until the kernel completes execution, so that your memcpy() timing really measures only the copy time. (Note that cudaThreadSynchronize() returns a cudaError_t, so it should be checked with cutilSafeCall() rather than cutilCheckError(), which is for the cutil timer functions.)
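An alternative to the cutil timers is CUDA event timing: events are recorded into the GPU's command stream, so the elapsed time brackets exactly the work between them. A sketch, assuming d_date, h_ZoomImg and size are as in the code above:

```cuda
cudaEvent_t start, stop;
float elapsed_ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaThreadSynchronize();            // make sure the preceding kernel is done
cudaEventRecord(start, 0);
cudaMemcpy(h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // block until the stop event has completed

cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Copy time: %f (ms)\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Event timing has the advantage of measuring on the GPU's clock, so it is unaffected by host-side scheduling jitter.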

You are right again.
The cudaMemcpy timing in bandwidthTest is the same as the result in my code, so I think I have to upgrade the hardware to improve the transfer rate.

What do you think about my second question, the memory allocation rate?

Hi, avidday!
You seem to have ignored that cudaMemcpy from device to host is blocking, which would mean that testing the cudaMemcpy timing with cudaThreadSynchronize is wrong.