How can I improve the memory allocation rate and the data transfer rate from host to device and device to host?

Hardware and software I use:
NVIDIA GeForce 8600 GT
cudatoolkit_2.3_win_32
cudasdk_2.3_win_32
cudadriver_2.3_winxp_32_190.38_general

The test results in my code are as follows:
cudaMalloc((void**)&d_date, 3400*3400) ----- consuming 43 ms
cudaMemcpy( h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost) ----- consuming 20 ms

How can I improve these speeds? Thanks for your reply.

As a starting point, run the bandwidthTest application from the CUDA SDK and post the results. That will provide a standard measure of the performance of your host/GPU.

Running on…
device 0: GeForce 8600 GT

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1571.7

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1312.2

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                14251.1

OK, so those pageable numbers are a little on the low side for a PCI-e 1.0 host/card combination, but not so low that anything is obviously wrong. That makes me think there could be a problem with the way you are timing those memory management functions in your code.
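One standard way to improve on those pageable numbers is to allocate the host buffer with cudaMallocHost() (page-locked, "pinned" memory) instead of malloc(); pinned transfers can be DMA'd directly, which is what bandwidthTest's pinned mode measures. A minimal sketch, reusing the buffer size from the post above (the variable names here are placeholders, not from the original code):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 3400 * 3400;    // same size as in the original post
    unsigned char *h_pinned, *d_buf;

    // Page-locked host allocation: the driver can DMA to/from this buffer
    // directly instead of staging through an intermediate pageable copy.
    cudaMallocHost((void**)&h_pinned, size);
    cudaMalloc((void**)&d_buf, size);

    cudaMemcpy(d_buf, h_pinned, size, cudaMemcpyHostToDevice);
    cudaMemcpy(h_pinned, d_buf, size, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);   // pinned memory must be freed with cudaFreeHost
    return 0;
}
```

Don't overuse pinned memory, though; page-locked allocations reduce the memory available to the rest of the system.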

1. The information about my computer:
PCI-e 2.0 host
PCI-e 1.0 card
If I use a PCI-e 2.0 card, could the transfer rate be improved?

2. The method I use to test the speed:
unsigned int timer = 0;
cutilCheckError( cutCreateTimer( &timer));
cutilCheckError( cutStartTimer( timer));
cudaMemcpy( h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
cutilCheckError( cutStopTimer( timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
printf("%.2f Mpixels/sec\n", (Width*Height / (cutGetTimerValue( timer) / 1000.0f)) / 1e6);
cutilCheckError( cutDeleteTimer( timer));

A PCI-e 2.0 card would be about twice as fast. Your posted cudaMemcpy() timing suggests you should be able to transfer about 26MB in the 20ms you measure, at the 1300MB/s peak device-to-host bandwidth. But I am guessing the amount of data you are actually transferring is much less than that, which is why I asked about the timing. Is there a kernel execution before the cudaMemcpy() call in your code?

You are right, cudaMemcpy() is called after a kernel executes in my code. What's wrong with that?
I also have another question: what affects the speed of memory allocation?
cudaMalloc((void**)&d_date, 3400*3400) ----- consuming 43 ms

thanks!

Nothing, except that it probably means your measurement isn't the time for the memcpy call alone, but the time for both the kernel execution and the memcpy. CUDA kernel launches are non-blocking, but copies are blocking. Try this code for timing your memcpy call instead:

cutilSafeCall( cudaThreadSynchronize() );  // wait for any preceding kernel to finish

unsigned int timer = 0;
cutilCheckError( cutCreateTimer( &timer));
cutilCheckError( cutStartTimer( timer));
cudaMemcpy( h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
cutilCheckError( cutStopTimer( timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
printf("%.2f Mpixels/sec\n", (Width*Height / (cutGetTimerValue( timer) / 1000.0f)) / 1e6);
cutilCheckError( cutDeleteTimer( timer));

The added call to cudaThreadSynchronize() makes the host block until the kernel completes execution, so that your memcpy() timing really measures only the copy time. (Note that cudaThreadSynchronize() returns a cudaError_t, so it should be checked with cutilSafeCall() rather than cutilCheckError(), which is for the cutil timer functions.)
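An alternative to the cutil timers is CUDA event timing: events are recorded into the GPU's command stream, so the elapsed time brackets exactly the work between them. A sketch, assuming d_date, h_ZoomImg and size are as in the code above:

```cuda
cudaEvent_t start, stop;
float elapsed_ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaThreadSynchronize();            // make sure the preceding kernel is done
cudaEventRecord(start, 0);
cudaMemcpy(h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // block until the stop event has completed

cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Copy time: %f (ms)\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Event timing has the advantage of measuring on the GPU's clock, so it is unaffected by host-side scheduling jitter.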

You are right again.
The cudaMemcpy timing in bandwidthTest is the same as the result in my code, so I think I have to upgrade the hardware to improve the transfer rate.

What do you think about my second question, the memory allocation rate?

Hi, avidday!
You seem to have ignored that cudaMemcpy from device to host is blocking, which would mean that testing the cudaMemcpy timing with cudaThreadSynchronize is wrong.