The hardware and software I use:
NVIDIA GeForce 8600 GT
cudatoolkit_2.3_win_32
cudasdk_2.3_win_32
cudadriver_2.3_winxp_32_190.38_general
The test results in my code are as follows:
cudaMalloc((void**)&d_date, 3400*3400) -----consuming 43ms
cudaMemcpy( h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost) ----consuming 20ms
How can I improve these speeds? Thanks for your reply.
As a starting point, run the bandwidthTest application from the CUDA SDK and post the results. That will give a standard measure of your host/GPU transfer performance.
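If I remember the SDK's options correctly, running it from the SDK's bin directory (the exact install path varies with your setup) with and without the pinned-memory option will give you both sets of numbers:

bandwidthTest.exe
bandwidthTest.exe --memory=pinned

The first run uses pageable host memory (the default); the second uses page-locked host memory, which is usually noticeably faster.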
OK, so those pageable numbers are a little on the low side for a PCI-e 1.0 host/card, but not so much that anything is obviously wrong, which makes me think there could be a problem with the way you are timing those memory management functions in your code.
A PCI-e 2.0 card would be about twice as fast. Your posted cudaMemcpy() timing implies you should be able to transfer about 26 MB in the 20 ms you measure, at the 1300 MB/s peak device-to-host bandwidth. But I am guessing the amount of data you are actually transferring is much less than that, which is why I asked about the timing. Is there a kernel execution before the cudaMemcpy() call in your code?
You are right, cudaMemcpy() is called after a kernel executes in my code. What's wrong with that?
I have another question: what affects the speed of memory allocation?
cudaMalloc((void**)&d_date, 3400*3400) -----consuming 43ms
Nothing, except that it probably means your time measurement isn't the time for the memcpy call, but the time for both the kernel execution and the memcpy. CUDA kernel launches are non-blocking, but copies are blocking. Try running code like this to time your memcpy call instead:
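Something along these lines; the kernel name and launch configuration below are just placeholders, and d_date, h_ZoomImg and size are assumed to be the same variables as in your code (you also need the usual <stdio.h>, <time.h> and CUDA runtime headers):

my_kernel<<<grid, block>>>(d_date);   // the launch returns immediately (asynchronous)
cudaThreadSynchronize();              // block the host until the kernel has finished

clock_t start = clock();
cudaMemcpy(h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
clock_t stop = clock();
printf("memcpy time: %.1f ms\n", 1000.0 * (stop - start) / CLOCKS_PER_SEC);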
The added call to cudaThreadSynchronize() will make the host block until the kernel completes execution, so that your memcpy() timing really measures only the copy time.
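For reference, another common way to time this is with CUDA events recorded in the default stream; events are timestamped on the GPU, so they are not fooled by the asynchronous kernel launch. This is only a sketch, again assuming your existing h_ZoomImg, d_date and size variables:

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                                    // queued after any prior kernel in stream 0
cudaMemcpy(h_ZoomImg, d_date, size, cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                                   // wait until the stop event has been recorded

cudaEventElapsedTime(&ms, start, stop);                       // elapsed time in milliseconds
printf("memcpy time: %.3f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);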
You are right again.
The cudaMemcpy timing from bandwidthTest is the same as the result in my code, so I think I have to upgrade the hardware to improve the transfer rate.
What do you think about my second question, regarding the memory allocation speed?
hi, avidday!
You cannot ignore that cudaMemcpy from device to host is blocking, which means that timing the cudaMemcpy without a cudaThreadSynchronize first is wrong.