copy memory slow?

hello

I’m very new to CUDA programming and I have a basic question. I’m trying to write a program that filters a 2D picture. The kernel works, but after some testing I found that copying the memory to and from the device takes much longer than executing all the threads on the device.

For a 1024x1024 picture (grayscale, 1 byte per pixel) it takes about 3 ms to execute the kernel and 2 x 80 ms to copy the memory (only 1 MB).

some of the code:

cudaMemcpy(pixelsIn_GPU,pixIn, datasize, cudaMemcpyHostToDevice); //copy the pixIn received from the picture to the device

cutResetTimer(hTimer);

cutStartTimer(hTimer); //start timer

Filter<<<1,nThreads>>>(pixelsIn_GPU,result_GPU,sizex,sizey,kernel_GPU,kernelMSize, nThreads); //start

cutStopTimer(hTimer); //stop timer

cudaMemcpy(pixOut, result_GPU, datasize, cudaMemcpyDeviceToHost); //copy memory back

After executing, the timer says 3 ms while the global timer of the code says 167 ms. Is this normal, or am I doing something wrong? Because if it takes this long, I can do it faster on the main CPU!

I’m using a GeForce 8600M GS (laptop) with 256 MB of memory and an Intel Centrino Duo at 2.5 GHz.

Are there faster ways to copy memory than the cudaMemcpy function?

Thanks

Tom

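Your timer actually measures only the time for the kernel call itself. Kernel launches are asynchronous, so when the timer stops the kernel is still executing. The synchronization point between the CPU and the GPU is the cudaMemcpy call, which means the kernel's run time ends up counted as part of the copy back. My suggestion: use CUDA for more complex things, or do all the work on the GPU. Why do you need your image back on the CPU?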
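To see how long the kernel itself really runs, the host has to wait for the GPU before reading the timer. Here is a minimal sketch using CUDA events. It is not the original poster's code: copyPixels is a hypothetical stand-in for the Filter kernel, and timeKernelMs, the block size of 256 and the single-channel image layout are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the Filter kernel: copies each pixel unchanged.
__global__ void copyPixels(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Returns the GPU execution time of the kernel in milliseconds,
// or -1.0f if any CUDA call fails. (Hypothetical helper.)
float timeKernelMs(int sizex, int sizey)
{
    const int n = sizex * sizey;          // one byte per grayscale pixel
    unsigned char* in = nullptr;
    unsigned char* out = nullptr;
    if (cudaMalloc(&in, n) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&out, n) != cudaSuccess) { cudaFree(in); return -1.0f; }
    cudaMemset(in, 0, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover the image

    cudaEventRecord(start);                          // enqueued before the kernel
    copyPixels<<<blocks, threads>>>(in, out, n);     // returns to the host immediately
    cudaEventRecord(stop);                           // enqueued after the kernel
    cudaEventSynchronize(stop);                      // host waits until the kernel is done

    float ms = 0.0f;
    cudaError_t err = cudaGetLastError();            // catches launch errors
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

Calling timeKernelMs(1024, 1024) from main would report the kernel time without any of it leaking into a later cudaMemcpy. If you would rather keep a host-side timer, calling cudaDeviceSynchronize() right after the kernel launch and before stopping the timer should have a similar effect.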

Thanks, that explains a lot. It was just a basic test program I was making. I need to do something with CUDA for school, but they still have to tell me what needs to be done.