I’m very new to CUDA programming, so I have a basic question. I’m trying to write a program that filters a 2D picture. The kernel works, but after some testing I found that copying the memory to and from the device takes much longer than executing all the threads on the device.
For a 1024x1024 picture (each pixel one byte, grayscale) it takes about 3 ms to execute the kernel and 2x80 ms to copy the memory (only 1 MB).
Some of the code:
cudaMemcpy(pixelsIn_GPU,pixIn, datasize, cudaMemcpyHostToDevice); //copy the pixIn received from the picture to the device
cutStartTimer(hTimer); //start timer
filterKernel<<<gridSize, nThreads>>>(pixelsIn_GPU, result_GPU); //launch the kernel (kernel name and grid size assumed; the original line is truncated)
cutStopTimer(hTimer); //stop timer
cudaMemcpy(pixOut, result_GPU, datasize, cudaMemcpyDeviceToHost); //copy memory back
After executing, the timer says 3 ms while the global timer of the code says 167 ms. Is this normal, or am I doing something wrong :"> ? Because if it takes this long, I can do it faster on the main CPU!
I’m using a GeForce 8600M GS (laptop) with 256 MB of memory and an Intel Centrino Duo at 2.5 GHz.
Are there faster ways to copy memory than the cudaMemcpy function?
Your timer actually measures just the time for the kernel call. Kernel launches are asynchronous, so after the timer stops the kernel is still executing; the synchronization point between CPU and GPU is the cudaMemcpy call. My suggestion: use CUDA for more complex things, or do all the work on the GPU. Why do you need your image back on the CPU?
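To time only the kernel (and not just the asynchronous launch), you can use CUDA events, which are recorded on the GPU itself. A minimal sketch, assuming the kernel name `filterKernel` and the launch parameters `gridSize`/`nThreads` from the snippet above:

```cuda
#include <cstdio>

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // marker before the launch
filterKernel<<<gridSize, nThreads>>>(pixelsIn_GPU, result_GPU);
cudaEventRecord(stop, 0);                  // marker after the launch
cudaEventSynchronize(stop);                // block until the kernel has actually finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time between the two events
printf("kernel: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Alternatively, calling cudaThreadSynchronize() before cutStopTimer(hTimer) would make the existing host-side timer include the full kernel execution, not just the launch.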