I have just started using CUDA for some two-dimensional pattern-matching experiments (locating a two-dimensional pattern in a two-dimensional text). I run the code inside a kernel and the same code on the CPU to measure the performance difference. With just one thread I get something like this:
Kernel: 0.000027 seconds
Cpu: 0.003368 seconds
Is it normal to get such a performance increase with just one thread? I get correct results; I am calling __syncthreads() at the end of the kernel and cudaThreadSynchronize() before reading the timer again.
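For reference, a more robust way to time a kernel is with CUDA events, which are recorded on the device and do not depend on the host timer. A minimal sketch, where match2d stands in for the actual pattern-matching kernel (the name and parameters are placeholders):

```cuda
#include <cstdio>

// Hypothetical kernel; stands in for the 2D pattern-matching code.
__global__ void match2d(const char *text, const char *pattern, int *result)
{
    // ... pattern-matching work ...
    __syncthreads();
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    match2d<<<1, 1>>>(NULL, NULL, NULL);  // one thread, as in the experiment
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel: %f seconds\n", ms / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Because the elapsed time is taken between two device-side events bracketing the launch, this measures the kernel itself rather than just the asynchronous launch call.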
Check for errors in the kernel launch. A trivial kernel launch usually costs around 30 to 40 microseconds, and yours is about 30 microseconds, so it looks like you are having a kernel launch error.
Use cudaGetLastError() or a similarly named API call right after the launch.
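In practice that check looks something like the following, where match2d and the launch configuration are placeholders for the actual code:

```cuda
// After the launch, check both launch errors and execution errors.
match2d<<<grid, block>>>(d_text, d_pattern, d_result);

cudaError_t err = cudaGetLastError();        // catches launch failures
if (err != cudaSuccess)
    printf("Launch error: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();               // catches errors during execution
if (err != cudaSuccess)
    printf("Execution error: %s\n", cudaGetErrorString(err));
```

The second check matters because kernel launches are asynchronous; an error inside the kernel only surfaces at the next synchronizing call.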
The code was fine. I verified that the values were correct by running under the emulator with printf's, and cudaGetLastError() reported no error.
As soon as I wrote a value from the kernel into a shared variable after the __syncthreads() call, cutGetTimerValue() started measuring the correct time:
Kernel: 0.729799 seconds (for 1 thread)
Cpu: 0.003386 seconds
Why does this behavior exist? Does the kernel always have to return some form of data to the host program?
Hmm, maybe you're right, although that's not the "correct" way to optimize code. Which compiler is CUDA using? GCC definitely doesn't have this behavior.
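The likely explanation is dead-code elimination: if a kernel never writes its results anywhere observable, nvcc is free to remove the computation entirely, so only the launch overhead remains and the timing looks absurdly fast. A sketch of the difference, with hypothetical kernel names and a made-up workload:

```cuda
// Version 1: the result is never stored, so the compiler may eliminate
// the whole loop; the kernel body compiles to (almost) nothing.
__global__ void match2d_dead(const char *text, const char *pattern)
{
    int matches = 0;
    for (int i = 0; i < 1000000; ++i)
        matches += (text[i % 100] == pattern[i % 10]);
    // 'matches' is unused: dead code.
}

// Version 2: storing the result to memory makes the computation
// observable, so the compiler must keep it and the measured time is real.
__global__ void match2d_live(const char *text, const char *pattern, int *out)
{
    int matches = 0;
    for (int i = 0; i < 1000000; ++i)
        matches += (text[i % 100] == pattern[i % 10]);
    *out = matches;  // side effect keeps the work alive
}
```

So the kernel does not have to return data in general, but work whose results are unreachable can be optimized away, which is what made the first measurement meaningless.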