I have written a small program that performs motion estimation between two consecutive bitmaps. I implemented one part of the algorithm in two versions: one for the CPU and one in CUDA for the GPU. This part is a simple comparison of two 16 x 16 pixel blocks.
The results of a small benchmark of the two versions were astonishing: the CPU version is 800 times faster than the CUDA version.
My first guess is that the CUDA code is not actually being executed on the GPU, so here are my questions:
How can I verify that the kernel really runs on the GPU?
Are there any tools for monitoring GPU performance, similar to the Task Manager?
Many thanks for any replies.