How can I verify that a kernel is executed on the GPU? Performance measurement of a CUDA kernel

Hello everyone,

I have written a small program which performs motion estimation between two consecutive bitmaps. I have programmed one part of the algorithm in two versions, one targeting the CPU and one, written in CUDA, targeting the GPU. This part is a simple comparison of two 16 x 16 pixel blocks.

The results of a small benchmark of the two versions were astonishing: the CPU version is 800 times faster than the CUDA version.

My first guess is that the CUDA code is not actually executed on the GPU, so here are my questions:

How can I verify that the kernel really runs on the GPU?
Are there any tools for measuring GPU utilization, similar to the Task Manager?

Many thanks for any replies.

Someone correct me if I'm wrong, but I think the kernel always runs on the device (GPU) unless it was compiled in emulation mode (check that you're not doing emulation builds). Also, maybe your code is not written very well? Care to post it?
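One quick way to rule out emulation (in old toolkits it was a compile-time option, `nvcc -deviceemu`) is to query the runtime for the active device; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA device found: %s\n", cudaGetErrorString(err));
        return 1;
    }
    int dev = 0;
    cudaGetDevice(&dev);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    // A real GPU reports its hardware name and compute capability here.
    printf("Running on device %d: %s (compute %d.%d)\n",
           dev, prop.name, prop.major, prop.minor);
    return 0;
}
```

If this prints your actual GPU model, kernel launches in the same build will go to that device.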

I am almost sure that my code is not written very well. There is too much copying between the CPU and the GPU. In future development steps I will reduce the copying.
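For what it's worth, the usual way to cut down the copying is to transfer both full frames once and let one kernel handle all 16 x 16 blocks, instead of copying block by block. A minimal sketch under that assumption (all names and the SAD metric are illustrative, not the original code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One 256-thread block per 16x16 pixel block computes the sum of
// absolute differences (SAD) between the two frames.
__global__ void blockSAD(const unsigned char* a, const unsigned char* b,
                         unsigned int* sad, int width) {
    __shared__ unsigned int partial[256];
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    int t = threadIdx.y * 16 + threadIdx.x;
    partial[t] = abs((int)a[y * width + x] - (int)b[y * width + x]);
    __syncthreads();
    // Simple tree reduction over the 256 per-pixel differences.
    for (int s = 128; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0) sad[blockIdx.y * gridDim.x + blockIdx.x] = partial[0];
}

int main() {
    const int W = 64, H = 64;                  // toy frame size
    const int bx = W / 16, by = H / 16;
    size_t frameBytes = (size_t)W * H;
    size_t sadBytes = (size_t)bx * by * sizeof(unsigned int);

    unsigned char *h_a = (unsigned char*)malloc(frameBytes);
    unsigned char *h_b = (unsigned char*)malloc(frameBytes);
    unsigned int  *h_sad = (unsigned int*)malloc(sadBytes);
    for (size_t i = 0; i < frameBytes; ++i) {
        h_a[i] = i & 0xFF;
        h_b[i] = (i + 1) & 0xFF;
    }

    unsigned char *d_a, *d_b; unsigned int *d_sad;
    cudaMalloc(&d_a, frameBytes);
    cudaMalloc(&d_b, frameBytes);
    cudaMalloc(&d_sad, sadBytes);

    // Copy both full frames once, not one 16x16 block at a time.
    cudaMemcpy(d_a, h_a, frameBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, frameBytes, cudaMemcpyHostToDevice);

    dim3 grid(bx, by), threads(16, 16);
    blockSAD<<<grid, threads>>>(d_a, d_b, d_sad, W);

    // One copy back for all block results.
    cudaMemcpy(h_sad, d_sad, sadBytes, cudaMemcpyDeviceToHost);
    printf("SAD of first block: %u\n", h_sad[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_sad);
    free(h_a); free(h_b); free(h_sad);
    return 0;
}
```

Two memcpys in and one out per frame pair, regardless of how many blocks you compare.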

But your hint about the emulation mode was right.
Now my CUDA code is only 30 times slower than the CPU code.
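To see where the remaining 30x goes, it may help to time the kernel separately from the copies using CUDA events; a minimal sketch (`compareBlocks` stands in for your comparison kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compareBlocks() { /* hypothetical placeholder kernel */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events around the launch only, so host<->device
    // copies are excluded from the measurement.
    cudaEventRecord(start);
    compareBlocks<<<1, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If the kernel time is small but the total time is large, the transfers dominate and reducing the copying will pay off most.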

Thank you very much! :clap: