I have developed a couple of kernels for a project that processes images. On my three-year-old machine with a GTX 260, they execute just fine in around 7.8 ms (not counting the memory copies between host and device). However, when I take the same executable to my client, who has a top-of-the-line new PC with a Quadro 290, the kernel execution takes over 400 ms! (I assumed the 290 would be faster than the 260.)
I have installed the same version of drivers (190.38) and CUDA (2.3) on both machines.
Does anyone have any idea what I should look at to explain and fix this? This is not expected behavior, is it?
Thanks for your assistance.