50x slowdown on one machine vs. another

I have developed a couple of kernels for a project that processes images. On my three-year-old machine with a GTX 260, they execute just fine in around 7.8 ms (not counting the memory copies between host and device). However, when I take the same executable to my client, who has a top-of-the-line new PC with a Quadro 290, the kernel execution takes over 400 ms! (I assumed the 290 would be faster than the 260.)

I have installed the same version of drivers (190.38) and CUDA (2.3) on both machines.

Does anyone have any ideas what I should be looking at to explain this and fix it? This is not expected, is it?

Thanks for your assistance.

Your assumption that the Quadro should be faster is very, very wrong.

The Quadro NVS 290 is a low-power display board, optimized for multi-monitor displays, not compute horsepower. It's useful for things like airport flight arrival TV screens, or stock tickers on multiple monitors. These boards are small and passively cooled so they can be crammed into small form factors.
It has only 16 compute cores (SPs).

The GTX 260 has 216 cores, running at higher frequencies. It’s a “real” GPU.
There’s no comparison.
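
To put rough numbers on that, here is a minimal host-only sketch that sets the core counts quoted above against the timings from the original post. The helper names (ratio, print_comparison) are made up for illustration, and the inputs are only the figures from this thread, not measurements of my own. Core count alone does not account for the whole gap; the rest presumably comes from the lower clocks and memory bandwidth of the NVS board.

```cpp
#include <cassert>
#include <cstdio>

// Core counts as quoted in this thread.
constexpr int kGtx260Cores = 216;
constexpr int kNvs290Cores = 16;

// Kernel timings reported in the original post, in milliseconds.
// The NVS 290 figure was "over 400 ms", so 400 is a lower bound.
constexpr double kGtx260Ms = 7.8;
constexpr double kNvs290Ms = 400.0;

// Hypothetical helper: ratio of two quantities, the second must be positive.
double ratio(double larger, double smaller) {
    assert(smaller > 0.0);
    return larger / smaller;
}

// Hypothetical helper: print the core-count ratio next to the observed slowdown.
void print_comparison() {
    const double core_ratio = ratio(kGtx260Cores, kNvs290Cores);
    const double slowdown = ratio(kNvs290Ms, kGtx260Ms);
    std::printf("core-count ratio:  %.1fx\n", core_ratio);
    std::printf("observed slowdown: %.1fx\n", slowdown);
}
```

Calling print_comparison() from main shows the core-count ratio is far smaller than the observed slowdown, so a kernel tuned on the GTX 260 should not be expected to scale down gracefully to a display-oriented board.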

Ah, thank you. I knew that the 285 and 295 were high performance, but I hadn’t heard of the 290, and just assumed it was similar because the numbering is so close to the others.

NVIDIA model numbering (especially across brands, like between GeForce and Quadro) is very confusing. I find it helpful to check out this page to look up the capabilities of unfamiliar models: