I’ve written a program that interpolates image pixel data and assigns the result to an output array. I ran it on a lower-end computer with an 8600 and the CUDA function took about 3.6 seconds to complete. I did the same test on a higher-end computer with a GTX 295 and the function took about 6.1 seconds. I’m mainly looking for some general answers as to whether this should be possible, and where I should look for the reason it’s happening (execution config, memory management, etc.).
Are there enough thread blocks to utilize the new GPU? Try compiling with ‘-arch sm_13’. To fully load your GTX 295 you need 2 host threads, each executing a kernel, because the GTX 295 actually contains 2 GPUs.
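To illustrate the two-host-threads point: with the CUDA runtime of that era, each host thread had its own device context, so driving both halves of a GTX 295 meant one thread per device, each calling cudaSetDevice before doing any CUDA work. This is only a sketch — the kernel name `interpolate`, its body, and the half-and-half data split are placeholders, not the poster’s actual code:

```cuda
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdlib.h>

// Placeholder kernel: stands in for the poster's interpolation code.
__global__ void interpolate(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // real interpolation would go here
}

struct Job { int device; const float *h_in; float *h_out; int n; };

static void *worker(void *arg)
{
    struct Job *job = (struct Job *)arg;
    cudaSetDevice(job->device);          // bind this host thread to one GPU

    float *d_in, *d_out;
    size_t bytes = (size_t)job->n * sizeof(float);
    cudaMalloc((void **)&d_in,  bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, job->h_in, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (job->n + threads - 1) / threads;
    interpolate<<<blocks, threads>>>(d_out, d_in, job->n);

    cudaMemcpy(job->h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return NULL;
}

int main(void)
{
    int n = 1 << 20;
    float *in  = (float *)malloc(n * sizeof(float));
    float *out = (float *)malloc(n * sizeof(float));

    // Split the work in half, one half per GPU of the GTX 295.
    struct Job jobs[2] = {
        { 0, in,         out,         n / 2 },
        { 1, in + n / 2, out + n / 2, n / 2 },
    };
    pthread_t t[2];
    for (int d = 0; d < 2; ++d)
        pthread_create(&t[d], NULL, worker, &jobs[d]);
    for (int d = 0; d < 2; ++d)
        pthread_join(t[d], NULL);

    free(in);
    free(out);
    return 0;
}
```

(On CUDA 4.0 and later a single host thread can switch devices with cudaSetDevice, so the two-thread dance is no longer required.)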
Nevertheless, I would expect your code to run much faster…
If one machine uses Windows Vista and the other Windows XP, there can be dramatic performance differences.
Vista has a new driver model, which adds some extra overhead.
If you’re running Linux, forget what I just said.
I once had a program that worked fine on a GT8600 and gave weird results on a GTX 280. It turned out that I had several blocks writing to the same memory locations. The GT8600 could only run a few blocks at the same time, so there was no problem — on the GTX 280, however…
So maybe you have a bottleneck somewhere in your code that gets worse on a device capable of running more blocks. Have you tried running the profiler?
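To make the hazard above concrete, here is a minimal sketch (kernel name and reduction are mine, not from the thread) of several blocks doing an unprotected read-modify-write on the same global location. On a GPU that runs only a few blocks concurrently the race rarely bites; on a GTX 280/295, which keeps many more blocks in flight, it does:

```cuda
// BROKEN: every thread in every block updates *result without atomics,
// so concurrent blocks lose each other's writes.
__global__ void bad_sum(float *result, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        *result += in[i];   // race: unsynchronized read-modify-write
}
```

A safe version on sm_13 hardware would have each block accumulate into its own slot of a per-block output array (summed afterwards on the host or in a second kernel), since atomicAdd on float in global memory only appeared with sm_20. Integer atomicAdd is available from sm_11.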