Quadro FX 1700 vs. Tesla C1060: how much performance gain can I expect?

Hi there,

I have a CUDA application that runs 10 jobs/mcs on my workstation (Xeon E5405 @ 2.00 GHz; Quadro FX 1700: 32 cores, 4 multiprocessors, 0.92 GHz clock rate). I am thinking about moving to a better GPU such as the Tesla C1060 (240 stream processor cores at 1.3 GHz).
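For what it's worth, a back-of-the-envelope sketch of the raw compute ratio between the two cards, using only the core counts and clocks quoted above (this ignores memory bandwidth, PCIe, occupancy, and everything else, so treat it as an upper bound, not a prediction):

```python
# Raw compute throughput ratio from the quoted specs only.
fx1700_cores, fx1700_clock_ghz = 32, 0.92   # Quadro FX 1700
c1060_cores, c1060_clock_ghz = 240, 1.30    # Tesla C1060

ratio = (c1060_cores * c1060_clock_ghz) / (fx1700_cores * fx1700_clock_ghz)
print(f"raw compute ratio: {ratio:.1f}x")   # roughly 10.6x at best
```

Real-world gains will land somewhere below this, depending on what the kernel is actually limited by.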

How much performance gain can I expect? I know it depends on a lot of other factors, but can we make a safe rough prediction?

Thanks

Is your CUDA kernel limited by memory bandwidth or floating point performance? That tells you which ratio to consider. (And how about PCI-Express performance? That won’t get any better with the Tesla.)

Also, does your kernel have enough blocks to scale to the greater number of multiprocessors on the Tesla?

My profiler tells me that my kernel takes 97% of the total time, so I think PCI-Express is not the problem.

I'm not sure whether my kernel is bound by memory bandwidth or floating point performance (I do have lots of global memory lookups and atomic operations…). How do I identify which one it is?

I do need more threads running simultaneously. Currently I can only assign about 64 blocks of 256 threads each, because every thread uses 25 registers. Ideally I need billions of "independent" runs.
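A quick register-pressure sketch for those numbers, assuming the documented register file sizes for each architecture (8192 registers and 768 resident threads per multiprocessor at compute capability 1.1 like the FX 1700; 16384 registers and 1024 threads at compute capability 1.3 like the C1060). This is a simplification and ignores allocation granularity, but it shows why the C1060 can keep more of your blocks resident:

```python
# Register-limited blocks per multiprocessor, 25 regs/thread, 256 threads/block.
regs_per_thread = 25
threads_per_block = 256
regs_per_block = regs_per_thread * threads_per_block  # 6400 registers

for name, regfile, max_threads in [("FX 1700 (cc 1.1)", 8192, 768),
                                   ("C1060  (cc 1.3)", 16384, 1024)]:
    blocks_per_sm = regfile // regs_per_block
    resident = min(blocks_per_sm * threads_per_block, max_threads)
    print(f"{name}: {blocks_per_sm} block(s)/SM, "
          f"~{resident / max_threads:.0%} occupancy")
```

So under these assumptions the C1060 fits twice as many of these blocks per multiprocessor, on ten times as many multiprocessors.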

I'm not sure of a good way to do this in general. If you can figure out the number of FLOPS and the number of bytes read/written in global memory, you can compare those to the theoretical maximums for your card. Whichever is closer to the max is probably the bottleneck.
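A sketch of that ratio test, with made-up kernel measurements; the peak figures are approximately the published specs for the Tesla C1060 (substitute your own card's numbers):

```python
# Compare achieved compute and bandwidth against the card's peaks.
peak_gflops = 933.0   # ~ single-precision peak, Tesla C1060
peak_gbps   = 102.0   # ~ memory bandwidth, Tesla C1060

# Hypothetical measurements for one kernel launch:
flops_done  = 5.0e9   # floating point ops executed
bytes_moved = 8.0e9   # bytes read + written in global memory
kernel_time = 0.10    # seconds

compute_frac   = flops_done / kernel_time / (peak_gflops * 1e9)
bandwidth_frac = bytes_moved / kernel_time / (peak_gbps * 1e9)

bottleneck = "memory" if bandwidth_frac > compute_frac else "compute"
print(f"compute: {compute_frac:.0%} of peak, "
      f"memory: {bandwidth_frac:.0%} of peak -> {bottleneck}-bound")
```

With numbers like these, the kernel is nowhere near compute peak but close to the bandwidth ceiling, which would point at memory as the limiter.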

Alternatively, in simple kernels, I've taken the section of code which does the main calculation, or the part that does the main read or write, and copied it so the kernel does it twice. Then I measure how much the kernel time goes up.
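The interpretation of that duplication experiment can be sketched like this (the timings and the 0.8 threshold below are hypothetical, just to show the logic): if doubling a section roughly doubles the runtime, that section dominates; if the runtime barely moves, the bottleneck is elsewhere.

```python
def classify(baseline_ms, doubled_ms, dominant=0.8):
    """Label the duplicated section by how much extra time it added."""
    frac = (doubled_ms - baseline_ms) / baseline_ms
    return "dominant" if frac >= dominant else "minor"

print(classify(10.0, 19.0))  # runtime nearly doubled -> "dominant"
print(classify(10.0, 11.0))  # runtime barely moved   -> "minor"
```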

You mention atomic operations, and that gets tricky. I have no idea how atomic memory performance scales between cards…