Upgrade advice: looking for your performance experience before shelling out

Hey all,

So now that I have my code in an almost-fully-functional condition (will it ever actually be?? – stupid pointers),
I am considering upgrading my old 8800 GTS 320MB to a GTX 275.

I am trying to justify the purchase with an idea of how much performance I will gain on my code, thus making my thesis that much more appealing, and so I have a few questions:

Background info: my code right now uses 32 registers per thread and around 14,400 bytes of shared memory per block. My most optimal config is 256 threads in 96 blocks (33% occupancy).

  1. Based on a scaling of clock speeds, number of MPs, and the increased register count, it looks like I should get around a 3x performance boost over my 8800 with the GTX 275. Can anyone else confirm similar numbers?

  2. Just considering the 275 on its own now: how much faster should a single-precision version of a code be than a double-precision one? My code is bandwidth limited. Considering that effective bandwidth is halved, and that there is only one double-precision unit per MP (versus 8 single-precision units), does that mean I will take a 1/16 performance hit in switching to double?

  3. Cooling/overclocked cards: is running an intense CUDA program more or less taxing on the card than running an intense game like Crysis? If this card is to be used for heavy CUDA work, should I get a 275 from a brand with better cooling, or does it not matter? Also, would it be risky to get a factory-overclocked card (like the BFG)? These questions all hinge on whether CUDA is more or less taxing than games (if CUDA < games, then anything good for games is good for CUDA).

Thanks everyone!

Once I do make the plunge, I’ll let you know what improvements I see.
[Of course, this won't end up being an apples-to-apples comparison, because I am also going to drop an Intel i7-920 into the rig (I currently have a Core 2 Duo E6600), so I would expect the GPU/CPU speedup ratio to shrink due to the i7 kicking butt. (I'm still only going to use one CPU thread, though.)]

Since you have the 8800 GTS with 320MB of memory, you have the G80 chip according to this chart:


That’s the oldest CUDA-capable chip, so even moving up to a newer 8800 GT would make a difference since the architecture is more refined now. That page also says that your card can theoretically do 346 GFLOPS, whereas the GTX275 can do 1010 GFLOPS.

If you move your code to double precision, you're going to get a speed of about 1/8 to 1/16 of what the single-precision code can do. There's no fixed number anyone can give you; you'll just have to test your code to see exactly what the performance loss is. I don't know what you're working on with CUDA, but depending on the scenario, you may be able to make use of "mixed precision" (search the forums for more info on that).
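To illustrate the mixed-precision idea: keep the data (and therefore the bandwidth cost) in single precision, and only accumulate in double. A minimal sketch, compiled with -arch sm_13; the kernel and variable names are illustrative, not from the original poster's code:

```cuda
// Mixed precision: float storage and loads, double accumulation.
__global__ void dotMixed(const float *a, const float *b, double *blockSums, int n)
{
    __shared__ double partial[256];          // one slot per thread (256-thread blocks)
    int tid = threadIdx.x;

    double sum = 0.0;                        // accumulator stays in double
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        sum += (double)a[i] * (double)b[i];  // data moves as float, math in double

    partial[tid] = sum;
    __syncthreads();

    // Shared-memory tree reduction down to one value per block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];  // per-block sums finished on the host
}
```

Since the global memory traffic is still all floats, a bandwidth-limited kernel like this pays almost nothing for the double-precision accumulation.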

I’ve heard that the overclocked cards are not a good idea for CUDA, I’m not quite sure why though. If I had to guess, I’d say that the overclocked core and memory may occasionally corrupt your results or something (perhaps even silently, which is worse).

I’ve been thinking about adding a GTX275 to my system for a while now, I may go ahead and take the plunge myself next week. Just make sure that your power supply is able to handle the higher load of the new card (it has a TDP of 219 watts…about 70 watts more than your 8800 GTS).

I read on some university website that if your kernel is bandwidth limited, then you will generally run at about 1/2 the speed of single precision…

If you are arithmetic-intensive, then expect roughly 1/8th speed.

In the early days of CUDA, there was a lot of discussion about this. The overclocked cards (at least in the G80 days) tended to run hotter, and above 100C, some people started seeing silent memory corruption. Using a consumer card is already a slight risk, as NVIDIA developers will tell you the Tesla gets much better quality control. Further adding to that risk by running your card over spec for a 10-20% speed improvement doesn't make much sense. You're much better off figuring out how to use multiple GPUs instead and getting a real performance win.

Also, from what I remember about the overclocked cards, they tended to mostly ramp up the shader clock, but could never do much with the memory clock. For a bandwidth-limited kernel, an overclocked card provides even less benefit.

Thanks for the help!

I haven’t gone down the path of testing double-precision yet, but I built the new computer (i7-920, 6gb ddr3, BFG gtx 275 OC) and have some results to compare with my old one (core 2 duo e6600, 4 gb ddr2, 8800gts 320mb (320-bit bus)).

Comparing the CPU-only version of the code (not deviceemu, but the original CPU code that I ported to GPU): it runs about 25% faster on the new computer. (It was recompiled as well, but with the same options.)

Now, when I run the GPU version, also recompiled and using -arch sm_13 on the GTX 275 machine, I see a ~1.4x improvement (15 seconds vs 10 seconds).
Keep in mind that although most of the code being timed runs on the GPU, some portion of it still runs on the CPU; I have not yet modified my timing to strictly capture the GPU portion.
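For isolating the GPU portion, CUDA events are the usual tool, since they time the device work without counting the surrounding CPU code. A sketch (the kernel name and argument are placeholders; the launch config is the 96x256 one mentioned above):

```cuda
// Time only the GPU work with CUDA events; host-side code is excluded.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<96, 256>>>(d_data);   // hypothetical kernel, 96 blocks x 256 threads
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);      // wait until the kernel and stop event finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
printf("GPU-only time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```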

Once I determine what % speedup is from the GPU and not the i7 I’ll post them here.

I suppose I am not seeing at least a 2.5x speedup (the ratio of MPs on the two cards) because the 8800 was relying on those extra blocks to hide latency, whereas on the 275 that no longer helps since it has a larger block 'pipeline'.
EDIT: that of course means that as I scale up to a larger problem with more blocks, I should expect to approach at least the 2.5x mark… But it doesn't… it still stays around 1.25x!!!

Run your program through the CUDA profiler. There you'll at least see whether you now have a CPU bottleneck (remember Amdahl's law :)). Have you done optimizations that are specific to the G80 architecture, e.g. aligning global memory accesses to addresses that are a multiple of 16? This can hurt performance on a G200-based card. Experiment with different grid and block sizes. Maybe you have some parameter that lets you lower the shared memory usage per block? Several smaller blocks running simultaneously on a multiprocessor can be faster than one block per processor. Also, maybe you are limited by bandwidth? In that case the maximum speedup over your old card would be more like 2x, not 3x.
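The grid/block experiments above are easy to automate with a host-side sweep. A sketch, assuming a hypothetical kernel `myKernel` whose shared memory usage scales with block size:

```cuda
// Sweep block sizes and time each launch configuration.
int candidates[] = {64, 128, 192, 256, 384, 512};
for (int c = 0; c < 6; ++c) {
    int threads = candidates[c];
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n elements

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    myKernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%4d threads/block: %.3f ms\n", threads, ms);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
}
```

The fastest configuration can differ between G80 and G200, so it's worth re-running the sweep after a hardware change.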

Oh yeah, I guess I am pushing on the Amdahl barrier… It's looking like Amdahl's law says I can only get to a ~15-17x speedup. I'm actually at 14x, so I'm not too upset!

Also, now that I'm on the GTX 275, I forgot that it gives me more info in the visual profiler than the 8800 did! Looks like I'm getting 3GB/sec, lol. Oh well, it's a horribly inefficient algorithm (by nature) for coalescing: any thread can access any memory location of the input data, depending on previously drawn random numbers.
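For random, data-dependent reads like that, one common workaround on G80/G200-class hardware was to read the input through a texture, since texture fetches go through a cache and have no coalescing requirement. A hedged sketch with illustrative names:

```cuda
// Read randomly-indexed input through the texture cache; texture fetches
// don't need coalescing, which helps data-dependent access patterns.
texture<float, 1, cudaReadModeElementType> inputTex;  // 1D texture reference

__global__ void gather(float *out, const int *indices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(inputTex, indices[i]);    // cached random read
}

// Host side, before the launch:
//   cudaBindTexture(0, inputTex, d_input, n * sizeof(float));
```

Whether this helps depends on how much locality the random indices have; with no reuse at all, the cache buys little.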

Thanks for the input…

I think my next step, once I'm ready, is to figure out whether I can reduce the amount of serialized work that gets done. I'm just not happy about doing it, because it is an area where the threads DO depend upon each other. I know it can be done; I've just been avoiding it thus far.

EDIT: I want to add that the GTX275/i7 code is now up to 2.2x as fast as the 8800GTS/C2Duo code. I have not yet attacked the serial code; what I discovered was that I had some functions in the source, like cos(), that were not cosf(). So when I compiled with -arch sm_13, they were treated as the double-precision functions instead of being converted to the single-precision equivalents. That was a large chunk of the slowdown. Phew!
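For anyone else hitting this: before sm_13, the compiler silently demoted doubles to floats, so cos() and unsuffixed literals were harmless. With -arch sm_13 they really are double precision. A minimal illustration:

```cuda
// Compiled with -arch sm_13, this runs in double precision:
__global__ void slowVersion(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = cos(x[i]) * 2.0;    // double cos() and a double literal
}

// Single precision throughout: cosf() and an f-suffixed literal.
__global__ void fastVersion(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = cosf(x[i]) * 2.0f;
}
```

Grepping the source for unsuffixed math calls (cos, sin, exp, sqrt, …) and bare literals is a quick way to find these before they cost you a factor of 8.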

This doesn’t ring any bells - what have I missed? Can you point me in the direction of further information?

I think there is a section in the programming guide about optimizing for specific architectures, but I don’t remember the section number for it and I’m away from my desk.

In any case, there are some optimizations you can make for specific architectures (G80, G92, G200, etc.) with regard to how the card accesses memory and such (certain alignments will speed your kernel up).
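The classic example of such an architecture-specific rule is global memory coalescing. A sketch of the access pattern in question (kernel name is illustrative):

```cuda
// On G80, the 16 threads of a half-warp must read consecutive words of a
// properly aligned segment for their loads to coalesce into one transaction.
// G200 relaxes this and services misaligned accesses with a smaller penalty.
__global__ void scaleCoalesced(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads hit
    if (i < n)                                      // consecutive addresses
        data[i] *= s;                               // coalesces on G80 and G200
}

// By contrast, an offset access like data[i + 1] breaks coalescing completely
// on G80 (16 separate transactions) but costs far less on G200.
```

That asymmetry is why a layout painstakingly tuned for G80 alignment rules can end up suboptimal, or at least no longer necessary, on a G200 card.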