How to run Fermi and non-Fermi cards in the same system? In other words, how to migrate to CUDA 3.0

Hi All,

I’m running a system with three GTX 295 cards on the CUDA 2.3 package. Now I have a GTX 480 and would like to compare the performance of the GTX 295 and the GTX 480 on my tasks, in order to find out whether upgrading to Fermi-based cards makes sense.

What I’m going to do is remove one GTX 295 and install the GTX 480 in its place, so I can run the same benchmarking code on the different cards simultaneously. However, before actually touching a system that works, I would appreciate any suggestions on how to do this the right and painless way.

If someone has already tried something similar, could you please share your experience?

Thanks in advance.

  1. It could be sensitive to your kind of work.

  2. If you are running a multi-GPU setup, you need to load-balance correctly, or your chain is only as good as its weakest link (a rough split is sketched below).
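
A minimal sketch of such a split, assuming you statically divide the work in proportion to each card's measured throughput (the weights and work size below are made-up placeholders, not measurements):

#include <cstdio>

int main()
{
    // Hypothetical relative throughputs you would measure with your own benchmark,
    // e.g. one half of a GTX 295 vs. a GTX 480.
    const int nDevices = 2;
    double weight[nDevices] = { 1.0, 1.6 };
    long   totalWork = 1000000;               // elements to process

    double sum = 0.0;
    for (int i = 0; i < nDevices; ++i)
        sum += weight[i];

    long assigned = 0;
    for (int i = 0; i < nDevices; ++i) {
        // Give the last device the remainder so every element is covered.
        long chunk = (i == nDevices - 1) ? totalWork - assigned
                                         : (long)(totalWork * weight[i] / sum);
        printf("device %d processes %ld elements\n", i, chunk);
        assigned += chunk;
    }
    return 0;
}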

You shouldn’t run into any trouble, software-wise. Drivers, runtime, and everything else fully support mixed systems like this. You will need to run deviceQuery to determine the device index for the GTX 480, or use cudaChooseDevice in your app.
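
For illustration, a minimal sketch of that selection step (checking the compute capability is just one way to spot the GTX 480; the actual device indices depend on slot order):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // Enumerate devices, much like deviceQuery does, and remember the Fermi one.
    int fermiIndex = -1;
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute %d.%d)\n", i, prop.name, prop.major, prop.minor);
        if (prop.major >= 2)            // the GTX 480 reports compute capability 2.0
            fermiIndex = i;
    }

    if (fermiIndex >= 0)
        cudaSetDevice(fermiIndex);      // run the benchmark on the GTX 480

    // Alternative: let the runtime pick the closest match to a requested property set.
    // cudaDeviceProp want = {0};  want.major = 2;
    // int dev;  cudaChooseDevice(&dev, &want);  cudaSetDevice(dev);
    return 0;
}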

The one thing you will want to be careful of is power requirements. The GTX 480 may pull more power than a 295; I don’t know off the top of my head. But if you’ve got a stable tri-GTX 295 system, you probably already have a beast of a power supply :)

It seems that CUDA applications do not load the GPU as heavily as games or special burn-in tests, so my 1000 W PSU appears to be enough for the whole system with three GTX 295 cards :-)

Must the .cu code be compiled with some special options to make it work on both GTX 295 and GTX 480 hardware? The code is ‘old’; it knows nothing about CUDA 3.0 features or Fermi.

Thanks in advance.

In my experience, simple code that does not involve texture/constant memory can be compiled with sm_13 and work on both GT200 and Fermi cards. Complicated code has to be compiled with sm_20 for Fermi and sm_13 for GT200.

I have a 3x GTX 295 + 1x GTX 470 system here running CUDA 3.0. You can compile everything as normal and it will work on any of the devices. If you need special sm_20 features (the caching is all automatic, so you don’t have to do anything for that), then you have to use the -arch flag, and then the code won’t run on the GTX 295 anymore.
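
If you do want a single binary that carries native code for both generations, nvcc can build a fat binary; something like this (my_kernel.cu and my_app are placeholder names):

nvcc -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20 my_kernel.cu -o my_app

Each device then picks up whichever sm_13 or sm_20 image matches it, so the GTX 295 halves and the GTX 470/480 can all run the same executable.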

Understood, thank you!

How do you feel about it - is Fermi worthwhile for your tasks?

So far, I’ve only been using it for existing code and experimenting with microbenchmarks, and it runs fine. I haven’t started writing any new code yet to take advantage of the Fermi features. All my current CUDA applications were written because they solve problems that the previous compute capabilities were already good at. Where I think Fermi will shine is on the stuff I would not have previously considered doing with CUDA.

(OK, that’s not quite true. I do have one application that I’m trying to turn into a general-purpose library for physicists; it ran fine on the GTX 285, but can make good use of Fermi’s concurrent kernel execution with no device code changes.)
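
For anyone curious, the host-side pattern for that is just launching independent kernels into separate streams; the kernel below is only a trivial placeholder. On a GTX 285 the launches serialize, while compute 2.0 hardware may overlap them:

#include <cuda_runtime.h>

// Trivial placeholder kernel; the point is only the launch pattern.
__global__ void smallKernel(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int nStreams = 4, n = 1 << 16;
    float *d_buf[nStreams];
    cudaStream_t stream[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaMalloc((void **)&d_buf[i], n * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    // Independent launches into separate streams: they run back to back on GT200,
    // but Fermi can execute several of them concurrently.
    for (int i = 0; i < nStreams; ++i)
        smallKernel<<<n / 256, 256, 0, stream[i]>>>(d_buf[i]);

    cudaThreadSynchronize();            // CUDA 3.0-era synchronization call

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
    return 0;
}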

Very much so. The “plug and play” performance improvement (running exactly the same code as on G200) yields a 60% performance boost in hoomd. I am beginning work on some Fermi-specific optimizations that are showing promise to boost performance an additional 20-30% over that.

That means the overall boost should be about 90% … not that much compared to G200. One G200 GPU has 240 cores, one GF100 GPU has 480 cores (100% more). Considering all the improvements that GF100 contains (at least according to the documentation), it is a bit disappointing to see that GF100 is no faster than the ratio of core counts would suggest.

On the contrary, a 60% plug-and-play perf boost is an amazing result. My application is entirely bandwidth-limited, and the GTX 480 only provides 11% more bandwidth than a GTX 285. The wonder of the L2 cache makes that bandwidth more effective and enables that 60% perf boost. The additional 20-30% I’m hoping for will come from the L1 cache.

If you want to correlate core counts with performance improvements, look to FLOP-limited codes.

I see.

My task is limited by computation and by texture-cache throughput, so I expect a qualitative leap with Fermi (especially from the L1 and L2 caches); I’m going to find out whether my hopes are reasonable :-)

The texture cache and the L1 / L2 caches are completely separate, I believe.

Absolutely. What I meant is that I’m currently using textures to prefetch data (the only way of prefetching available on GT200-based hardware) and am going to use the GF100 cache instead.
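
Roughly, the two read paths look like this (the gather kernels and the index-based access pattern are just an illustration of the idea, not code from my application):

// Legacy texture reference, as used for cached reads on GT200-class hardware.
texture<float, 1, cudaReadModeElementType> texData;

// GT200-style: the gather goes through the texture cache via tex1Dfetch.
__global__ void gatherTex(float *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texData, idx[i]);
}

// Fermi-style: a plain global load, cached in L1/L2 automatically.
__global__ void gatherLoad(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}

// Host side (d_out, d_in, d_idx allocated and filled elsewhere):
//   cudaBindTexture(0, texData, d_in, n * sizeof(float));
//   gatherTex <<<grid, block>>>(d_out, d_idx, n);         // GT200 path
//   gatherLoad<<<grid, block>>>(d_out, d_in, d_idx, n);   // Fermi path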

We know that the texture cache and L1 are separate. But what about L2? Do texture reads go through the L2 cache and then into the texture cache? It is not explicitly documented anywhere. However, it is the only way I can think of to explain the 60% perf boost in hoomd (which heavily uses semi-random texture reads) with only 11% more bandwidth from the GTX 285 to the 480.