Emulation mode, double precision and power issues

har91 · January 11, 2009, 12:37pm

While this is an old topic, now that we can buy double precision cards it may be addressable differently.

The larger picture:
(obviously none of this is an issue for single precision cards)

We are trying to add double precision CUDA support to our classrooms (computer labs) as well as to our clusters. At the beginning this task seemed trivial, but after going through many iterations of cards vs workstations vs servers, the solution remains elusive.

a, yes, we can do S1070 - but due to the price, not in large quantities.

b, C1060 - for labs, you need an extra card to drive the display, so you need a display card that works with the same driver and a workstation that has enough power to drive both. In addition, as far as a cluster solution goes, most rackable servers have half-length PCI slots. Any tested C1060 workstations out there other than from specialty stores like Amex?

c, gtx280 and gtx285 - these may be okay for workstations. Looks like a Dell R5400 could handle them, but there is no assurance, and it has that weirdness about wiring the PCI 16x as 8x. Any tested workstations out there that use these cards for sure?

so we come to emulation.

a, no NVIDIA card - do we need to download the driver specific to the card “in mind” to develop and do emulation?
b, with an existing NVIDIA card - like a Mac laptop, or a server with an FX1700 - if you wanted to do double precision in emulation, can this be solved or are you stuck with the driver appropriate to the card?

best,
j

Ocire · January 11, 2009, 1:20pm

i have an old hp workstation (not sure about the name at the moment ;-)) with two single-core opterons and ddr1 ram. i doubt that this thing has a “real” x16 pci-e slot, but my gtx260 just works.
believe me, you don’t have to worry about that. every slot you can fit that card into will do the job. no weirdness, no problems, no nothing.
as for the power problem, see my answer in your other thread.
now to the emulation: i’ve never tried double precision. :-(

seibert · January 11, 2009, 2:45pm

I did a little bit of looking at this for our rackmount cluster, and found that none of our 1 or 2U servers could supply enough power to drive a computationally-useful CUDA device (even a single slot card like the 9800 GT).

In fact, this issue tends to plague pre-built workstations as well: many of them lack the power supply capacity to drive a double-slot card like the GTX 260/280/Tesla C1060. I’ve hand-built our CUDA workstations in order to ensure the power and cooling were adequate for two double-slot cards.

Emulation is not nearly as hardware-specific as the name makes it sound. In reality, emulation mode just compiles your device code to run natively on the CPU and spawns a bunch of host threads to do the work. It makes no attempt to “emulate” the timing or hardware behavior of a real CUDA device. Correct operation of your code in emulation is strongly correlated with correct operation on real hardware. However, there are cases where race conditions in device code will not manifest in emulation, but will lead to incorrect behavior when run on a real CUDA device. It is harder to write code that works correctly on real hardware but fails in emulation, but it is possible (usually it involves making assumptions about warpSize). In addition, due to differences in the way floating point is handled on the CPU, you may not get exactly the same numerical results as code run on the GPU (but that’s a bigger issue than CUDA, really).
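
To make the race-condition point concrete, here is a minimal host-side sketch in plain C++. It is a toy model under stated assumptions, not how emulation mode is actually implemented: it assumes the emulator effectively lets each thread run up to the next barrier in thread order, while real hardware runs the lanes of a warp together and may schedule warps in any order. The kernel body is hypothetical (a neighbour difference through shared memory with the `__syncthreads()` between the two phases left out), and every name here (`phase1`, `phase2`, `run_serial`, `run_warps`, `kToyWarpSize`) is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: a "warp" of 4 lanes instead of 32, to keep the example small.
constexpr std::size_t kToyWarpSize = 4;

// Body of a hypothetical kernel, split at the point where a __syncthreads()
// is missing. Phase 1 stages the input in "shared memory".
inline void phase1(std::size_t t, const std::vector<int>& in,
                   std::vector<int>& shared) {
    shared[t] = in[t];
}

// Phase 2 reads the value staged by the left neighbour (thread t - 1).
inline void phase2(std::size_t t, const std::vector<int>& shared,
                   std::vector<int>& out) {
    out[t] = (t == 0) ? shared[t] : shared[t] - shared[t - 1];
}

// Emulation-like schedule (an assumption): each thread runs both phases
// to completion before the next thread starts, in thread order.
std::vector<int> run_serial(const std::vector<int>& in) {
    // Shared memory is zero-filled here; on a real device it is uninitialised.
    std::vector<int> shared(in.size(), 0), out(in.size(), 0);
    for (std::size_t t = 0; t < in.size(); ++t) {
        phase1(t, in, shared);
        phase2(t, shared, out);
    }
    return out;
}

// Hardware-like schedule (also an assumption): lanes of one warp execute
// each phase together; whole warps run in the order given by warp_order.
std::vector<int> run_warps(const std::vector<int>& in,
                           const std::vector<std::size_t>& warp_order) {
    assert(in.size() % kToyWarpSize == 0);
    std::vector<int> shared(in.size(), 0), out(in.size(), 0);
    for (std::size_t w : warp_order) {
        const std::size_t begin = w * kToyWarpSize;
        const std::size_t end = begin + kToyWarpSize;
        for (std::size_t t = begin; t < end; ++t) phase1(t, in, shared);
        for (std::size_t t = begin; t < end; ++t) phase2(t, shared, out);
    }
    return out;
}
```

With the input {1, 3, 6, 10, 15, 21, 28, 36}, `run_serial` and `run_warps` with warp order {0, 1} both produce the expected neighbour differences 1 through 8. With warp order {1, 0}, the first thread of the second warp reads its neighbour’s slot before the first warp has written it, so the element at index 4 comes out wrong. The serial schedule never exposes the missing barrier, which is the sense in which a race can hide in emulation.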

As far as development goes, emulation is handy for checking your algorithm’s basic operation, printf-tracing of the code flow, and checking for memory access errors using standard tools like valgrind.
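
On the floating-point caveat above, one common reason CPU and GPU results differ (not necessarily the only one, and not specific to emulation) is that a parallel reduction adds the values in a different order than a sequential loop, and floating-point addition is not associative. A minimal sketch in plain C++, assuming IEEE single precision with no extended-precision intermediates (e.g. x86-64 with SSE); the function names are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right accumulation, the way a simple CPU loop would sum.
float sum_sequential(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Pairwise (tree) accumulation over the half-open range [lo, hi),
// the order a typical parallel reduction would use.
float sum_pairwise(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo == 0) return 0.0f;
    if (hi - lo == 1) return v[lo];
    const std::size_t mid = lo + (hi - lo) / 2;
    return sum_pairwise(v, lo, mid) + sum_pairwise(v, mid, hi);
}
```

For the four values {1e8f, 1.0f, -1e8f, 1.0f} the exact sum is 2, but under the stated assumptions the two orders round differently and disagree with each other as well as with the exact answer. Neither result is a bug in the code; it is the kind of mismatch to expect when comparing emulation output against a real card.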

har91 · January 11, 2009, 3:04pm

While it is not pretty, I was looking at the modular power supplies TigerDirect is selling. For $30 you can add a 600W mini power supply that can drive the above cards. In a machine room I could do this, as we have full control of the space, but in student labs that may be uncool.

I think we’ll do a pair of S1070s in the machine room with any 1U server.

And take a chance in labs with the Dell Tr5400 and order a few different cards and see how it goes. I assume by Spring vendors will up the power supplies to accommodate the new cards.

Previous poster had it right - just go for it and try.

j

E.D_Riedijk · January 11, 2009, 7:12pm

If you search on YouTube for nvidiatesla, you get lots of videos from SC’08. One is a guy from Dell explaining that they will offer workstations with a C1060 in them. So I think that may be your best option; it will be fully supported by Dell.

alex_dubinsky · January 12, 2009, 4:51pm

There is nothing wrong with using a gtx280 in a workstation (it is a C1060 with less RAM), and x8 isn’t much of a problem either. If you need a pre-built Dell that will handle it, go for one of the gaming models. There’s nothing wrong with those either, they just don’t have ECC (which you don’t need for a lab computer that you’ll be restarting every night). However, auxiliary PSUs are a great solution.

And as was said, emulation is a poor way of having your students program. It ignores many simple errors, and code that passes in emulation may very well not work on a real card. Moreover, it ignores all performance-minded aspects of CUDA. There won’t be any performance difference between a very poor implementation and a very efficient one (a three orders of magnitude difference just won’t show up). On top of everything, emulation is excruciatingly slow (for no obvious reason).

mfatica · January 12, 2009, 5:36pm

The C1060 is more workstation friendly: it needs 2 6-pin auxiliary power connectors, while the GTX 26x needs 1 8-pin and 1 6-pin power connectors.

tmurray · January 12, 2009, 6:07pm

correction: GTX 28x needs 8+6, GTX 26x needs 6+6.

E.D_Riedijk · January 12, 2009, 7:23pm

FX4800 only needs one 6-pin connector, so when power is a problem, that can help (although a bigger PSU is probably cheaper).

seibert · January 13, 2009, 7:14pm

That’s kind of surprising, given how similar the FX4800 is to the GTX 260 (with two 6-pin connectors). Is this just due to the lower clocked memory on the FX 4800?

E.D_Riedijk · January 13, 2009, 7:41pm

The FX4800 is 55nm, maybe that is the difference. Or maybe it’s because the less power-hungry chips are selected for the Quadros? I have no idea, I was just pleasantly surprised :)

seibert · January 13, 2009, 8:31pm

Ahh, I hadn’t realized the FX 4800 was using the 55 nm process. That makes sense…

tmurray · January 13, 2009, 9:35pm

Tesla C1060/S1070 have always been 55nm. Not sure about Quadro, but I think they’ve always been on 55nm as well.

E.D_Riedijk · January 14, 2009, 6:19am

?? Are you sure about that? The fact that they came out together with the GTX260 & GTX280 suggests that they use the same 65nm process.