Does anyone have experiences with having two 8800 GTX cards in an Intel Core 2 Duo box and running CUDA applications on them?
I have a gpgpu application (sparse matrix multiplication and some linear algebra) and was wondering if I could run one card from one core and the other card from the other core in such a way that the performance on the two cards is around twice the performance of an individual card.
Note that at first I don’t want communication between the cards (that will come later), only to run two separate instances of the same application.
The programming guide in section 4.5.2 describes the cudaSetDevice() function, which lets you decide which device each thread is going to talk to. As it mentions there, a given host thread can only talk to one GPU at a time, so to access both cards from a single app you would need to spawn a second thread and call cudaSetDevice() from each before doing anything else. With two separate apps, you just need to call cudaSetDevice() with a different value in each instance of the application.
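A minimal sketch of the one-host-thread-per-GPU approach described above, using pthreads. It assumes two CUDA devices are present; the buffer size and the kernel-launch placeholder are made up for illustration:

```cuda
// Sketch: one host thread per GPU, each bound to its own device with
// cudaSetDevice(). Assumes two CUDA-capable devices are installed.
#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

static void *worker(void *arg)
{
    int dev = *(int *)arg;

    // Must come before any other CUDA runtime call in this thread;
    // the device binding is per host thread and (in the early CUDA
    // runtime) cannot be changed afterwards.
    cudaSetDevice(dev);

    float *d_buf;
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    // ... launch your kernels here, each thread driving its own card ...
    cudaFree(d_buf);
    return NULL;
}

int main(void)
{
    int ids[2] = { 0, 1 };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

For the two-separate-applications case you would drop the threading entirely and just call cudaSetDevice(0) in one instance and cudaSetDevice(1) in the other.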
Incidentally, this brings up a question I have for the experts: Is there an easy way to discover which CUDA device is not already in use when the application starts? Eventually, we will have two cards in one machine (like nogradi is looking into) to run separate applications. It would be very handy if the application could automatically initialize whichever card is free when it starts, and bail out with an error if both cards are in use.
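As far as I know the CUDA runtime doesn't report whether another process is already using a device, so one possible workaround is to serialize access externally, e.g. with one lock file per device (open() with O_CREAT | O_EXCL is atomic on a local filesystem). The path scheme below is made up for illustration:

```cuda
// Sketch of a lock-file scheme for picking a free device, since CUDA
// itself has no "is this device in use?" query. Lock path is arbitrary.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Returns the id of a free device (and holds its lock file),
// or -1 if every device is already claimed.
int acquire_free_device(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        char path[64];
        snprintf(path, sizeof(path), "/tmp/cuda-dev%d.lock", dev);
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd >= 0) {           // nobody else holds this device
            close(fd);
            cudaSetDevice(dev);
            return dev;
        }
    }
    return -1;                   // all cards busy: bail out with an error
}
```

The application would unlink its lock file on exit; a crashed run leaves a stale lock behind, so in practice you would also want some cleanup (e.g. storing the owner PID in the file and checking whether that process still exists).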
Thanks for the reply, it seems then that running two separate applications is pretty simple. And if one only wants to have communication between cards through CPU memory shared by the two CPU cores (probably this is the only option) then two threads each talking to its own card can be a good solution. Did you try any of this with success?
We only have one card for testing at the moment. (We are waiting for 64-bit drivers before getting a second card.) I have only used CUDA with a multithreaded app in order to do some CPU calculation in parallel with the GPU execution, but not two GPUs at once.
One guy has built a few quad core machines with 3 GPUs in each case. Scroll down in this thread:
Here’s the direct link to one of the multithreaded CUDA kernels in VMD. The VMD threads code wraps the normal pthreads functions, so you ought to be able to mentally change “vmd_thread” to “pthread_” and understand my code:
John, I guess more than 3 cards went through your hands in relation to this project, so I would like to ask about your experiences with memory failures.
When we used 110 cards (7900 GTX) in 110 nodes (no communication between nodes or cards) with OpenGL + Cg, we observed a disturbingly high number of memory failures. From the first shipment, around 30-40 had problems. These were sent back to Gigabyte; we received new ones, some of those also had problems, we sent them back, got new ones, and the whole process took 3-4 iterations.
So I was wondering if you had experiences with a large number of 8800 GTX cards if you’ve seen any memory failures.
Are your memory problems actually hardware, or could these be driver or kernel glitches that just happen to corrupt memory? With the complexity of software these days, it wouldn’t surprise me if a linux or windows kernel or driver bug could manifest itself in terms of memory corruption.
I don’t have any data on memory failures for really large numbers of cards. We have something on the order of 65 NV cards in our lab. 80% of them are GeForce 6800s, the other 20% are 7900s and 8800s. Out of all of those cards, I think we’ve had one hardware failure in the last 3 years, if I recall correctly.

The vast majority of these cards are being used for visualization with VMD, where the framebuffer memory ends up holding large volumetric texture maps for electrostatic potential maps, density maps, or other large volumetric data, and for high-resolution multisample or stereo display modes. I’d been waiting for CUDA to come out before seriously going after GPGPU, since I already had enough fun debugging complex shaders and didn’t want to be subject to the whims of shader compilers for doing real scientific arithmetic. At present we’re only planning on using CUDA for this stuff, which means only the GeForce 8800 class cards are going to get pounded with GPGPU arithmetic.

If you guys are having issues with memory reliability, you may want to think about going for the Quadro series cards for really long running computations where that’s more important. My understanding is that NVIDIA tests and certifies the Quadro series hardware themselves, whereas the GeForce hardware is tested by the vendor/brand (e.g. Gigabyte), presumably with less stringency. I think that the extra testing is one of the reasons the Quadro hardware is priced higher. I’m sure that someone else knows much more about all this than I do, so don’t take my answer as even remotely definitive; you should probably ask the NVIDIA guys about this specifically.
Unless the whole system is protected with ECC, the error rate for any long running computation can be pretty scary. eBay had to replace a bunch of CPUs in their huge Sun servers many years ago due to cosmic ray hits corrupting data. The chips had ECC in all but one tiny place in the CPU, and of course that was the cause of their problems. I think they were detecting corruption at a rate of once per month or two, as I vaguely recall. The one good thing about the GPUs is that they run so much faster than the CPUs that the time component of the equation is hopefully very short :-)
I’m not worried about those cards anymore; what I was wondering is whether the same issues would arise with the 8800s. So far we have a couple of those and they seem fine.
John, I guess if you use the cards for visualization only you will never see a problem because nobody can notice if a bit in one of the color components of a pixel is flipped. But we use it for gpgpu stuff where it really matters.
Interesting discussion. Did you ever write a more sophisticated test code to determine whether you had boards that were giving errors on particular memory cells, or whether it was entirely random? If you had hardware faults, I would expect that a pattern would begin to emerge. With CUDA it should be far easier to write various test programs akin to cpuburn and memtest86, which are handy tools for testing cluster nodes before using them for real science. If you don’t want to spend the bucks for certified cards (e.g. Quadro), you could add code to do consistency checks or memory block checksumming periodically as calculations progress, and checkpoint/restart as needed. Wait till the first petascale supercomputers come online, I can’t even imagine what their initial MTBF rates are going to be… :)
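One way the block-checksumming idea might look in practice: for device data that should stay constant between kernel launches (e.g. the sparse matrix itself), copy it back periodically and compare a simple checksum against the value recorded at upload time. The Fletcher-style sum here is just an example choice, not anything from the thread:

```cuda
// Sketch: periodic consistency check on a device-resident block.
// A mismatch means the data was corrupted since upload, so the
// application can checkpoint/restart rather than produce garbage.
#include <stdint.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Simple Fletcher-style checksum over 32-bit words (example choice).
static uint32_t checksum32(const uint32_t *p, size_t n)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) {
        a += p[i];
        b += a;
    }
    return a ^ b;
}

// Returns 1 if the device block still matches its reference checksum.
int verify_block(const void *d_ptr, size_t bytes, uint32_t expected)
{
    uint32_t *host = (uint32_t *)malloc(bytes);
    cudaMemcpy(host, d_ptr, bytes, cudaMemcpyDeviceToHost);
    uint32_t sum = checksum32(host, bytes / sizeof(uint32_t));
    free(host);
    return sum == expected;
}
```

You would record the checksum once right after the cudaMemcpy that uploads the block, then call verify_block() every N iterations; the copy-back cost limits how often you can afford to do this.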
We didn’t write anything more sophisticated, mainly because I don’t know how to write a good tool. Any pointers on “consistency checks or memory block checksumming”? I agree it would make sense to write one for CUDA, as you say memtest is very useful, we actually run that for a day on every node before any serious calculation. Something similar for CUDA would be useful for a lot of people I guess.
For memory testing, memtest86 has to jump through a lot of hoops to defeat CPU caches, and then runs various pattern sequences to find bad address lines and/or bad memory cells. I bet that some of the memtest86 routines could be adapted for use on a GPU with some work. The GPU vendors must certainly have tools like this already. Until the advent of GPGPU they would have had no reason to make them available outside their engineering labs. Even now, they may not want to release their internal tools since such tools often have very device specific code in them. I bet that a small group of people could put together some “gpuburn” or “memtestgpu” type tools for CUDA without too much effort. I’d do it myself but I’m already swamped with other things. If nobody else takes it up, maybe I’ll do it in a month or so when I’m finished with my other more pressing commitments.
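A rough sketch of what the basic write/verify loop of such a "memtestgpu" might look like. Real memtest86 also does moving-inversion and address-line tests; this only shows fixed patterns, and it deliberately uses a plain error flag rather than atomics, since the 8800 GTX (compute capability 1.0) doesn't support them:

```cuda
// Sketch of a memtest86-style pattern test for GPU memory: fill a
// buffer with a pattern, verify it on the device, and report errors.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void fill(unsigned *buf, unsigned n, unsigned pattern)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = pattern;
}

__global__ void verify(const unsigned *buf, unsigned n, unsigned pattern,
                       unsigned *errflag)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    // Benign race: many threads may set the flag, all write the same value.
    if (i < n && buf[i] != pattern)
        *errflag = 1;
}

int main(void)
{
    const unsigned n = 64 * 1024 * 1024 / sizeof(unsigned);  // 64 MB test
    const unsigned patterns[] = { 0x00000000, 0xFFFFFFFF,
                                  0xAAAAAAAA, 0x55555555 };
    unsigned *d_buf, *d_err, h_err;
    cudaMalloc((void **)&d_buf, n * sizeof(unsigned));
    cudaMalloc((void **)&d_err, sizeof(unsigned));

    for (int p = 0; p < 4; p++) {
        cudaMemset(d_err, 0, sizeof(unsigned));
        fill<<<(n + 255) / 256, 256>>>(d_buf, n, patterns[p]);
        verify<<<(n + 255) / 256, 256>>>(d_buf, n, patterns[p], d_err);
        cudaMemcpy(&h_err, d_err, sizeof(unsigned),
                   cudaMemcpyDeviceToHost);
        printf("pattern %08x: %s\n", patterns[p],
               h_err ? "ERRORS DETECTED" : "ok");
    }
    cudaFree(d_buf);
    cudaFree(d_err);
    return 0;
}
```

To localize faults to particular cells (the pattern nogradi would want to look for), you would copy the buffer back and record the failing addresses on the CPU instead of just raising a flag.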