[Didn’t want to hijack this thread, but it’s related.]
I was wondering if anyone has some experience with using CUDA in a live production environment, with their system basically running 24/7 for months on end.
So far, my personal experience with CUDA has been pretty mixed, with a good chunk of unexplained random GPU crashes and seemingly random VRAM leaks, but I’m more than willing to file that under programmer error. On the other hand, if there is a general stability problem, I’d like to know before I dig myself a nice and comfortable hole. :)
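A minimal sketch of the kind of check one can add to spot a slow leak, just logging free VRAM around a work unit with cudaMemGetInfo (runWorkUnit here is only a placeholder, not code from our actual system):

```
// leak_check.cu -- minimal sketch: log free/total VRAM around a work unit,
// so a slow leak shows up as steadily shrinking free memory over the run.
#include <cstdio>
#include <cuda_runtime.h>

static void logVram(const char *tag) {
    size_t freeB = 0, totalB = 0;
    cudaError_t err = cudaMemGetInfo(&freeB, &totalB);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: cudaMemGetInfo failed: %s\n", tag, cudaGetErrorString(err));
        return;
    }
    printf("%s: %zu MB free of %zu MB\n", tag, freeB >> 20, totalB >> 20);
}

int main() {
    logVram("before unit");
    // runWorkUnit();           // placeholder: launch kernels and free buffers here
    cudaDeviceSynchronize();    // make sure all work (and deferred errors) has surfaced
    logVram("after unit");
    return 0;
}
```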
The thing is, I have a fantastic amount of data to process live and going Cell, while being a known beast, is going to be a lot more expensive than buying GPUs.
What I’m most interested in are basically “random” crashes after an algorithm has run for a certain amount of time. If our system goes down, we will lose data and we would be pretty unhappy about that. :)
Is it possible with your algorithm to set up an infrastructure with a central command node that distributes work to several compute nodes? The compute nodes write checkpoints (similar to the Folding@Home application), so when one goes down, not much is lost: it reboots (watchdog circuit!) and resumes crunching.
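Roughly what I have in mind for the compute-node side, just a sketch (the state layout, file name, checkpoint interval and the trivial step kernel are all made up; the real workload would go in their place):

```
// checkpoint.cu -- rough sketch of the checkpoint/resume idea, not production code.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

const char  *kCheckpointFile = "state.ckpt";   // made-up file name
const size_t kStateElems     = 1 << 20;        // made-up state size
const long   kStepsPerCkpt   = 1000;           // checkpoint every N iterations

// Stand-in for the real compute kernel.
__global__ void step(float *state, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 1.0f;
}

static bool loadCheckpoint(std::vector<float> &host, long &iter) {
    FILE *f = fopen(kCheckpointFile, "rb");
    if (!f) return false;                       // no checkpoint yet: fresh start
    bool ok = fread(&iter, sizeof(iter), 1, f) == 1 &&
              fread(host.data(), sizeof(float), host.size(), f) == host.size();
    fclose(f);
    return ok;
}

static void saveCheckpoint(const std::vector<float> &host, long iter) {
    FILE *f = fopen(kCheckpointFile, "wb");     // a real system would write to a
    if (!f) return;                             // temp file and rename it atomically
    fwrite(&iter, sizeof(iter), 1, f);
    fwrite(host.data(), sizeof(float), host.size(), f);
    fclose(f);
}

int main() {
    std::vector<float> host(kStateElems, 0.0f);
    long iter = 0;
    loadCheckpoint(host, iter);                 // resume from checkpoint if one exists

    float *dState = 0;
    cudaMalloc((void **)&dState, kStateElems * sizeof(float));
    cudaMemcpy(dState, host.data(), kStateElems * sizeof(float),
               cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (int)((kStateElems + threads - 1) / threads);

    for (;; ++iter) {
        step<<<blocks, threads>>>(dState, kStateElems);

        if (iter % kStepsPerCkpt == 0) {
            // cudaMemcpy synchronizes, so kernel-side errors also surface here
            cudaError_t err = cudaMemcpy(host.data(), dState,
                                         kStateElems * sizeof(float),
                                         cudaMemcpyDeviceToHost);
            if (err != cudaSuccess) break;      // bail out; the watchdog restarts the node
            saveCheckpoint(host, iter);
        }
    }
    cudaFree(dState);
    return 1;   // non-zero exit so a supervisor knows the run died mid-stream
}
```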
Other than the extremely rare kernel that triggers the issue in the thread you linked to, I have absolutely no stability issues with CUDA. I'm talking headless servers here without X running, sitting in a nice air-conditioned server room, whose only business is running jobs 24/7. Running HOOMD, I've so far racked up 4 GPUs × several months' worth of 24/7 simulations without a single hiccup.
On systems with a display, there are occasional issues that require a reboot. But I may be falsely blaming CUDA here: the only systems with a display that I run CUDA on are development machines, so an accidental programming mistake of mine may have overwritten essential parts of the GPU's memory hours earlier and only now be causing trouble.
Anecdotally, I ran a single Windows XP system with a 280GTX running a single kernel, uninterrupted, for just over two months.
That was approximately 1 billion kernel launches in a row before I stopped the run.
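For anyone wanting to try something similar, a back-to-back launch loop with periodic error checks could look like this (not the original test, just a sketch with a dummy kernel and a made-up check interval):

```
// stress.cu -- sketch of an endurance loop: launch kernels back to back and
// sync/check for errors every so often, counting how far the run got.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(float *buf) {            // trivial stand-in kernel
    buf[threadIdx.x] = buf[threadIdx.x] * 2.0f + 1.0f;
}

int main() {
    float *dBuf = 0;
    cudaMalloc((void **)&dBuf, 256 * sizeof(float));
    cudaMemset(dBuf, 0, 256 * sizeof(float));

    long long launches = 0;
    for (;;) {
        dummy<<<1, 256>>>(dBuf);
        ++launches;

        if (launches % 1000000 == 0) {          // sync and check every millionth launch
            cudaError_t err = cudaDeviceSynchronize();
            if (err != cudaSuccess) {
                fprintf(stderr, "failed after %lld launches: %s\n",
                        launches, cudaGetErrorString(err));
                break;
            }
            printf("%lld launches OK\n", launches);
        }
    }
    cudaFree(dBuf);
    return 0;
}
```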
We’ll have the possibility to resubmit work on the cluster, and I’m planning to have at least twice the processing power that’s needed, so little hiccups are not really a problem. Basically, one work unit has to be done in 10 seconds or less, so redoing a single unit is not an issue. We simply have pretty tight latency requirements (or at least we’re planning on having them), which means that if we go offline for an hour, even if we could redo the work, it may not be of much use anymore.
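The redo-on-deadline part would be something along these lines on the host side, just a sketch (crunchUnit and the 10-second figure are placeholders for the real work unit; a real dispatcher would hand the unit to a spare node instead of merely logging the miss):

```
// dispatch.cpp -- sketch of the "redo a unit if it blows the 10 s budget" idea.
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

bool crunchUnit(int unitId) {
    std::this_thread::sleep_for(std::chrono::seconds(1));  // pretend to work
    return true;
}

bool runWithDeadline(int unitId, std::chrono::seconds deadline) {
    auto fut = std::async(std::launch::async, crunchUnit, unitId);
    if (fut.wait_for(deadline) != std::future_status::ready) {
        fprintf(stderr, "unit %d missed its %lld s deadline, resubmitting\n",
                unitId, (long long)deadline.count());
        return false;   // note: fut's destructor still waits for the straggler
    }
    return fut.get();
}

int main() {
    for (int unit = 0; unit < 5; ++unit)
        runWithDeadline(unit, std::chrono::seconds(10));
    return 0;
}
```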