I am hoping to get access to eight separate GPUs by using four 9800GX2 cards. My computations can be distributed over these eight GPUs without requiring communication between the different cards (or between the two GPUs on each card).
So far, I have not found any information from others who have tried this before. Cooling and power are obvious problems that can probably be dealt with. Any ideas on problems that I can expect on the software side? In particular, I am looking for more information on the following issues:
(1) Will the (non-SLI) driver see the four 9800GX2 cards as eight independent GPUs? Or is there a maximum limit built into the current WinXP drivers for the non-SLI mode?
(2) Is there a maximum GPU count limit built into CUDA?
(3) I know early versions of CUDA required a separate CPU core for each GPU (so four GPUs would be the maximum on a quad-core CPU). Is this still the case?
I would greatly appreciate detailed comments on these issues from an NVIDIA staff member.
I sound like a broken record on this issue, but I have yet to hear of anyone fitting 4 double-wide PCI Express cards into an ATX case. The card in the last slot will extend past the edge of the motherboard, and will probably hit the bottom of the case.
(If anyone knows of a computer case which does not have this problem, please let me know.)
If it is not a requirement but just a recommendation, that would be fine for me, as I have very little CPU activity during the computations; putting two threads on each core would not be a problem.
I also recall (but maybe I’m wrong) that in the early days of CUDA, each kernel execution blocked the CPU thread. Maybe the requirement dates from those early versions and is no longer present in the newer ones…
It would be great if someone from NVIDIA could clarify this.
No, not really. Only that someone here on the forum reported performance problems when running more GPUs than CPU cores, and solved the problem by replacing a dual-core with a quad-core CPU. There is also the fact that I always see one core 100% busy whenever I run my CUDA applications (with one GPU). But I have not seen this confirmed by NVIDIA.
Why not perform a test with one core and two GPUs?
Sorry, I don’t have a clear reference. Most of the posts that discuss this are kind of old and hard to find. But, I’ll just describe the issues here and you can make your own judgment call: just consider yourself warned.
The problem has nothing to do with the CPU calculations your application performs. CUDA uses 100% of 1 CPU core whenever you call cudaThreadSynchronize(), or whenever an implicit sync is performed for you (i.e. when you do a memcpy or make more than 16 kernel calls in a row). The reason it busy-waits is simple: it keeps the latency of detecting kernel completion very low (well, some people don’t think it is that low: http://forums.nvidia.com/index.php?showtopic=62610). Hence the recommendation by nearly everyone on these forums to have 1 CPU core per GPU, so they can all busy-wait together without problems.
Is it an absolute 100% requirement? Not in all circumstances. If you make lots of short kernel calls or memcpys, and thus have lots of implicit syncs occurring, I would say it is a requirement. There is one post (I wish I could find it) where a particular user had their 2 GPU code executing slower than the 1 GPU version on a single-core system. Once they upgraded to 2 CPU cores, the problem went away and performance doubled as expected.
But if your kernel calls take seconds, a few extra ms of polling can’t hurt much, so you can get by with fewer cores (more on polling below).
The only way to work around the 100% CPU utilization is to insert events into your streams and then write your own busy-wait loop with a short sleep in it, using the stream query facility. This obviously increases the latency of detecting when you reach that point in the stream, but because of the sleep, the overhead of having 2 busy loops on one CPU core should be minimal.
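To make that concrete, here is a rough, untested sketch of what I mean (my own illustration only: dummyKernel, the 100 microsecond sleep interval, and the POSIX usleep() call are placeholder choices; on WinXP you would use Sleep() instead):

#include <cuda_runtime.h>
#include <unistd.h>   // usleep() is POSIX; use Sleep() on Windows

// Dummy kernel just so the sketch compiles; stands in for your real kernel.
__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    cudaEvent_t done;
    cudaEventCreate(&done);

    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // launch is asynchronous
    cudaEventRecord(done, 0);      // marker in stream 0, right after the kernel

    // Hand-rolled wait: poll the event instead of calling cudaThreadSynchronize().
    // The short sleep adds a little latency but keeps the CPU core mostly idle,
    // so two of these loops can share one core.
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(100);               // ~0.1 ms

    cudaEventDestroy(done);
    cudaFree(d_data);
    return 0;
}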
Also, I might add that with that many GPUs in a single case you are going to need a monster of a power supply and massive cooling, and I wouldn’t trust it not to overheat anywhere but in an air-conditioned server room. They are more expensive, but have you considered a couple of S870 or D870 units to get the same number of GPUs per workstation? They at least have their own power supplies and cooling.
MisterAnderson42: Thanks a lot! This really clarifies things for me. I agree that buying a stack of Tesla boxes would probably be a better/safer choice, although I think you will need at least three D870s to match the (potential) performance of four 9800GX2 cards.
There is also a certain “fun factor” involved in this, so I may try it temporarily, just to see if it works :) … and then take two cards out and put them in a second PC.
No problem. And I can’t argue with the fun factor. Upload a digital picture of the inside of the case when you get it running so we can all see :) Kill-A-Watt measurements of the system’s power draw could be entertaining too, if you have access to one.
Maybe someone has some experience with streams. I am thinking of using 2 GPUs per processor core (two cores have to do other things in my case).
I was thinking of doing the following:
- receive data from another machine to be processed
- determine whether it has to go to GPU 1 or GPU 2
- insert the memcopy into the stream for that GPU
- insert the kernel calls into the stream (about 8)
- insert the memcopy for the results into the stream
- insert an event
- check whether the event on GPU 1 or GPU 2 has been reached
- if the event has been reached, transfer the results to the other machine
- if new data to be processed has been received, do the steps above again; otherwise check again for an event in the queue for GPU 1 or 2
That way I have a busy loop that is doing three things (rough sketch after the list below):
-receiving data
-filling the streams
-moving results to other machine (when memcopy is finished)
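In rough code, I picture the loop for one GPU/stream looking something like the sketch below. This is just an illustration of my plan: keepRunning, receiveData and sendResults are stand-ins for my networking code, processKernel stands in for the real processing, and since each GPU currently needs its own host thread and context, I would run one copy of this loop per GPU.

#include <cuda_runtime.h>
#include <unistd.h>   // usleep() is POSIX; use Sleep() on Windows

// Placeholder kernel; stands in for the real processing.
__global__ void processKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Placeholder stubs for the real-time I/O with the other machine.
static bool keepRunning()                   { return false; }  // stub: stop immediately
static bool receiveData(float *, int)       { return false; }  // stub: no data
static void sendResults(const float *, int) {}                 // stub: discard

int main()
{
    const int n = 1 << 18;
    float *h_in, *h_out, *d_buf;
    cudaMallocHost((void **)&h_in,  n * sizeof(float));  // pinned, so async memcpys can overlap
    cudaMallocHost((void **)&h_out, n * sizeof(float));
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t resultsReady;
    cudaEventCreate(&resultsReady);

    bool inFlight = false;   // is a block of data currently queued on the GPU?

    while (keepRunning()) {
        // 1) receive data and fill the stream
        if (!inFlight && receiveData(h_in, n)) {
            cudaMemcpyAsync(d_buf, h_in, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream);
            for (int k = 0; k < 8; ++k)   // the ~8 kernel calls
                processKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
            cudaMemcpyAsync(h_out, d_buf, n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream);
            cudaEventRecord(resultsReady, stream);   // marker: results are in h_out
            inFlight = true;
        }

        // 2) non-blocking check; ship the results as soon as the event has passed
        if (inFlight && cudaEventQuery(resultsReady) == cudaSuccess) {
            sendResults(h_out, n);
            inFlight = false;
        }

        usleep(100);   // short sleep so two of these loops can share one CPU core
    }

    cudaEventDestroy(resultsReady);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_buf);
    return 0;
}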
As I have very little experience in this area (I am basically a MATLAB coder), I would like to know if people already see trouble with this setup. It is for a real-time system, so normally I will be sending and receiving data while the kernels are running, and my busy loop will actually be doing interesting work. When it stops receiving data, my kernels should be finished; otherwise I don’t mind busy-looping.
No offense taken, but since you yourself posted a message about putting 2 “GPU threads” on 1 CPU core, my post seems quite relevant to me (using streams like this may be a solution for the one-core-per-GPU “problem” that I will also run into if I swap my 2x 8800GTX for 2x 9800GX2).
Since you are managing your own busyloop, this should work out without too many problems. 8 kernel calls shouldn’t trigger an implicit sync, but I’ve never tested the queue depth with the streaming API.