NF200 chips are just switches… they don’t add bandwidth, they just allow lanes to be switched dynamically as load varies.
It’s sort of like your ethernet router allowing multiple machines to access your cable modem… each computer can draw the full bandwidth of your internet connection as long as nobody else is, but the router doesn’t actually add any new bandwidth.
The mere presence of such switches on a motherboard shows that there are no new PCIe lanes with extra bandwidth, just the same lanes being reused.
It only has a single X58 hub, and it still only has 32 lanes for GPUs. It should be no different from any other single-socket X58 motherboard with NF200s switching 16 lanes between pairs of PCIe slots. Any GPU should be able to hit peak bandwidth by itself, but GPUs behind the same NF200 won't be able to do so simultaneously.
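That contention is easy to measure for yourself. Below is a minimal sketch (not from this thread) that times host-to-device copies on two GPUs, first one at a time and then concurrently; device IDs 0 and 1 are placeholders for two cards you suspect sit behind the same NF200.

// Hedged sketch: compares solo vs. concurrent host->device bandwidth on two GPUs
// to see whether they share an upstream PCIe link (e.g. one NF200 switch).
#include <cstdio>
#include <thread>
#include <chrono>
#include <cuda_runtime.h>

static double h2dGBps(int dev, size_t bytes, int reps)
{
    cudaSetDevice(dev);
    void *h = nullptr, *d = nullptr;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);   // pinned host buffer
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // warm up
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    double s = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    cudaFree(d);
    cudaFreeHost(h);
    return (double)bytes * reps / s / 1e9;
}

int main()
{
    const size_t bytes = 64 << 20;   // 64 MiB per transfer
    const int reps = 50;

    // Each GPU alone: either one should reach full x16 bandwidth.
    printf("GPU 0 alone: %.2f GB/s\n", h2dGBps(0, bytes, reps));
    printf("GPU 1 alone: %.2f GB/s\n", h2dGBps(1, bytes, reps));

    // Both at once: behind the same switch, the sum is capped by the single
    // upstream x16 link and each GPU drops well below its solo number.
    double r0 = 0, r1 = 0;
    std::thread a([&] { r0 = h2dGBps(0, bytes, reps); });
    std::thread b([&] { r1 = h2dGBps(1, bytes, reps); });
    a.join(); b.join();
    printf("Concurrent: GPU 0 %.2f GB/s, GPU 1 %.2f GB/s\n", r0, r1);
    return 0;
}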
So, is this more-or-less a software issue? I’m not familiar with NUMA beyond the basics of how it works.
It seems to me that if you had the driver run one thread per physical CPU, and allocated host memory in the memory space of the CPU that is controlling the device you’re going to transfer to, the problem should be pretty much alleviated…is that not workable? I understand that for a really complex program, you might have issues transferring memory that resides in one CPU’s address space to a GPU controlled by the other CPU, but barring that…
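For what it's worth, here is a minimal sketch of what that affinity scheme could look like, assuming libnuma (link with -lnuma) and a CUDA runtime with cudaHostRegister. The gpuToNode[] mapping is a made-up placeholder; the real GPU-to-socket mapping depends on the board's PCIe topology.

// Hedged sketch: one worker thread per GPU, pinned to the NUMA node whose IOH
// the GPU hangs off, with the staging buffer allocated on that same node.
#include <cstdio>
#include <thread>
#include <vector>
#include <numa.h>
#include <cuda_runtime.h>

static void worker(int gpu, int node, size_t bytes)
{
    numa_run_on_node(node);                               // keep this thread on the local socket
    void *h = numa_alloc_onnode(bytes, node);             // host buffer in node-local memory
    cudaHostRegister(h, bytes, cudaHostRegisterDefault);  // pin it for fast DMA

    cudaSetDevice(gpu);
    void *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);      // local-node transfer, no QPI hop

    cudaFree(d);
    cudaHostUnregister(h);
    numa_free(h, bytes);
}

int main()
{
    if (numa_available() < 0) { printf("no NUMA support\n"); return 1; }

    // Placeholder mapping: GPUs 0,1 behind the IOH on node 0; GPUs 2,3 on node 1.
    const int gpuToNode[] = { 0, 0, 1, 1 };
    const size_t bytes = 64 << 20;

    std::vector<std::thread> ts;
    for (int g = 0; g < 4; ++g)
        ts.emplace_back(worker, g, gpuToNode[g], bytes);
    for (auto &t : ts) t.join();
    return 0;
}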
Even with correct NUMA affinity, things don’t appear to work as they should on these dual Tylersburg motherboards, which makes me think it is probably a BIOS interrupt routing problem.
If you go for just two Fermi, then you don’t need the NF200 at all. You can feed them each an x16 link directly from the chipset. If you go to 3 or 4 Fermi (or whatever GPU), then the NF200 will ensure that each GPU at least has the potential for full bandwidth if the other devices are not transferring any data.
(Incidentally, either the NF200 or something just like it is used in the GTX 295 to link each GPU to the shared PCI-Express connector. If you bandwidth test either half of a GTX 295 alone, you see full performance.)
Generally, yes; however, if two GPUs are all you want, it's better to connect each one directly to the X58 hub. The NF200 does add a bit of latency.
I still like the EVGA board. I just hope it comes with a lifetime warranty. I wonder if the dual-socket dual-X58 boards have problems because of BIOS issues, or because of a more mundane design problem with the northbridge itself.
I use an ASUS P6T7 Supercomputer to plug in 2 GTX 285s and 2 PCI-E 4X frame grabbers (for 4 Camera Link Base industrial cameras),
with 24 GB of DDR3 memory.
Why don't I use a Xeon? Price/performance, of course. I use the GPUs as my main calculation cores, so the CPU is not that important for me now.
A 4-core Core i7 is enough.
I think the board uses 40 PCI-E 2.0 lanes, doesn't it?
So both GTX 285s can still run at full speed (5 GB/s each way, for each card), can't they?
I also have 2 GTX 295s.
If I change my 2 GTX 285s to 2 GTX 295s and transfer data to these 4 GPUs simultaneously, will it be slower?
(Because I don't have such a high-wattage power supply, I hope someone can give me some advice before I buy an over-1300 W unit for them.)
Of course, if it can still run just as fast, I hope I can find a motherboard that lets me plug in 2 GTX 295s and 4 PCI-E 4X frame grabbers (using 48 lanes) so that I can process 8 cameras' data in one PC.
But since GTX-series cards are double-width, that setup would need 8 slots of space. It seems to be impossible, right?
You'll have to check your slot configuration to make sure nothing else is connected to the two nForce200 chips. Basically, on that board, you want each GPU connected to a separate nForce200 chip, and anything else connected directly to the northbridge. If the frame grabbers share a chip with a GPU, they may eat up precious bandwidth. This only applies if you have two cards.
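One way to check that mapping from software is to print each CUDA device's PCI bus ID and compare it against the board's block diagram (or the output of lspci -t). A minimal sketch, assuming a CUDA runtime recent enough to expose pciBusID/pciDeviceID in cudaDeviceProp:

// Hedged sketch: lists every CUDA device with its PCI bus/device ID so you can
// see which NF200 or northbridge port each card sits behind.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("device %d: %-20s PCI %02x:%02x.0\n",
               i, p.name, p.pciBusID, p.pciDeviceID);
    }
    return 0;
}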
Generally speaking, yes: each half of a 295 shares bandwidth with the other half, so simultaneous transfers will hit a bandwidth bottleneck. Still, I wouldn't worry too much about that, and I'd definitely go for the 295s.
Why not put both 285s and both 295s at the same time?
Thermaltake makes a very good 1200W PSU that should easily handle a trio of 295s on an X58 platform. They also make a 1500W model, but that's just for European markets.
Or you could use a separate PSU for the GPUs. Do a Google search for “GPU folding rack” to see how it can be done.
It should be possible if you manage to mount the 295s externally, in a similar fashion to a Tesla S1070. Don't put too much hope in finding an 8-slot motherboard, though. I'd put more hope in finding frame grabbers with more inputs.
You can do this even with the 7-slot P6T7 motherboard by putting one of the GTX 295 cards on the bottom slot (closest to the edge). It will extend over some non-essential connectors, but will fit. (This is how people do quad GTX 295 systems with this board.) The trick is to find a case that has 8 slots cut in the back. Both Lian-Li and Antec make cases with 8 or more rear slots which should work in this configuration.
Yes we will be manufacturing a tray for this and other SSI MEB motherboards.
I’m surprised that wasn’t mentioned by the author of the article.
At any rate, the answer is yes: we have collaborated with EATX and will be getting trays made prior to the board's release. It will fit all 10-PCI-slot back panels, and thus 6 cases.
I think I’m going to call Tyan tomorrow and ask them to test out their board to make sure it works like they claim (or they can send me one and I’ll test it for them). I’m thinking about building a rack-mounted compute server in a couple of months (perhaps I’ll wait for Fermi…) and it would be nice to have maximum PCIe bandwidth to each card.
Currently we have successfully connected 2 S1070s to the machine (either with 4 HIC cards or with 2 DHIC cards).
Connecting 3 S1070s failed on the 10th GPU. I've opened bug report #642453:
Synopsis: simpleMultiGPU fails with 12 GPUs (3 S1070). We're using a SuperMicro host (7046GT-TRF) connected to 3 S1070s (a total of 12 GPUs).
When running the simpleMultiGPU code from the SDK with 8 GPUs everything runs fine.
When running with 12 GPUs we get the following error:
RUN 1:
-bash-3.2$ ./simpleMultiGPU
CUDA-capable device count: 12
main(): generating input data...
main(): waiting for GPU results...
Running kernel on device [0]
Running kernel on device [1]
Running kernel on device [2]
Running kernel on device [8]
Running kernel on device [10]
Running kernel on device [5]
Running kernel on device [3]
Running kernel on device [4]
Running kernel on device [9]
Running kernel on device [7]
Device [6] failed
Device [11] failed
Running kernel on device [6]
cutilCheckMsg() CUTIL CUDA error: reduceKernel() execution failed.
in file <simpleMultiGPU.cpp>, line 74 : no CUDA-capable device is available.
RUN 2:
-bash-3.2$ ./simpleMultiGPU
CUDA-capable device count: 12
main(): generating input data...
main(): waiting for GPU results...
Running kernel on device [0]
Running kernel on device [1]
Running kernel on device [3]
Running kernel on device [5]
Running kernel on device [7]
Running kernel on device [9]
Running kernel on device [11]
Running kernel on device [6]
Running kernel on device [2]
Running kernel on device [4]
Running kernel on device [10]
Device [8] failed
cudaSafeCall() Runtime API error in file <simpleMultiGPU.cpp>, line 62 : no CUDA-capable device is available.
Please advise
-------------------- Additional Information ------------------------
Computer Type: PC
Bus Type: AGP
Driver Version: cudadriver_2.3_linux_64_190.18
Products: other
Multithreaded Application: yes
Release: Public
How often does problem occur: Every time
CPUs (single or multi): 2
RAM (amount & type): 36 GB DDR3
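One way to narrow this down (not part of the bug report) would be a minimal per-device probe: touch each of the 12 GPUs on its own, allocate a small buffer, copy to it, and print the first error. If specific devices fail even in isolation, the problem lies with enumeration or driver resources rather than with simpleMultiGPU itself.

// Hedged sketch: probes every device one at a time and reports the first
// CUDA error, if any, for each one.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("CUDA-capable device count: %d\n", n);

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        void *d = nullptr;
        int x = 42;
        cudaError_t err = cudaMalloc(&d, sizeof(int));
        if (err == cudaSuccess)
            err = cudaMemcpy(d, &x, sizeof(int), cudaMemcpyHostToDevice);
        printf("device %2d: %s\n", i,
               err == cudaSuccess ? "OK" : cudaGetErrorString(err));
        cudaFree(d);
    }
    return 0;
}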