Compatible servers for S1070: collecting some information

Hi All:

Our department wants to buy a Tesla S1070. I know there is a compatibility list on the CUDA website, but I would still like someone to confirm that particular systems actually work, since we have already run into trouble even with a server that is on the list, such as the R410. If you are already using a Tesla S1070 successfully, please tell me your server information. If it has any special requirements, please let me know as well. I really do not want to buy another server that ends up unable to use the Tesla S1070 :( Thanks very much!!

no one uses S1070??

Hi,

We’re currently using 20 S1070 connected to 10 Super Micro machines (7046GT-TRF7046GT-TRF-TC4) with varying RAM sizes (average 36GB).

All are running Linux and seem to be working fine (we’ve only recently installed them).

Currently we’re facing issues connecting 3 S1070 per machine (using nVidia’s DHIC card). It would be nice if we could connect 4 S1070 (making it 16 GPUs per machine), but I still don’t know whether that is possible, and as I said we are still having problems connecting 3.

eyal

Nice!!! Thanks!!!

Since we are only considering one GPU per machine for now, it looks like it will be fine. However, could you please point out exactly which one it is in this list: http://www.nvidia.com/object/tesla_compati…orms.html#s1070 ? I have no idea what the numbers you gave me (7046GT-TRF7046GT-TRF-TC4) refer to… Thanks again!

Sorry, the numbers got merged :)

7046GT-TRF:

http://www.supermicro.com/products/system/…-7046GT-TRF.cfm

7046GT-TRF-TC4:

http://www.supermicro.com/products/system/…TRF.cfm?GPU=TC4

Why only one GPU per machine? Do you mean one S1070 per machine? Why not 2 S1070?

eyal

Thanks a lot! Actually our department currently only wants to buy one S1070… mostly just for research purposes :) You have so many S1070s, it must be really cool for computing!

I have several SuperServer 6016GT-TF and they work very well.
These are 1U solutions, with good pci-e transfer on both slots.

Yes… but I would still like to know what problem you ran into when connecting 3 S1070, since I see the website says they support up to 4.

Thank you~ it looks like Super Micro should be reliable.

Currently I fail to run the simpleMultiGPU sample from the SDK. When I use more than 10 GPUs it crashes randomly, saying there is no CUDA-enabled device. I’ve opened a bug on the developer site and hope they can assist.

Assuming 3 will go fine, we’ll test 4 S1070. However, there might be a problem there, since there will be no slot left for the card that drives the screen. I’m not sure how that will work.

In any case I’ll be happy with 3 S1070; 4 would be amazing… :)

eyal

You could use a GHIC (Graphic Host Interface Card) in one of the slots.
It is like a normal HIC for the S1070 but it has an integrated GPU and display output.
I will ask the driver team if there is a limit in the number of GPUs that the driver can enumerate. It could also be a BIOS issue.

Thanks for the fast answer. We’re trying to update the BIOS this week; it’s from last September.

Using the GHIC: would it take the place of one S1070? So that way I could only fit 3 S1070 and one GHIC?

thanks

eyal

To connect an S1070, you need two HICs, each driving 2 of the GPUs inside the S1070 via the thick cable.
The GHIC is just like the normal HIC but it has on-board GPU to drive a display and an additional connector for a monitor.
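
As a rough way to see what this means for slot usage, here is a minimal Python sketch of the arithmetic. It assumes 4 GPUs per S1070 and 2 GPUs per HIC (as described above), and additionally assumes that a DHIC drives all 4 GPUs of one S1070, which matches how the cards are discussed in this thread but is not taken from NVIDIA documentation. The names `plan` and `GPUS_PER_CARD` are made up for illustration.

```python
# Rough slot and GPU arithmetic for connecting S1070 units to one host.
# Assumptions (not NVIDIA specifications): 4 GPUs per S1070, a HIC drives
# 2 GPUs, a DHIC drives 4 GPUs (one whole S1070).
GPUS_PER_S1070 = 4
GPUS_PER_CARD = {"HIC": 2, "DHIC": 4}


def plan(num_s1070, card):
    """Return (total GPUs, host slots needed, GPUs sharing each slot)."""
    per_card = GPUS_PER_CARD[card]
    total_gpus = num_s1070 * GPUS_PER_S1070
    slots = total_gpus // per_card
    return total_gpus, slots, per_card


if __name__ == "__main__":
    for units in (1, 2, 3, 4):
        for card in ("HIC", "DHIC"):
            gpus, slots, sharing = plan(units, card)
            print(f"{units} x S1070 via {card}: "
                  f"{gpus} GPUs, {slots} slots, {sharing} GPUs per slot")
```

The "GPUs per slot" figure is only a hint about how host bandwidth is shared: more GPUs behind one slot means less bandwidth per GPU when they all transfer at once.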

Hi,

We’re currently using the DHIC cards, so we should theoretically be able to connect 4 S1070, right?

If I put 4 S1070 with 4 DHIC cards, can I still use the GHIC? I guess not… the GHIC will require a slot of its own, right?

Anyway here’s the bug reference for the 3 S1070 and simpleMultiGPU test I ran (#642453):

Synopsis: simpleMultiGPU fails with 12 GPUs (3 S1070)

We’re using a SuperMicro host (7046GT-TRF) connected to 3 S1070 (total of 12 GPUs).

When running the simpleMultiGPU code from the SDK with 8 GPUs everything runs fine.

When running with 12 GPUs we get the following error:

RUN 1:

-bash-3.2$ ./simpleMultiGPU

CUDA-capable device count: 12

main(): generating input data…

main(): waiting for GPU results…

Running kernel on device [0]

Running kernel on device [1]

Running kernel on device [2]

Running kernel on device [8]

Running kernel on device [10]

Running kernel on device [5]

Running kernel on device [3]

Running kernel on device [4]

Running kernel on device [9]

Running kernel on device [7]

Device [6] failed

Device [11] failed

Running kernel on device [6]

cutilCheckMsg() CUTIL CUDA error: reduceKernel() execution failed.

in file <simpleMultiGPU.cpp>, line 74 : no CUDA-capable device is available.

RUN 2:

-bash-3.2$ ./simpleMultiGPU

CUDA-capable device count: 12

main(): generating input data…

main(): waiting for GPU results…

Running kernel on device [0]

Running kernel on device [1]

Running kernel on device [3]

Running kernel on device [5]

Running kernel on device [7]

Running kernel on device [9]

Running kernel on device [11]

Running kernel on device [6]

Running kernel on device [2]

Running kernel on device [4]

Running kernel on device [10]

Device [8] failed

cudaSafeCall() Runtime API error in file <simpleMultiGPU.cpp>, line 62 : no CUDA-capable device is available.

Please advise
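
For anyone comparing several runs like the two above, a small Python sketch such as the following might help tabulate which devices fail. It only matches the message formats that appear in the posted output; the names `summarize`, `RUN1` and `RUN2` are hypothetical, and the embedded logs are just the relevant lines copied from the two runs.

```python
import re


def summarize(log_text):
    """Summarize one simpleMultiGPU run from its console output."""
    count_match = re.search(r"CUDA-capable device count: (\d+)", log_text)
    count = int(count_match.group(1)) if count_match else 0
    started = {int(n) for n in
               re.findall(r"Running kernel on device \[(\d+)\]", log_text)}
    failed = {int(n) for n in
              re.findall(r"Device \[(\d+)\] failed", log_text)}
    return {
        "count": count,
        "failed": sorted(failed),
        # Devices that were reported but never printed a "Running kernel" line.
        "never_started": sorted(set(range(count)) - started),
    }


RUN1 = """\
CUDA-capable device count: 12
Running kernel on device [0]
Running kernel on device [1]
Running kernel on device [2]
Running kernel on device [8]
Running kernel on device [10]
Running kernel on device [5]
Running kernel on device [3]
Running kernel on device [4]
Running kernel on device [9]
Running kernel on device [7]
Device [6] failed
Device [11] failed
Running kernel on device [6]
"""

RUN2 = """\
CUDA-capable device count: 12
Running kernel on device [0]
Running kernel on device [1]
Running kernel on device [3]
Running kernel on device [5]
Running kernel on device [7]
Running kernel on device [9]
Running kernel on device [11]
Running kernel on device [6]
Running kernel on device [2]
Running kernel on device [4]
Running kernel on device [10]
Device [8] failed
"""

if __name__ == "__main__":
    for name, log in (("RUN 1", RUN1), ("RUN 2", RUN2)):
        print(name, summarize(log))
```

The failing devices differ between the two runs (6 and 11, then 8), which is consistent with the crash being random rather than tied to one particular GPU.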

-------------------- Additional Information ------------------------

Computer Type: PC

Bus Type: AGP

Driver Version: cudadriver_2.3_linux_64_190.18

Products: other

Multithreaded Application: yes

Release: Public

How often does problem occur: Every time

CPUs (single or multi): 2

RAM (amount & type): 36 ddr3

Any update on this?

Did you ever get 3 or 4 S1070s working per node?

We have 2 S1070s connected per machine, recently we connected 2 S2050s per machine as well.

Nothing has changed since the previous posts… :( but we also didn’t push it too hard.

eyal

Are you using two of the Dual Host Interface Cards (DHIC) or four of the HICs?

Currently 4 HICs.

Thanks for the information. We tried four HICs on the Tyan S7025 motherboard, but could not get Windows 2008 HPC R2 to work (it works with only 3 HICs). We will try the same with the SuperMicro motherboard.

Also, we have some DHICs on order, so if we can get 2 of the S1070s going, we’ll try getting 4 S1070s per node. With the 1 CPU per GPU recommendation, 2 S1070s (8 GPUs) is all that’s recommended for a two-socket node.

We now have, in production, at least 10 Supermicro machines, each with 2 S1070 connected, running ~24x7 for the last 6-8 months. No problems!! :)

They run Linux, however, not Windows.

Also bear in mind that the DHIC will cut the PCI bandwidth in half, so if this is a bottleneck in your application the DHIC will make it worse.

eyal