Setup Procedure 4xTesla RHEL5 - Multiuser CUDA Programming

We just got a new box from Colfax, and I'm looking for some advice on configuring the rig for several (possibly simultaneous) users.

It's an AMD AM2+ Phenom quad-core Model 9950 (2.6 GHz, 4 MB cache), 16 GB 800 MHz DDR2, dual 300 GB 10K RPM drives, and 4 Tesla C1060s.
All of this is currently running RHEL 5.3.

What I'd really like to do is have a setup where several of us can essentially VNC into the box and develop and run code remotely. Think of a class full of students all accessing the same machine (quite possibly at the same time) to do their CUDA homework.

We are going to be developing some image-processing applications (both still image and video) and would like multiple users to be able to work remotely on the same box.

I need some advice on how to facilitate this usage scenario.

I can already SSH in and run non-GUI SDK apps, but I can't run anything that uses OpenGL, for instance to display an image, which is severely limiting for our needs.

I can connect using VNC, but the RHEL VNC server configuration requires manually adding every user and serves up a Windows 3.11-quality X session.

How do the NVIDIA folks do this kind of parallel development?
How has anyone else done this?

Thanks in advance,

Erick

A small tip:

  1. Don't use VNC; use NX, a.k.a. NoMachine. Much, much, much faster and nicer. NX doesn't support OpenGL though, AFAIK.

I don't think I want “client-side” OpenGL support here, though. I want the CUDA box to do all the rendering and essentially display a dumb pixmap on the client side. This is the trouble I have now with SDK examples like nbody…

But thanks for the info. I saw it mentioned in another thread but steered away from it because a VNC server is already built into RHEL 5.3 and didn't require me to install a server app. I guess I'll check it out a little more to see what I can make work.

HOWEVER

SOMEBODY AT NVIDIA must have dealt with this issue before…

Erick

Re. OpenGL - sorry, I don't have experience with this, so these are just some rambling thoughts.

  • Technically, you can software render on the server side… Is there actually software OpenGL support in the Mesa library? I've seen it mentioned but nothing concrete…
  • The Tesla machine behaves a bit like a cluster… In many cases, I imagine you'd pipe the data to a cluster, pull it back, and visualise it locally, since the cluster is the place for large-scale number smashing and the dev/client machine is the place for polygon juggling.
  • What happens if you simply do X11 tunneling with SSH? (The small probe after this list shows what kind of GL context you actually end up with.)
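
For the last point, here is a tiny probe, assuming freeglut is installed (it already is if the SDK samples build), that prints the renderer behind whatever GL context the session hands out. Over plain "ssh -X" you will typically see an indirect, non-NVIDIA (often Mesa software) renderer; on a desktop driven by the NVIDIA X driver you see the NVIDIA one. This is just a sketch, not anything from the SDK:

    // glprobe.cu - print what the session's OpenGL context really is.
    // Build with: nvcc glprobe.cu -o glprobe -lglut -lGL   (plain gcc works too)
    #include <stdio.h>
    #include <GL/glut.h>

    int main(int argc, char **argv)
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_RGBA);
        glutCreateWindow("gl probe");   /* the GL context is created here */

        printf("GL_VENDOR:   %s\n", (const char *)glGetString(GL_VENDOR));
        printf("GL_RENDERER: %s\n", (const char *)glGetString(GL_RENDERER));
        printf("GL_VERSION:  %s\n", (const char *)glGetString(GL_VERSION));
        return 0;
    }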

I’m pretty sure interop doesn’t work with this setup, but you should be able to use CUDA to generate data and then Mesa or whatever to render it over the network.

There should be some fixes in 2.2 to make multi-user/multi-device configurations behave a bit friendlier, so hopefully that will help you out.
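
In the meantime, one low-tech way to share the four C1060s among several logins is to have each user's programs pin themselves to "their" board. This is only a sketch of that idea; MY_CUDA_DEVICE is our own made-up convention (set in each user's shell profile), not anything the CUDA toolkit provides:

    // pickdevice.cu - pin a session to one of the four Teslas.
    // Build with: nvcc pickdevice.cu -o pickdevice
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        /* MY_CUDA_DEVICE is our own convention, e.g. "export MY_CUDA_DEVICE=2"
           in one student's shell profile; it is not a CUDA feature. */
        const char *env = getenv("MY_CUDA_DEVICE");
        int dev = env ? atoi(env) : 0;
        if (dev < 0 || dev >= count)
            dev = 0;                       /* fall back to the first board */

        cudaSetDevice(dev);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Using device %d of %d: %s\n", dev, count, prop.name);
        return 0;
    }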

Same problem with OpenGL missing:

Run “nbody -benchmark -n=” to measure perfomance.

freeglut (./nbody): Unable to create direct context rendering for window ‘CUDA n-body system’

This may hurt performance.

Required OpenGL extensions missing.
Segmentation fault

[root@ml-tesla release]#

Did you set up X to use the nvidia X driver, and are you running nbody on the X desktop associated with an NVIDIA GPU? I'd like to see an nvidia-bug-report.log from the system where you're experiencing this problem.

tmurry PM'd me back saying that this is an interop issue.

So my issue seems to be related to the fact that I'm X-forwarding… I'm hoping that sooner or later we will get that capability. Most “supercomputers” are used by multiple users simultaneously; for our case it'll just be a pain (effectively impossible) to render remotely until that mechanism is clarified for me.

Erick

If you want to render over X forwarding, don't use interop calls. Simple as that. Those SDK samples make those calls, but there's no reason they have to except for performance, so there's nothing really stopping you from stripping them out and routing the data back through the host (and then it would work over a network again).
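
For what it's worth, here is a minimal sketch of that host-routed path. The kernel, window title, and image size are made up for illustration, and it assumes freeglut plus the CUDA runtime; the point is that there is not a single interop call in it. The frame goes from the device to the host and then through glDrawPixels, so it only needs whatever GL context the remote display can provide:

    // nointerop.cu - CUDA computes a frame, the host uploads it with plain GL.
    // Build with: nvcc nointerop.cu -o nointerop -lglut -lGL
    #include <cuda_runtime.h>
    #include <GL/glut.h>

    const int W = 512, H = 512;
    static unsigned char h_pixels[W * H * 4];   // frame staged on the host
    static unsigned char *d_pixels = 0;         // frame computed on the device

    // Toy kernel standing in for the real image-processing work.
    __global__ void pattern(unsigned char *out, int w, int h, float t)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        unsigned char v = (unsigned char)
            ((sinf(x * 0.05f + t) * cosf(y * 0.05f) * 0.5f + 0.5f) * 255.0f);
        int i = (y * w + x) * 4;
        out[i] = out[i + 1] = out[i + 2] = v;
        out[i + 3] = 255;
    }

    static void display(void)
    {
        static float t = 0.0f;
        t += 0.05f;

        dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
        pattern<<<grid, block>>>(d_pixels, W, H, t);

        // The "routing through the host": copy back instead of mapping a GL buffer.
        cudaMemcpy(h_pixels, d_pixels, sizeof(h_pixels), cudaMemcpyDeviceToHost);

        glClear(GL_COLOR_BUFFER_BIT);
        glRasterPos2f(-1.0f, -1.0f);            // draw from the window's lower-left corner
        glDrawPixels(W, H, GL_RGBA, GL_UNSIGNED_BYTE, h_pixels);
        glutSwapBuffers();
        glutPostRedisplay();
    }

    int main(int argc, char **argv)
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
        glutInitWindowSize(W, H);
        glutCreateWindow("CUDA -> host -> glDrawPixels (no interop)");

        cudaMalloc((void **)&d_pixels, sizeof(h_pixels));
        glutDisplayFunc(display);
        glutMainLoop();                         // never returns with classic GLUT
        return 0;
    }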

PS: that error message is what you get if you try to run nbody over ssh -X.