CUDA on Dell Server with Virtualization

Hello All,

A newbie here looking for some hardware advice. I am starting to mess around with CUDA for a couple of our MATLAB and C-based apps. I was going to buy a GTX 280 and put it in one of our Dell 2900 servers. Is that a good idea? Long term, should I build my own box, or will the Dell 2900 architecture do the job?

The Dell 2900 has dual quad-core CPUs and 16 GB of RAM. The base OS is Windows Server 2008, but we have several virtual machines on top of that. Any known issues with virtual machines?

I will probably build an XP virtual machine specifically for CUDA and allocate it about 4 GB of RAM and 2 processors. Of course I could build almost any OS/RAM/processor virtual config, so any suggestions would be appreciated.

Thanks,
Mike

CUDA will not work inside of a virtual machine. The GPU must be fully visible to the OS.
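
If you want to see this for yourself, here is a minimal host-side check (just a sketch, assuming the toolkit and driver are installed): from inside a typical guest OS it will report zero devices or a driver error, because the virtual machine never exposes the physical card.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Report what the CUDA runtime can see on this OS (host or guest). */
    int main(void)
    {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            printf("CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("CUDA devices visible: %d\n", count);
        return 0;
    }

Compile it with nvcc, run it once on the host and once in the guest, and compare.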

Well, I don’t have a specific answer to your question, but I’ve also done some stuff with server virtualization, and I don’t see how it could work if you plan on running code inside a virtual machine (since the virtual machine doesn’t have access to the “real” machine’s video card).

However, if you’re going to run the code directly on the box, I suppose it could work, but you might run into some performance hits if your virtual machines start taking up a lot of CPU time (and affecting the host/device transfer speed of your CUDA program).

So, short answer – it’s probably a better idea to just build a separate box for this. Remember that CUDA kernels are limited to 5 seconds of runtime if you have a display attached to the card – so make sure to get something with integrated graphics, or get an NVS card (PCI Express x1, I think) for the monitor, and let the GTX 280 run the CUDA code.
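
To give an idea of the usual workaround if you do end up with a display on the CUDA card: split the job into many short kernel launches so that each one finishes well under 5 seconds. Just a sketch – the kernel body and the sizes below are made up.

    #include <cuda_runtime.h>

    /* Placeholder kernel: each launch touches only one chunk of the array. */
    __global__ void process_chunk(float *data, int offset, int n)
    {
        int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < offset + n)
            data[i] = data[i] * 2.0f + 1.0f;   /* stand-in for the real per-element work */
    }

    int main(void)
    {
        const int total = 1 << 24;   /* total elements */
        const int chunk = 1 << 20;   /* elements per launch -- tune so each launch stays short */

        float *d_data;
        cudaMalloc((void **)&d_data, total * sizeof(float));
        cudaMemset(d_data, 0, total * sizeof(float));

        for (int offset = 0; offset < total; offset += chunk) {
            int n = (total - offset < chunk) ? (total - offset) : chunk;
            int threads = 256;
            int blocks = (n + threads - 1) / threads;
            process_chunk<<<blocks, threads>>>(d_data, offset, n);
            cudaThreadSynchronize();   /* finish this piece before launching the next */
        }

        cudaFree(d_data);
        return 0;
    }

Each launch here is a trivial elementwise pass, so it is nowhere near the limit; the point is just that the watchdog applies per launch, not per job.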

EDIT: Dang, netllama beat me to it…

Wow, thanks for the lightning fast replies.

Sounds like virtualization is out. I have another Dell 2900 with a similar hardware config running Windows Server 2003 (no virtualization) and integrated video. Should I give it a shot or just give up on the server idea?

Thanks,
Mike

That sounds like it would work. Just make sure you connect the display to the integrated video, so that CUDA is free to use the GTX280 for as long as it needs.

Does that happen to be standard PC architecture? Or has the Dell marketing department made up its own ;)

Anyway, what matters is whether you have a free PCIe x16 slot, whether you have room for a 10.5"-long double-slot card, and whether your PSU has a spare 230 W. That 280 is a beast.

Re: integrated graphics. It’ll only work, I believe, if the integrated chip is also NVIDIA. Otherwise, Windows will have driver conflicts. (The solution would be to just disable the integrated video in the BIOS. The 5-second limit isn’t a concern for most people, and CUDA developers ought to write code that doesn’t hit it.)

P.S. It’s also dumb to dedicate an expensive, powerful production-level box to development. Just Newegg a PC for $700. But… everyone’s priorities resource-wise are different.

With virtualization catching up, NVIDIA should consider giving hardware-abstraction support to VMware.

VMware supports networking for guest OSes… i.e. guest OSes can access the real networking hardware, and VMware helps them map their virtual resources onto the physical resource. NVIDIA could provide that kind of support to VMware for GPUs.

The problem is, it’s not always optimal to execute in increments of less than 5 seconds. And 5 seconds measured against which card’s execution speed? The 8500, the 280, or the Tesla?

If we have to go with the lowest common denominator, then that really restricts 280 developers.

Here’s my question, though: why is there a 5-second limit? Is there some reason it can’t be disabled?

The 5 second limit only applies to CUDA-enabled cards that have a display attached. If there is another nVidia-based graphics card in the computer which is running the display, the headless CUDA-enabled card is free to run for as long as it likes.

It has something to do with the display driver in Windows, if I’m not mistaken…so no, it is probably not “disable-able”. However, I wonder if nVidia could add some kind of functionality to pop/push the CUDA computing context whenever the display needs to be refreshed, so that there is no limit to the kernel run times (in fact, the context switch/switchback should be transparent to the kernel).
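
By the way, if your runtime is new enough you can ask it which devices actually have the watchdog enabled – the kernelExecTimeoutEnabled field of cudaDeviceProp, which I believe was added fairly recently, so check your version. Rough sketch:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* List devices and whether the display watchdog (run-time limit) applies. */
    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int dev = 0; dev < count; ++dev) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s, run time limit on kernels: %s\n",
                   dev, prop.name,
                   prop.kernelExecTimeoutEnabled ? "yes" : "no");
        }
        return 0;
    }

A headless board should report “no”, which is exactly the case where the 5-second limit doesn’t apply.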

I heard that in Vista this is adjustable? Plus, Vista has that whole GPU driver virtualization that lets multiple 3D apps share the device. That should enable this exact sort of context switching.

I think this whole issue should be fixable soon, at least on Vista (and Win2k8 and Win2k8 HPC Edition).

As for VMware-style GPU virtualization: I think it should be possible/easy to give a GPU entirely to a virtual machine (hence no conflicts). I hope NVIDIA is working on that.

Isn’t it VMware that should be working on that? I just installed a nice 2x quad-core, 32 GB VMware ESX server at work today :blink: I would not mind connecting an S1070 to it for use from a computation VM or two. :D

There are significant technical hurdles in supporting CUDA from a virtual machine on everyone’s end (whether it’s VMware, Xen, KVM, or whatever your environment of choice is). We are aware that some people would find it extremely useful, but keep in mind that GPU acceleration from a VM is still in its infancy.

If a VM owns the GPU fully, and the host OS doesn’t mess with it at all, what would be the problems? (Beside corner cases like host entering power-savings mode or guest being suspended/migrated.)

Not that simple, because you’d need either a VM-aware virtual GPU driver plus a VM-aware host GPU driver, or you’d have to let the VM talk directly to the GPU, which seems like a bad idea. So yeah, it’s not trivial.

That’s what I meant. Why is it a bad idea? I can see some corner cases, and maybe DMA would be difficult (the VM doesn’t know the physical RAM addresses, I think). But just handing over a GPU entirely to a VM seems like a great idea.

But yeah, I would think this functionality would be implemented from VMware’s end. PCI bus virtualization… you’d need a special PCI bus driver on the guest and the host, but that’s about it. Would be very cool. I’m sure VMware and everyone else have been trying along these lines, though.

Someone needs to write a lot of code…

Not that simple. As someone above pointed out, there must be code inside VMware that traps all the memory writes to the NVIDIA GPU hardware (from the CUDA-enabled driver running in the guest OS) and passes them on to the real device using the actual NVIDIA driver on the host OS. This is not a trivial job. Usually, you use system calls to talk to your driver. But you don't call a driver just to write into a register, etc… Do you see the difference? This would invite writing a special driver on the host side as well. Grr… a lot of issues would crop up.

Also, it is NOT like handing over a device to the guest OS. How can the guest OS handle physical interrupts that actually go to the host OS? All interrupts taken by the guest OS are generated by the emulator (VMware or whatever)… You need a dedicated bridge in the host OS that communicates with a VMware module to get this working.

I think you’re mixing issues. If you wanted to do it the “correct” way, you’d trap memory writes and re-process them (which would be difficult). If you wanted to do it the straight and simple way, you’d let the memory writes go right through to the PCI bus and only worry about incoming interrupts.

(This is surely dangerous from a security/stability perspective, but whatever.)

The interrupts don’t seem like such a big problem: the host receives the interrupt and passes it on to the guest; the guest reads it and issues PCI bus transactions as normal. The other problem I thought of was that the guest has to pass physical addresses to the GPU’s DMA engine. But this doesn’t seem like a show-stopper either.

I understand. This means your guest OS, which is actually part of the emulator, will have to be privileged enough to behave like a user-space device driver, i.e. the emulator application itself must be privileged enough to program registers directly (which is usually done by device drivers).

At least in Unix, you could technically do it by mapping via /dev/mem – and you have to be super-user to do that. In Windows, I am not sure if the OS would allow an application to take that much privilege… Remember, you need special emulator support in the host drivers so they know that the interrupt originated because the emulator programmed the device directly from user space. It would be much like passing the interrupt on to the emulator application. I have never heard of an application processing an interrupt before… but it all looks exciting (and quite possible as well).
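
To make the /dev/mem idea a bit more concrete, here is a very rough user-space sketch (Linux, root only). The physical base address below is a made-up placeholder – on a real system you would have to dig the card's BAR out of the PCI config space – and none of the interrupt/DMA issues above are handled.

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define PHYS_BASE 0xfd000000UL   /* hypothetical MMIO base of the card */
    #define MAP_SIZE  0x1000UL       /* map a single page of registers */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) {
            perror("open /dev/mem (are you root?)");
            return 1;
        }

        /* Map the physical register window into this process's address space. */
        void *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, (off_t)PHYS_BASE);
        if (map == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }
        volatile uint32_t *regs = (volatile uint32_t *)map;

        /* Read-only peek at the first register word. */
        printf("first register word: 0x%08x\n", (unsigned int)regs[0]);

        munmap(map, MAP_SIZE);
        close(fd);
        return 0;
    }

That only covers poking registers; the interrupt and DMA sides still need the host-driver support discussed above.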

As you say – it is about “interrupts” and “DMA”. If we have a way to do that cleanly, I think we are in business. Hopefully, someone does it soon. I don't think it is a big job either. But it definitely is a smart job.

It really is a good idea to pass the device on directly to the virtualization software. Not sure if there are industry standards for it, but it is high time the virtualization industry had such support. There would have to be an OS / OS-driver / virtualizer nexus API to do this. Pretty good idea, Mr. Alex.

Best Regards,

Sarnath

Just to add some perspective:

Well, thinking of it… there is virtualization software available that takes virtualization beneath the OS level.

The stack looks like:
Hardware → Virtualization software → All guest OSes

In that case, the hardware can be partitioned among the guest OSes directly. IBM, HP, Sun, and other big companies have logical partitioning implemented in their servers.

But there are other kinds of virtualizers available, like "VMware", whose stack looks like:

Hardware → Host OS → Virtualization software → Guest OSes

Not sure about Xen. In the second case, passing the device through to the guest OS has problems, and that is what we have been discussing here.