I’m curious whether it’s possible to use CUDA to make a virtual machine that runs on the GPU, in other words a software x86-to-CUDA interpreter. And if it were possible, what would the performance be?
It would be pretty much impossible, I think, and performance would likely be bad. CPUs and GPUs each have their own place.
VT-d enabled hardware + virtualizer support (Xen is closer to supporting it…) might help to run CUDA on top of virtual machines.
The link below has a list of VT-d enabled hardware and a nice discussion as well.
Check out this link: http://forums.nvidia.com/index.php?showtopic=78487 (towards the end…)
But the URL does NOT cover the latest Hyper-V from Microsoft. Has anyone tried?
I just read the TechRepublic PDF. It talks about synthetic devices, where one can assign devices directly to guest OSes… If that works for GPUs or Tesla, we should get CUDA going on virtualization platforms…
Has anyone got this installed? (Note: Hyper-V is only for 64-bit platforms…)
It sounds like what you’re suggesting is running CUDA from the virtual machine, but what I meant was running the virtual machine with CUDA. Or am I wrong?
How could you run a VM on CUDA…??
CUDA cannot run any general-purpose OS. So… Hmm… I don’t see your point.
Oops… Sorry, I just read “virtualization” and “CUDA” and started blabbering… Sorry about that…
I think GPUs are just not cut out for all that. Look at warps: all the threads in a warp execute the same instruction at the same time. It just does not fit the model of general-purpose computing at all.
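To make the warp point concrete, here is a toy illustration of my own (not from the post): in the kernel below, odd and even threads within the same warp take different branches, so the hardware runs the two paths one after the other with the inactive lanes masked off, rather than truly in parallel.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Odd and even lanes of the same warp disagree on the branch condition, so the
// two paths execute serially, each time with the non-participating lanes masked off.
__global__ void divergent(int* out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i + 1000;
}

int main()
{
    int* d_out;
    cudaMalloc((void**)&d_out, 32 * sizeof(int));
    divergent<<<1, 32>>>(d_out);   // launch a single warp

    int h_out[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("lane 0 -> %d, lane 1 -> %d\n", h_out[0], h_out[1]);
    cudaFree(d_out);
}
```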
Emulating an x86 CPU on a GPU may seem irrelevant, but it could be done, and performance could be good if you emulate a huge number of x86 CPUs simultaneously.
A GPU may be efficient at the emulation because there’s a lot of invariant instruction-decoding code that can execute in parallel across a warp, while the code executed for each thread will probably run sequentially (huge divergence between the instructions the different emulated x86 cores are executing at any point in time). Parallelizing the instruction decoding may enable a good performance level, especially if you try to hide instruction differences (i.e. a goto and a mov end up being handled very similarly).
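To sketch what such a one-thread-per-emulated-core interpreter could look like, here is a minimal, hypothetical CUDA example. It emulates an invented 3-opcode toy ISA rather than x86 (the opcode names and the tiny register file are mine, purely for illustration): the fetch/dispatch scaffolding is the same for every thread in a warp, while the per-instruction handlers in the switch are where divergence would show up.

```cpp
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Invented toy ISA: 4 registers, 3 opcodes, one emulated core per CUDA thread.
enum Op : uint8_t { OP_MOVI = 0, OP_ADD = 1, OP_HALT = 2 };
struct Insn { uint8_t op; uint8_t dst; uint8_t src; int32_t imm; };

__global__ void emulate(const Insn* programs, int insns_per_core,
                        int32_t* reg_files, int num_cores)
{
    int core = blockIdx.x * blockDim.x + threadIdx.x;
    if (core >= num_cores) return;

    const Insn* code = programs + core * insns_per_core;  // this core's program
    int32_t*    regs = reg_files + core * 4;               // this core's registers
    int pc = 0;

    while (pc < insns_per_core) {
        Insn i = code[pc++];            // "fetch": identical code path for every thread
        switch (i.op) {                 // "dispatch": threads diverge here
            case OP_MOVI: regs[i.dst] = i.imm;        break;
            case OP_ADD:  regs[i.dst] += regs[i.src]; break;
            case OP_HALT: pc = insns_per_core;        break;
        }
    }
}

int main()
{
    const int num_cores = 1024, insns_per_core = 3;
    std::vector<Insn> prog(num_cores * insns_per_core);
    for (int c = 0; c < num_cores; ++c) {
        prog[c * insns_per_core + 0] = { OP_MOVI, 0, 0, c };  // r0 = core id
        prog[c * insns_per_core + 1] = { OP_ADD,  0, 0, 0 };  // r0 += r0
        prog[c * insns_per_core + 2] = { OP_HALT, 0, 0, 0 };
    }
    Insn* d_prog;  int32_t* d_regs;
    cudaMalloc((void**)&d_prog, prog.size() * sizeof(Insn));
    cudaMalloc((void**)&d_regs, num_cores * 4 * sizeof(int32_t));
    cudaMemcpy(d_prog, prog.data(), prog.size() * sizeof(Insn), cudaMemcpyHostToDevice);
    emulate<<<(num_cores + 127) / 128, 128>>>(d_prog, insns_per_core, d_regs, num_cores);
    cudaDeviceSynchronize();
    cudaFree(d_prog); cudaFree(d_regs);
}
```

With real x86 code the threads of a warp would usually hit a different handler at every step, so the handlers serialize; only the fetch/dispatch scaffolding stays converged.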
You will have many problems with the memory:
- There’s less than 4 MB available per SP on current implementations, limiting the emulated code to this space.
- Memory accesses will slow the emulation down, because you will end up with at least 1 GB/s of IO per emulated CPU (compare that with the 100 GB/s+ of an actual L1 cache).
- There’s no cache, so any PUSH/POP or use of a local stack frame will go through GPU main memory (ouch!).
- You will have to add instructions to protect the memory areas of one emulated x86 CPU from being overwritten by another one.
I don’t think it’s undoable; for some specific x86 code it may be doable, and using 64+ threads per SM it could even run well, but I seriously doubt that it could compete with current-generation (Core i7) CPU architectures in terms of performance.
Emulating ARM or any other RISC-oriented instruction set may be an interesting use of CUDA, as it’s easier to decode and execute than the x86 ISA, and CUDA only exists on x86 architectures (sadly), so anyone with access to CUDA already possesses an x86 CPU :-)
There’s little point in kludging a parallel processor to emulate a sequential system. Much better to find the places in your x86 code where you are forcing the CPU to emulate a data-parallel processor through big for loops, SSE, or threads and offload those tasks to CUDA. :)
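To make that suggestion concrete, here is a minimal sketch of the usual pattern (names and sizes are mine, for illustration): a big element-wise loop that a CPU would grind through sequentially or via SSE, expressed directly as a CUDA kernel with one element per thread.

```cpp
#include <vector>
#include <cuda_runtime.h>

// CPU version: the "big for loop" the CPU works through sequentially (or 4-wide with SSE).
void saxpy_cpu(float a, const float* x, float* y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// GPU version: the same data-parallel work, one element per thread.
__global__ void saxpy_gpu(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy_gpu<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, n);
    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
}
```

The decision of which loops to move is made by the programmer at the source level, which is exactly the information an x86-binary-to-CUDA translator would have to rediscover the hard way.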
Very interesting. Thanks everybody, and especially iAPX for the rundown.
But since this thread dates from early 2007, I was wondering if the story would be different now,
with the recent GeForce GTX 295, which is something like 5.57 times faster than an i7 950 (4 cores at 3.0 GHz) according to some testing with Pyrit:
Google Code Archive - Long-term storage for Google Code Project Hosting
Couldn’t we run a lot of x86 or SSE2 instructions on a GTX 295?
I guess the RAM would still be the problem.
What do you guys think?
The problem continues to be latency and bandwidth over the PCI-Express bus unless you load the entire process onto the GPU, in which case it is terribly inefficient.
The GPU is actually slower per operation than the CPU, but it is very wide. Even running single-precision SSE instructions on the GPU would leave a single multiprocessor idle 87.5% of the time (a 4-wide SSE operation fills only 4 of a warp’s 32 lanes), and a GTX 295 has 2 GPUs with 30 multiprocessors on each GPU.
x86 emulation would not be very effective unless you could extract a huge amount of parallelism from the x86 binaries you were running, which just is not practical. It is far easier to specify the massively parallel operations at the source-code level and compile down to CUDA than to try to infer your way back up from machine instructions.
I’ve always thought that it would be funny to boot Windows on a GPU. You absolutely could do it, given a massive engineering effort. Transmeta did something slightly less ambitious a while back (think of running an x86 build of Windows on an Intel Itanium). The only benefit in my mind would be psychological, helping to convince people that there really isn’t a fundamental difference between CPUs and GPUs, but there are probably more productive ways of doing that.
Hmm… Say we emulate x86 stuff… Say you use zero-copy to make the RAM look bigger… How would one expose the PCI bus, network cards and other devices? Does zero-copy support that? On the whole, it looks very confusing…
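For context, “zero-copy” here just means mapping pinned host RAM into the GPU’s address space so kernels can read and write it over PCI-Express on demand; it says nothing about exposing the PCI bus, NICs or other devices to the GPU. A minimal sketch of the runtime calls involved (the kernel and names are mine, for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel that touches host RAM directly through a mapped (zero-copy) pointer.
// Every access crosses PCI-Express, which is exactly why it is slow.
__global__ void touch(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main()
{
    const int n = 1024;

    // Must be set before the CUDA context is created to allow mapped host allocations.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int* h_data = nullptr;
    cudaHostAlloc((void**)&h_data, n * sizeof(int), cudaHostAllocMapped);  // pinned + mapped
    for (int i = 0; i < n; ++i) h_data[i] = i;

    int* d_data = nullptr;
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);  // device-side view of the same RAM

    touch<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    printf("h_data[0] = %d\n", h_data[0]);  // 1: the kernel wrote straight into host RAM
    cudaFreeHost(h_data);
}
```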
Alternatively, I understand that the graphics card could not be used for the display, but is there any way to let the VM access the GPU for numeric computation? I.e. is there any way I can run a Linux and a Windows VM on the same machine via VMware’s ESXi and have both VMs offloading calculations to the GPU?
Not currently, but it seems it should be possible in theory, as something like this exists for Quadro GPUs, where virtual machines can use a GPU in the host system. But there a GPU is assigned to a single VM, and it is not possible for two VMs to use the same GPU. I don’t remember the name of the tech, but it should be on the NVIDIA site.
Interesting, I’ll have a look. Thanks