Really Cool Idea: exploring possibilities of using the GPU as a virtual CPU core

Exploring possibilities of using the GPU(s) as virtual CPU core(s).

What if we could create a virtual CPU device driver that fed work to the GPU through CUDA? Windows would then see it as simply another core to send tasks to. You could group different numbers of shaders together to simulate different speeds, depending on whether you were doing something more CPU-intensive than GPU-intensive. You would also have control over how many threads to expose, which would be really useful, especially on lower-end systems. The only limitations I can see would be bus bandwidth and latency, combined with the overhead of having to translate everything into CUDA.

Anyhoo, that’s the idea. Unfortunately I don’t have the skills to make it, but I hope maybe someone out there will pick it up.

Just imagine what this would do for people with single- and dual-core machines: if they have an NVIDIA card, they would be able to use their GPU for anything, simply through native CPU thread requests.

Ok thoughts people :P

CUDOS is born. :)

GPUs are designed to process massively parallel code; CPUs, as of yet, are not. A GPU would be pretty rubbish at running an operating system designed for a CPU, combined with the fact that you would have to emulate the x86 or x64 instruction set and the entire architecture of the CPU…

Agreed with Qazax. You should think of a single CUDA core as a sled dog, and a single CPU core as a horse. Individually, a CUDA core is no match for a CPU core, but if you chain up enough of them on a suitably huge workload, you get large throughput. CPU emulation is not a data parallel task, and a poor fit for GPUs.

It’s all about tradeoffs: GPUs are not magic speed machines. They achieve their incredible performance by focusing on a narrower workload, allowing them to allocate the die area differently than a CPU. Sadly, this makes the GPU a very poor platform for a lot of software out there, but an excellent platform for a small, but growing, problem domain.

Bad idea…

For sure you could use the GPU to speed up simulation of arrays of logic gates, and therefore any kind of logic circuit - including simple CPUs. Due to its higher memory bandwidth (some models have ~120 GB/sec), the GPU may achieve better performance in this area compared to simulating the same thing on a CPU.

However, simulating something at the gate level usually means low performance: it takes many clock cycles on the emulator to simulate a single clock cycle of the emulated system.

For example: would it be feasible to simulate the MOS 6502 processor on a GPU in real time? 5000 transistors, therefore somewhat fewer logic gates. Clock rate up to 14 MHz (the Commodore 64 used ~1 MHz). Would it be fun? Hell yeah. Would it be worthwhile? Hell no.

Christian
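
To make the gate-level idea concrete, here is a minimal CUDA sketch of the kind of simulation Christian describes: a flat netlist of 2-input NAND gates, evaluated one gate per thread, with one kernel launch per simulated clock tick. Everything in it (the Gate struct, step_kernel, the toy ring netlist) is illustrative rather than taken from any real simulator, and it assumes every gate output is latched each tick; a real simulator would levelize the netlist or iterate to a fixpoint within a tick. It also illustrates the performance caveat: one kernel launch per emulated cycle is exactly why real-time MHz-rate emulation of even a tiny CPU is a poor fit.

```cuda
// Sketch only: one thread per 2-input NAND gate, one kernel launch per tick.
// Assumes every gate output is latched each tick (reads only the previous
// tick's values); real netlists need levelization or fixpoint iteration.
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

struct Gate { int inA, inB; };   // indices of the two input signals

__global__ void step_kernel(const Gate* gates, const unsigned char* prev,
                            unsigned char* next, int n)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g < n)
        next[g] = !(prev[gates[g].inA] & prev[gates[g].inB]);  // 2-input NAND
}

int main()
{
    const int N = 5000;                        // roughly 6502-sized, per the post above
    std::vector<Gate> h_gates(N);
    std::vector<unsigned char> h_state(N, 0);  // reset state: all signals low
    for (int g = 0; g < N; ++g)                // toy netlist: a ring of NAND gates
        h_gates[g] = { g, (g + 1) % N };

    Gate* d_gates; unsigned char *d_prev, *d_next;
    cudaMalloc(&d_gates, N * sizeof(Gate));
    cudaMalloc(&d_prev, N);
    cudaMalloc(&d_next, N);
    cudaMemcpy(d_gates, h_gates.data(), N * sizeof(Gate), cudaMemcpyHostToDevice);
    cudaMemcpy(d_prev, h_state.data(), N, cudaMemcpyHostToDevice);

    // One kernel launch (plus a buffer swap) per simulated clock tick; the
    // launch overhead alone makes real-time MHz-rate emulation unlikely.
    for (int tick = 0; tick < 1000; ++tick) {
        step_kernel<<<(N + 255) / 256, 256>>>(d_gates, d_prev, d_next, N);
        std::swap(d_prev, d_next);
    }
    cudaMemcpy(h_state.data(), d_prev, N, cudaMemcpyDeviceToHost);
    printf("signal[0] after 1000 ticks: %d\n", (int)h_state[0]);

    cudaFree(d_gates); cudaFree(d_prev); cudaFree(d_next);
    return 0;
}
```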

Maybe the OP’s question is a little naive, but the direction it is going in is interesting. What would you think if it were rephrased as follows:

“Exploring possibilities of using the GPU to replace CPU core(s).”

“What would be necessary to design a usable system around a GPU only, with no CPU cores at all?”

Things like self-hosting CUDA compilers and parallel operating systems come to mind. What do you think?

Along these lines, a crazy idea I like: running completely sequential apps on the whole GPU using value prediction (that is, guesses…)

Like at the end of this talk : http://www.irit.fr/Toulouse2009/Toulouse20…/Gaudiot_vp.pdf

Definitely not power-efficient, though. :)
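
For what it’s worth, here is a toy CUDA sketch of the value-prediction idea (not taken from the linked talk; the functions f and predict and the step count are made-up stand-ins): a strictly sequential chain x[k+1] = f(x[k]) is attacked by letting thread k guess x[k], computing f(guess) speculatively, and then running a cheap sequential commit pass that keeps every speculative result whose guess turned out right and recomputes the rest.

```cuda
// Toy sketch of speculative execution via value prediction; all names are
// illustrative. Thread k guesses x[k], computes f(guess), and a sequential
// commit pass keeps the results whose guesses were correct.
#include <cstdio>
#include <cuda_runtime.h>

#define STEPS 1024

// An artificially expensive sequential step: x[k+1] = f(x[k]).
__host__ __device__ unsigned int f(unsigned int x)
{
    for (int i = 0; i < 10000; ++i)
        x = x * 1664525u + 1013904223u;   // LCG churn, just to burn cycles
    return x;
}

// A deliberately naive value predictor: guess that x[k] == k.
// The better the guesses, the more speculative work survives the commit pass.
__host__ __device__ unsigned int predict(int k) { return (unsigned int)k; }

// Thread k speculatively computes step k from its guessed input.
__global__ void speculate(unsigned int* guess, unsigned int* candidate)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < STEPS) {
        guess[k]     = predict(k);    // guessed value of x[k]
        candidate[k] = f(guess[k]);   // speculative value of x[k+1]
    }
}

int main()
{
    unsigned int *d_guess, *d_cand;
    cudaMalloc(&d_guess, STEPS * sizeof(unsigned int));
    cudaMalloc(&d_cand,  STEPS * sizeof(unsigned int));
    speculate<<<(STEPS + 255) / 256, 256>>>(d_guess, d_cand);

    static unsigned int h_guess[STEPS], h_cand[STEPS];
    cudaMemcpy(h_guess, d_guess, sizeof(h_guess), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_cand,  d_cand,  sizeof(h_cand),  cudaMemcpyDeviceToHost);

    // Sequential validation/commit pass: only mispredicted steps re-pay for f().
    unsigned int x = 0;               // x[0]
    int hits = 0;
    for (int k = 0; k < STEPS; ++k) {
        if (h_guess[k] == x) { x = h_cand[k]; ++hits; }   // guess was right: keep it
        else                 { x = f(x); }                // misprediction: redo serially
    }
    printf("final x = %u, correct guesses: %d / %d\n", x, hits, STEPS);

    cudaFree(d_guess); cudaFree(d_cand);
    return 0;
}
```

The scheme only pays off when f is expensive and the predictor is right often; with a dumb predictor like this one, the GPU mostly computes results that get thrown away, which is exactly the power-efficiency objection above.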

I can’t really imagine a full blown OS being pulled by a dog sled (in case of modern GPUs, a rat sled with 10’000 rats). You really need horses, or at least donkeys.

It doesn’t mean the paradigm shift is fundamentally impossible, I just can’t personally imagine it. Maybe a light, power-saving CPU plus a fast GPU with integrated fast RAM - that might do really well with working parallelizing compilers (something I have yet to see ;) ) or effective runtime ILP-extracting-and-distributing machinery (nigh-impossible with x86 and a sequential, imperative programming style).

There’s just too much sequential history in every OS and app to be able to port existing things effectively. And starting from scratch, developing massively parallel email clients and word processors is just too much effort for any single institution to undertake.

How about a rat sled with a few king rats with whips to direct the others?

You probably won’t see any of them until compilers become significantly smarter than people.

Agreed.

Maybe it would be too much effort for a single institution, but exactly how hard would it be, and how big of a performance difference between CPUs and GPUs would there need to be to make it worth it?

Could we get away with just doing the important apps? What if someone did the following:

    Write a binary translator along the lines of QEMU, such that you could boot Linux on a single EU and run at the speed of a 100-200 MHz CPU.

    Implement a CUDA compiler in CUDA.

    Add I/O support for full-speed networking and disk.

    Implement some killer apps in CUDA only (games, video codecs, Google Maps).

That is certainly a lot of work, but not impossible even for a small to medium-sized company. Take a 10-20x performance hit on legacy applications (they would still run), but offer a significant and growing advantage on the apps that people actually care about.

And now we’re coming around to the Cell architecture. :) The Cell has a general purpose PowerPC core (the “PPE”) sitting on a ring bus with 8 SIMD cores (the “SPEs”). It is not hard to imagine applying a similar design to the current crop of CUDA GPUs.

Stuff an Atom x86_64 core with hyperthreading onto the die and give it direct access to the GPU memory bus, and you’d have a pretty awesome self-contained device. Similarly, I expect that AMD Fusion/Bulldozer/whatever you call it will look the same way. General purpose CPU core bolted to a bunch of SIMD-specialized cores.

Nice thought Seibert…

CPU cores like Intel’s are much, much more powerful than the ALU++ cores in a GPU… So it would be nice if a sequential CPU like Intel’s could auto-transform itself into an array of GPU-like cores dynamically, as instructed by software, and then return back to CPU mode…

It’s like a whale transforming into an array of sharks for effective hunting and then transforming back into a whale… (oh… what an analogy)

I think the answer to future computing is FPGAs (or something similar), and especially active partial reconfiguration.
A device that could rebuild itself while running - a sort of self-modifying hardware. Need more FPU units? No problem! Trash those integer arithmetic circuits and use the space for an additional FPU.
But then again, the most difficult thing is software: how do you compile for a computer with unknown internals?

Ahh, the science-fiction. Care to make it happen? ;)

There’s one big problem with all that we’re talking about here - backward compatibility. GPUs develop fast these days because they don’t have to be compatible with 20-year-old stuff. If we do start running an OS on them, they will eventually suffer from the “x86 curse”.

In 20 years time, GPUs will be a thing of the past or they will be called EXA-GPUs and will use EXA-CUDA with brand new ideas/framework/code/… - problem solved :)

Indeed very interesting times that we live in… :)

I would like to see some melding of ideas between current GPU architectures and these: http://ieeexplore.ieee.org/xpl/freeabs_all…rnumber=4443192

Reconfiguring an FPGA every other instruction may take a lot of time. Possibly, I don’t know FPGAs well.

About the x86 curse: it’s not just the particular instruction set. While x86 is ugly and stupid, the main problem is the style of programming that is sequential at its very core. You’d get similar problems with dynamically extracting massive instruction-level parallelism if you were using MIPS or whatever, just perhaps without a few annoying technicalities. I mean, you can find ILP for maybe 4-5 pipelines? We need 4-5 thousand. You’d need a lookahead analyzer that can navigate the control flow to find a couple of thousand independent instructions and distribute them to cores. And ideally it should be able to do that within a single cycle, to saturate the ALUs…

As long as programmers write sequential programs, finding parallelism in their code is like trying to outsmart them. Doing the same at runtime, with compiled programs, is trying to outsmart the programmer and the sequential compiler.

IMHO the programmers need to change their paradigm first. Designing massively parallel hardware so that it works with naive sequential code is counterproductive and by definition ineffective, so let’s not get carried away with parallelizing tricks and instead find a way to train the next generation of coders to think horizontally :)

FPGA-based computing has been around for some time in commercial form, actually. I remember hearing about these guys years ago:

http://www.starbridgesystems.com/

Unfortunately, they appear to have gone bankrupt as their last news item is announcing a foreclosure sale. As with most radical changes to computer architecture, the limiting factor is language design and compiler technology. Hardware people can design all sorts of awesomely weird systems that will take you a decade to figure out how to program effectively. :)

Along with software complexity, as seibert mentioned, competing with the exponential growth of Moore’s law is the main thing preventing the adoption of reconfigurable logic. Why would I need to be able to time-multiplex multiple processors onto the same area when I can just wait a year and build a chip with all of them at the same time?

One other thing: apparently, standard single- and double-precision floating-point operations eat up FPGA gates pretty fast. This also tends to discourage FPGA use in a lot of HPC sub-areas.