EXE files? Just what it says

Can .exe programs run on the CUDA architecture? If not, does any program exist that will turn .exe files into CUDA files? I have a program for chess endgame database generation that only comes in .exe format, and I feel it could benefit greatly from GPU computing. I would like to know if there is any way to run a .exe file in CUDA. (It's a command-line program, if that helps.)

No.

CUDA uses a different architecture than your standard CPU, so you can't just translate the code over… you need to rethink your algorithms to run in parallel, rather than serially the way the EXE file's code executes.
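To make the contrast concrete, here's a trivial made-up sketch (not taken from the chess program, names are invented): the same array scaling written first as a serial CPU loop, then as a CUDA kernel where every element gets its own thread.

```cuda
// Made-up illustration: serial CPU loop vs. the same work as a CUDA kernel.
#include <cstdio>

// CPU version: one thread walks the array element by element.
void scaleSerial(float *a, int n, float s)
{
    for (int i = 0; i < n; ++i) a[i] *= s;
}

// GPU version: the loop disappears; each thread owns one index.
__global__ void scaleParallel(float *a, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main()
{
    const int n = 1 << 10;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleParallel<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("host[0] = %f\n", host[0]);
    cudaFree(dev);
    return 0;
}
```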

No… as ‘profquail’ has rightly pointed out, you cannot just convert any .exe to make it work on the GPU. You have to modify the source code (in accordance with the CUDA programming model) and compile it to generate a CUDA-compatible executable.

Total non-sequitur…

I was expecting Intel’s x86 Larrabee to let you do this. But, nope, not a chance.

Theoretically, you could write an x86 virtual machine on top of CUDA :-)
Just kidding :-)

Maybe that could work? You transform OS threads into CUDA threads (you could do this, couldn't you? you've got the primitives like atomics and barriers…) and run a process per multiprocessor. (Then, e.g., run MPI between processes.)
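Something like this rough sketch, maybe (all names invented, and it's only a stand-in for the idea): each block plays the role of one "process", __syncthreads() takes the place of an OS barrier, and atomicAdd() takes the place of an atomic counter shared between the "processes".

```cuda
// Hypothetical sketch: one block = one "process", its threads = OS threads.
#include <cstdio>

__global__ void processKernel(int *sharedCounter)
{
    // Per-"process" scratch space in shared memory.
    __shared__ int localSum[256];

    int tid = threadIdx.x;
    localSum[tid] = tid;

    __syncthreads();               // barrier across the "process" threads

    // Thread 0 of each "process" publishes a result atomically, the way an
    // OS thread might update a shared counter.
    if (tid == 0) {
        int sum = 0;
        for (int i = 0; i < blockDim.x; ++i) sum += localSum[i];
        atomicAdd(sharedCounter, sum);
    }
}

int main()
{
    int *counter;
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    // Four "processes" of 256 "threads" each.
    processKernel<<<4, 256>>>(counter);

    int result = 0;
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("combined result: %d\n", result);
    cudaFree(counter);
    return 0;
}
```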

I think it could totally be done and could work well, if the threads in the original program aren't too divergent.

The problem would be porting all the libraries.

You could implement all the x64 registers within 128 CUDA regs (you need 32 DWORDs for the integer registers, 64 DWORDs for SSE, 20 DWORDs for x87, leaving 12 DWORDs for anything else). Then, just interpret the x86 instruction set literally, without using rename registers or any of that Pentium mumbo-jumbo. You could do it, since you don't have to worry about the huge DRAM latency these instructions normally incur. PLUS, you could map the "registers+stack" paradigm in x86 (if you know what I'm talking about, you know who you are) efficiently to the shared memory!
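A toy sketch of what that interpreter loop could look like (the mini instruction set and all names are completely made up, and it only emulates a handful of GPRs, nowhere near a real x86 decoder): each thread holds its emulated register file in local variables and walks a byte stream fetch-decode-execute style.

```cuda
// Toy fetch-decode-execute loop with a per-thread emulated register file.
#include <cstdio>
#include <cstdint>

enum Op : uint8_t { OP_MOV_IMM, OP_ADD, OP_HALT };   // invented mini ISA

__global__ void interpKernel(const uint8_t *code, uint32_t *result)
{
    uint32_t regs[16] = {0};          // stand-in for the x64 GPR file
    int pc = 0;

    for (;;) {
        uint8_t op = code[pc++];
        if (op == OP_MOV_IMM) {        // mov rDst, imm8
            uint8_t dst = code[pc++];
            regs[dst] = code[pc++];
        } else if (op == OP_ADD) {     // add rDst, rSrc
            uint8_t dst = code[pc++];
            uint8_t src = code[pc++];
            regs[dst] += regs[src];
        } else {                       // OP_HALT
            break;
        }
    }
    result[blockIdx.x * blockDim.x + threadIdx.x] = regs[0];
}

int main()
{
    // r0 = 2; r1 = 3; r0 += r1; halt
    uint8_t program[] = { OP_MOV_IMM, 0, 2, OP_MOV_IMM, 1, 3,
                          OP_ADD, 0, 1, OP_HALT };

    uint8_t *dCode;  uint32_t *dResult;
    cudaMalloc(&dCode, sizeof(program));
    cudaMalloc(&dResult, sizeof(uint32_t));
    cudaMemcpy(dCode, program, sizeof(program), cudaMemcpyHostToDevice);

    interpKernel<<<1, 1>>>(dCode, dResult);

    uint32_t r0 = 0;
    cudaMemcpy(&r0, dResult, sizeof(r0), cudaMemcpyDeviceToHost);
    printf("r0 = %u (expect 5)\n", r0);
    cudaFree(dCode); cudaFree(dResult);
    return 0;
}
```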

I'm not sure how great the performance would be without a cache. But if each thread reads from thread-local memory, all DRAM accesses would automatically be coalesced. I don't think it'd be too bad. Remember, a GPU's bandwidth to its DRAM is about as much as a CPU's to its L1. Across gigabytes.
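For instance, something along these lines (names invented): store the emulated per-thread memory strided by thread index, so when every thread in a warp touches its own copy of the same emulated address, the warp's accesses land in consecutive words of DRAM and coalesce into one transaction.

```cuda
// Sketch of a strided per-thread memory layout that coalesces naturally.
#include <cstdio>

// Word `addr` of thread `tid`, with `nThreads` total threads.
__device__ __forceinline__ unsigned &emuWord(unsigned *mem, int tid,
                                             int addr, int nThreads)
{
    return mem[addr * nThreads + tid];   // strided, not [tid * size + addr]
}

__global__ void touchLocal(unsigned *mem, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread writes *its own* word 7; the 32 accesses of a warp land
    // in 32 consecutive words of DRAM.
    emuWord(mem, tid, 7, nThreads) = tid;
}

int main()
{
    const int nThreads = 256, wordsPerThread = 16;
    unsigned *mem;
    cudaMalloc(&mem, nThreads * wordsPerThread * sizeof(unsigned));

    touchLocal<<<1, nThreads>>>(mem, nThreads);
    cudaDeviceSynchronize();
    printf("done\n");

    cudaFree(mem);
    return 0;
}
```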