My issue is basically as follows: I have a large body of code I want to execute. As many parts of this code can be executed in parallel, I want to use CUDA for it. The problem is that the code is too large to be executed directly, plus I can only run one kernel per GPU.
The idea I had was to use a small ‘JIT VM’ as the kernel, which would then execute this code. Sounds crazy enough, right?
First of all, would this be practical? If so, what would be the best way to implement it? I have experience with VMs on CPUs, where bytecode (or something similar, such as ASTs) is processed by the JIT and executed by small bits of ASM matching the underlying architecture. What are the chances of implementing a similar strategy in a CUDA kernel? Also, should I execute PTX code, or whatever ASM format the GPU actually uses? That is, what should I let nvcc output, and is there a fixed ASM format for the GPU architecture, or will I have to work around this?
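To make it concrete, the kind of interpreter loop I have in mind looks roughly like the toy sketch below, just moved into a kernel. All names and opcodes are made up for illustration; the question is whether a loop like this is reasonable on the GPU at all.

```
#include <cstdio>
#include <cuda_runtime.h>

// Toy 4-opcode stack machine; every name here is illustrative only.
enum Op { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };
struct Instr { int op; float imm; };

__global__ void vm_kernel(const Instr *code, int code_len, float *out)
{
    float stack[16];
    int sp = 0;

    // Plain fetch/decode/execute loop. Every thread walks the same bytecode
    // here, but each thread could just as well index into its own program.
    for (int pc = 0; pc < code_len; ++pc) {
        Instr in = code[pc];
        switch (in.op) {
            case OP_PUSH: stack[sp++] = in.imm;             break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case OP_HALT: pc = code_len;                    break;
        }
    }
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = stack[0];
}

int main()
{
    // Program: (2 + 3) * 4
    Instr prog[] = { {OP_PUSH, 2}, {OP_PUSH, 3}, {OP_ADD, 0},
                     {OP_PUSH, 4}, {OP_MUL, 0},  {OP_HALT, 0} };
    Instr *d_prog; float *d_out;
    cudaMalloc((void **)&d_prog, sizeof(prog));
    cudaMalloc((void **)&d_out, 32 * sizeof(float));
    cudaMemcpy(d_prog, prog, sizeof(prog), cudaMemcpyHostToDevice);

    vm_kernel<<<1, 32>>>(d_prog, 6, d_out);

    float result;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", result); // expected: 20.0
    cudaFree(d_prog); cudaFree(d_out);
    return 0;
}
```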
PTX is to NVIDIA GPUs not what assembly is to CPUs; it is pseudo-assembly. The underlying (real) assembly language is not public (there is decuda, however).
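On the ‘should I execute PTX’ part: PTX is what you ship, and the driver JIT-compiles it to the GPU’s real instruction set when the module is loaded. A minimal host-side sketch with the driver API (the file name vm.ptx and the entry name vm_step are made up, and error checking is omitted):

```
#include <cuda.h>

int main(void)
{
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    // Load a PTX image; the driver JIT-compiles it for the actual GPU.
    CUmodule mod;
    cuModuleLoad(&mod, "vm.ptx");   // or cuModuleLoadData() with an in-memory PTX string

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "vm_step");

    // ... set up kernel arguments and launch fn here ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

You can get the PTX itself out of nvcc (it is a documented, text-based format), whereas the real per-architecture ISA is what decuda tries to recover.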
I’m not sure a VM on the GPU sounds practical (I would say it is not), but it may be an interesting project anyway.
Recently, I was involved in a porting effort. I would NOT call it a BIG project.
But the thing is, you should take a portion of the code that can be parallelized and use CUDA to speed that up, and so on…
You cannot run your application directly on the GPU anyway. So take a section, parallelize it, CUDAize that section, and run it.
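To make the ‘CUDAize that section’ part concrete, this is the kind of transformation meant; a toy example, not from your code, where one thread takes over one iteration of an independent per-element loop:

```
// CPU version: an independent per-element loop buried in a larger program.
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

// The same section CUDAized: one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launched from the host, e.g.:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```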
At this point, I suspect you are looking at CUDA from 15,000 ft and making decisions from up there… Do some small stuff, gain a foothold, and then re-think your decision.
My colleague and I have been thinking some more about this issue and have come up with a different approach. Yes, it still involves a kind of VM, but it no longer requires assembly language or anything like that. This new approach should also be a lot easier to implement, not to mention faster.
Note that if you implement a VM kernel on the GPU, that kernel has to run in parallel to get any speedup in the first place…
Which means each thread of your VM kernel should probably be running its own instruction stream…
If you have multiple instruction streams in your kernel, each doing a different computation, then this would make sense… Otherwise, it would make more sense to run the code on CUDA directly.
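A rough sketch of what ‘each thread runs its own instruction stream’ could look like, assuming the per-thread programs are packed back to back in one array (all names here are hypothetical):

```
// Each thread fetches and interprets its own program against its own accumulator.
// The divergent dispatch serializes within a warp, which is exactly why this only
// pays off when the streams really are different computations.
enum Op { OP_SET, OP_ADD, OP_MUL, OP_DONE };
struct Instr { int op; float imm; };

__global__ void multi_stream_vm(const Instr *programs, // all programs, back to back
                                const int   *start,    // start[t] = first instruction of thread t
                                float       *out)
{
    int t  = blockIdx.x * blockDim.x + threadIdx.x;
    int pc = start[t];
    float acc = 0.0f;

    for (;;) {
        Instr in = programs[pc++];
        if      (in.op == OP_SET) acc  = in.imm;
        else if (in.op == OP_ADD) acc += in.imm;
        else if (in.op == OP_MUL) acc *= in.imm;
        else /* OP_DONE */        break;
    }
    out[t] = acc;
}
```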
Yes, the reason we wanted to use CUDA is that the code consists of many instruction streams running in parallel. A VM is the only way to implement this in CUDA, since different operations are applied to particular sets of data.