My issue is basically as follows: I have a large body of code I want to execute. Since many parts of this code can run in parallel, I want to use CUDA for it. The problem is that the code is too large to be executed directly, and I can only run one kernel per GPU at a time.
The idea I had was to use a small ‘JIT VM’ as kernel, which would then execute this code. Sounds crazy enough, right?
First of all, would this be practical? If so, what would be the best way to implement it? I have experience with VMs on CPUs, where bytecode (or a similar representation, such as an AST) is processed by a JIT and executed via small bits of assembly matching the underlying architecture. What are the chances of implementing a similar strategy in a CUDA kernel? Also, should I execute PTX code, or whatever native assembly format the GPU uses? That is, what should I have nvcc output, and is there a fixed assembly format for the GPU architecture, or will I have to work around this?
PTX is not to NVIDIA GPUs what assembly is to CPUs; it is a pseudo-assembly (an intermediate representation). The underlying (real) assembly language is not public (there is decuda, however).
I’m not sure a VM on the GPU sounds practical (I would say it is not), but it may be an interesting project anyway.
My colleague and I have been thinking some more about this issue and have come up with a different approach. Yes, it still involves a kind of VM, but it no longer requires assembly language or anything of the sort. This new approach should also be a lot easier to implement, not to mention faster.
Yes, the reason we wanted to use CUDA is that it gives us many instruction streams running in parallel. A VM is the only way to implement this in CUDA, as different operations have to be applied to particular sets of data.