an alternative approach to solve issues such shown in previous question, aka cuda code compilation time of huge kernel taking ages, would be to implement some kind of byte code interpreter in order to bypass compilation stage.
most efficient implementations of bytecode interpreters relies on computed gotos/threaded code (a la Forth) but unfortunately nvcc does not support such feature in kernel code error: address of label extension is not supported in __device__/__global__ functions
are you aware of workaround/alternative design ?
thx
I would suggest performing a literature search. Here is one relevant publication:
Ahmet Celik, Pengyu Nie, Christopher J. Rossbach, Milos Gligoric, “Design, Implementation, and Application of GPU-Based Java Bytecode Interpreters.” Proc. ACM Program. Lang. 3, OOPSLA, Article 177 (October 2019), 28 pages. https://doi.org/10.1145/3360603
I have not read the above paper. My initial reaction was that bytecode interpreters are a poor match for GPUs, but apparently people who have put more thought into this arrived at a different conclusion.
having done extensive research on how to implement a bytecode interpreter in CUDA, as that is my work project for the past 2 years, doing computed gotos or threading is not feasible in current nvcc, even when/if abusing the brx.idx ptx instruction. Simply do normal interpreter loop, possibly macroops etc, and try putting your code in the fastest possible memory (constant is nice, shared is even nicer and you can do manual paging in/out of that if needed)
Do you want to interpret the same bytecode program with different data? (How many? For parallelization.) Or each computation has a different bytecode program?
That is good, then you do not have warp divergence. Think about a language, which can be turned into bytecode as effectively as possible. Memory accesses (temporary memory, global memory), computations. Try to do as much branch-free code as possible and stream the bytecode into shared memory.
@raph38130 unfortunately it’s not publicly available, but i can describe some of it
we have a stack based bytecode interpreter from a public spec, are in a more general setting than (1 code, N data), indeed we support (N code, M data, + runtime code modification), but are heavily optimised for the same-code situation. We keep code in constant mem for good caching performance in the dispatch loop, with dynamic moving out to code in generic global memory when code size is too large. This gets you to 1-10M op/s due to branch resolution & gmem latency, and our opcodes are heavy enough that it’s mostly access to the stack for each thread that bottlenecks. For that we do windowed stack caching in shmem, adapted after benchmarking to about 8 items per thread (thus push/pop doesn’t incur gmem latency until the stack height has increased by 4, at which point we slide the window 4 items up, etc. We use all available dynamic shared memory for this), which lets us achieve ~80% memory & compute utilisation (as per nsight profiling). also important to consider coalescing of memory access across threads, for which we have a bunch of SoA adapted data structures. There is a 2009 paper from some american (midwest?) university which is the earlier record i’ve found of people doing gpu interpreter-ish things. I was originally inspired by Massively Parallel Rendering of Complex Closed-Form Implicit Surfaces in how it uses a tape for each pixel.