Computed gotos / interpreter design

Hi All,

An alternative approach to the issue shown in my previous question (the CUDA compilation time of a huge kernel taking ages) would be to implement some kind of bytecode interpreter, in order to bypass the compilation stage entirely.

The most efficient implementations of bytecode interpreters rely on computed gotos / threaded code (a la Forth), but unfortunately nvcc does not support that feature in kernel code:
error: address of label extension is not supported in __device__/__global__ functions
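
For reference, this is the kind of dispatch I mean (GNU "labels as values"). nvcc rejects the &&label form in device code with the error above, so this is only to illustrate the construct and is not expected to compile; the handler names are made up:

```
// Illustration only: a classic direct-threaded dispatch using the GNU
// "labels as values" extension. nvcc rejects the &&label form in
// __device__/__global__ code with the error quoted above, so this does
// NOT compile as a kernel; handler names are made up.
__global__ void threaded_vm(const unsigned char* code, float* out)
{
    void* dispatch[] = { &&op_add, &&op_halt };   // <-- rejected by nvcc in device code

    float acc = 0.0f;
    const unsigned char* pc = code;

    goto *dispatch[*pc++];        // jump directly to the first handler
op_add:
    acc += 1.0f;
    goto *dispatch[*pc++];        // each handler ends by dispatching the next opcode
op_halt:
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```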

Are you aware of a workaround or an alternative design?
thx

I would suggest performing a literature search. Here is one relevant publication:

Ahmet Celik, Pengyu Nie, Christopher J. Rossbach, Milos Gligoric, “Design, Implementation, and Application of GPU-Based Java Bytecode Interpreters.” Proc. ACM Program. Lang. 3, OOPSLA, Article 177 (October 2019), 28 pages. https://doi.org/10.1145/3360603

I have not read the above paper. My initial reaction was that bytecode interpreters are a poor match for GPUs, but apparently people who have put more thought into this arrived at a different conclusion.

Having done extensive research on how to implement a bytecode interpreter in CUDA (it has been my work project for the past two years): computed gotos or threaded code are not feasible in current nvcc, even when/if abusing the brx.idx PTX instruction. Simply write a normal interpreter loop, possibly with macro-ops etc., and try to put your bytecode in the fastest possible memory (constant is nice, shared is even nicer, and you can do manual paging in/out of shared if needed).
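
For what it's worth, a bare-bones skeleton of that kind of loop (ordinary switch dispatch, bytecode held in __constant__ memory) looks roughly like this; the opcode set, operand encoding, and names (run_vm, d_program, OP_*) are invented for the example, not taken from any real project:

```
#include <cstdint>

enum Opcode : uint8_t { OP_LOAD_INPUT, OP_PUSH_IMM, OP_ADD, OP_MUL, OP_HALT };

constexpr int MAX_PROGRAM = 4096;
__constant__ uint8_t d_program[MAX_PROGRAM];   // bytecode, served by the constant cache

__global__ void run_vm(const float* __restrict__ in, float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float stack[16];   // small per-thread operand stack (registers / local memory)
    int   sp = 0;
    int   pc = 0;

    for (;;) {
        switch (d_program[pc++]) {          // ordinary switch dispatch, no computed goto
        case OP_LOAD_INPUT: stack[sp++] = in[tid];                 break;
        case OP_PUSH_IMM:   stack[sp++] = (float)d_program[pc++];  break;
        case OP_ADD:        --sp; stack[sp - 1] += stack[sp];      break;
        case OP_MUL:        --sp; stack[sp - 1] *= stack[sp];      break;
        case OP_HALT:       out[tid] = stack[sp - 1];              return;
        default:            return;                                // unknown opcode
        }
    }
}
```

The host uploads the program once with cudaMemcpyToSymbol(d_program, bytes, num_bytes) and then launches run_vm over as many data sets as needed; since every thread in a warp reads the same d_program[pc], the constant-cache broadcast path works in your favour.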

Do you want to interpret the same bytecode program with different data? (How many data sets? For parallelization.) Or does each computation have a different bytecode program?

Yes, we have to process multiple data sets with the same bytecode. Each data set differs from the others by only a few bits.

That is good; then you do not get warp divergence. Think about a language that can be turned into bytecode as efficiently as possible: memory accesses (temporary memory, global memory) and computations. Keep the generated code as branch-free as possible and stream the bytecode into shared memory, as in the sketch below.
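
A rough sketch of what I mean by streaming the bytecode through shared memory, assuming straight-line (branch-free) code so control flow never leaves the current chunk; the chunk size, toy opcodes, and names are made up:

```
#include <cstdint>

constexpr int CHUNK = 1024;   // bytes of bytecode staged in shared memory at a time

__global__ void interp_streamed(const uint8_t* __restrict__ program, int program_len,
                                const float* __restrict__ in, float* __restrict__ out,
                                int n)
{
    __shared__ uint8_t s_code[CHUNK];

    int   tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = (tid < n) ? in[tid] : 0.0f;   // toy per-thread state

    for (int base = 0; base < program_len; base += CHUNK) {
        int len = min(CHUNK, program_len - base);

        // The whole block cooperatively stages the next chunk of bytecode.
        for (int i = threadIdx.x; i < len; i += blockDim.x)
            s_code[i] = program[base + i];
        __syncthreads();

        // Every thread walks the same chunk on its own data, so the dispatch
        // itself never diverges across the warp.
        for (int pc = 0; pc < len; ++pc) {
            switch (s_code[pc]) {             // toy opcodes, purely illustrative
            case 0:  acc += 1.0f; break;
            case 1:  acc *= 2.0f; break;
            default:              break;
            }
        }
        __syncthreads();   // don't refill s_code while some threads are still reading it
    }
    if (tid < n) out[tid] = acc;
}
```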


Thank you all for your advice. @hugo32, is your work on this topic (the bytecode interpreter) publicly available?

@raph38130 unfortunately it's not publicly available, but I can describe some of it.

We have a stack-based bytecode interpreter for a public spec. We are in a more general setting than (1 code, N data), since we support (N code, M data, plus runtime code modification), but we are heavily optimised for the same-code situation.

We keep the code in constant memory for good caching performance in the dispatch loop, dynamically moving code out to generic global memory when the code size is too large. That gets you to roughly 1-10 M op/s because of branch resolution and global-memory latency, and our opcodes are heavy enough that each thread's access to its stack is the main bottleneck. For that we do windowed stack caching in shared memory, tuned after benchmarking to about 8 items per thread: a push/pop does not incur global-memory latency until the stack height has grown by 4, at which point we slide the window 4 items up, and so on. We use all available dynamic shared memory for this, which lets us reach ~80% memory and compute utilisation (as per Nsight profiling).

It is also important to consider coalescing of memory accesses across threads, for which we have a number of SoA-adapted data structures.

There is a 2009 paper from some American (midwest?) university that is the earliest record I've found of people doing GPU interpreter-ish things. I was originally inspired by "Massively Parallel Rendering of Complex Closed-Form Implicit Surfaces" and the way it uses a tape per pixel.
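
I can't post the real code, but a toy sketch of the windowed stack cache (8 cached items per thread, sliding by 4) could look roughly like this; the names (StackCache, g_spill, vm_kernel) and layout details are simplified illustrations, not our actual implementation:

```
constexpr int WINDOW = 8;   // cached stack slots per thread (what benchmarking converged on)
constexpr int HALF   = 4;   // slots spilled / refilled when the window slides

struct StackCache {
    float* s_win;     // this thread's slots in shared memory, strided by blockDim.x
    float* g_spill;   // per-thread spill area in global memory (SoA, strided by nthreads)
    int    nthreads;  // total thread count = SoA stride, keeps spills coalesced
    int    gid;       // global thread id
    int    sp;        // logical stack height
    int    base;      // height of the lowest item currently held in shared memory

    // Slot i of a warp's threads sits in consecutive banks -> no bank conflicts.
    __device__ float& slot(int i)  { return s_win[i * blockDim.x]; }
    // Item at logical height h in the spill area; a warp's accesses are coalesced.
    __device__ float& spill(int h) { return g_spill[h * nthreads + gid]; }

    __device__ void push(float v) {
        if (sp == base + WINDOW) {                          // window full: slide up by HALF
            for (int i = 0; i < HALF; ++i) spill(base + i) = slot(i);          // spill bottom
            for (int i = 0; i < WINDOW - HALF; ++i) slot(i) = slot(i + HALF);  // shift rest down
            base += HALF;
        }
        slot(sp - base) = v;
        ++sp;
    }

    __device__ float pop() {
        if (sp == base && base > 0) {                       // window empty: slide down by HALF
            for (int i = 0; i < HALF; ++i) slot(i) = spill(base - HALF + i);   // refill
            base -= HALF;
        }
        --sp;
        return slot(sp - base);
    }
};

// Typical setup inside the interpreter kernel, launched with
// blockDim.x * WINDOW * sizeof(float) bytes of dynamic shared memory:
__global__ void vm_kernel(float* g_spill_all, int nthreads /*, bytecode, data, ... */)
{
    extern __shared__ float s_mem[];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    StackCache st { s_mem + threadIdx.x, g_spill_all, nthreads, gid, 0, 0 };

    st.push(1.0f);                 // the dispatch loop would go here instead
    st.push(2.0f);
    float top = st.pop() + st.pop();
    (void)top;
}
```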
we have a stack based bytecode interpreter from a public spec, are in a more general setting than (1 code, N data), indeed we support (N code, M data, + runtime code modification), but are heavily optimised for the same-code situation. We keep code in constant mem for good caching performance in the dispatch loop, with dynamic moving out to code in generic global memory when code size is too large. This gets you to 1-10M op/s due to branch resolution & gmem latency, and our opcodes are heavy enough that it’s mostly access to the stack for each thread that bottlenecks. For that we do windowed stack caching in shmem, adapted after benchmarking to about 8 items per thread (thus push/pop doesn’t incur gmem latency until the stack height has increased by 4, at which point we slide the window 4 items up, etc. We use all available dynamic shared memory for this), which lets us achieve ~80% memory & compute utilisation (as per nsight profiling). also important to consider coalescing of memory access across threads, for which we have a bunch of SoA adapted data structures. There is a 2009 paper from some american (midwest?) university which is the earlier record i’ve found of people doing gpu interpreter-ish things. I was originally inspired by Massively Parallel Rendering of Complex Closed-Form Implicit Surfaces in how it uses a tape for each pixel.