I want to compile and run a constant stream of CUDA kernels that are being written directly in PTX. My plan is to use nvPTXCompiler with OpenMP to fully utilize all CPU cores, compiling each kernel independently to a cubin and keeping both the source and the results in CPU RAM. Each kernel will only run once, so the lifecycle is load, run, unload…
I’d like advice on managing the loading, running and unloading of the modules. I’ve been reading around, and some posts from 2018 say that all kernels currently executing in a GPU context will stall while a cuModuleLoadDataEx is in flight. I’ve also read that, since module loading is serialized against the context, it’s best to use nvJitLink to merge the cubins from multiple kernels into one cubin before calling cuModuleLoadDataEx.
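Concretely, the per-kernel lifecycle I'm picturing looks like this (driver API; the entry name, launch geometry, and zero-argument kernel are placeholders, and error checks are omitted — just a sketch of the sequence I'm asking about):

```cpp
#include <cuda.h>

// Load a cubin, launch its single kernel once, then unload.
// cuModuleLoadDataEx is the call that reportedly serializes
// against other work in the same context.
void run_once(CUcontext ctx, const void* cubin, const char* entry_name) {
    cuCtxSetCurrent(ctx);

    CUmodule mod;
    cuModuleLoadDataEx(&mod, cubin, 0, nullptr, nullptr);

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, entry_name);

    // Placeholder geometry; no kernel arguments in this sketch.
    cuLaunchKernel(fn,
                   /*grid*/ 1, 1, 1,
                   /*block*/ 256, 1, 1,
                   /*sharedMem*/ 0, /*stream*/ 0,
                   /*kernelParams*/ nullptr, /*extra*/ nullptr);

    cuCtxSynchronize();   // each kernel runs exactly once
    cuModuleUnload(mod);
}
```

My worry is that, at one module per kernel, the load/unload pair in this sequence dominates and serializes everything, which is why the nvJitLink batching idea is appealing.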
For someone familiar with these systems: what should I try in order to get the most throughput out of this whole workflow? Would dual contexts in a double-buffer style be ideal? I’m focusing on consumer cards like my RTX 4090. (I’ve read that MIG can mitigate some of these issues, but I won’t have access to it on consumer hardware.)
Thanks!