Advice for constant ptx->cubin compilation, loading, running

I want to compile and run a constant stream of CUDA kernels that are being written directly as PTX. I plan on using the nvPTXCompiler API with OpenMP to fully utilize all CPU cores, compiling each kernel independently to a cubin and keeping the code and results in CPU RAM. Each kernel will only run once, so: load, run, unload…

I’d like advice on managing the loading, running, and unloading of the modules. I’ve been reading around, and some posts from 2018 say that all kernels currently executing in a GPU context will stall waiting for a cuModuleLoadDataEx. I’ve also read that it’s best to use nvJitLink to merge the cubins from multiple kernels into one cubin before calling cuModuleLoadDataEx, since the load is serialized against the context.

For someone familiar with these systems, what should I try in order to get the most occupancy out of this whole workflow? Would dual contexts in a double-buffer style be ideal? I’m focusing on consumer cards like my 4090. (I’ve read that MIG can mitigate some of these issues, but I won’t have access to it.)

Thanks!


What are typical running times and compilation times for your setup per kernel? Do kernels strictly have to be serialized?

I’d like to minimize both the number of SMs each kernel occupies and the time each kernel runs, while keeping that in balance with the compile and load. I’m using cutlass/cute SIMT f32 kernels as a benchmark; those take 50-70 ms to go from PTX to SASS per CPU core. I’ll have more SMs than CPU cores, so 50 ms / 48 cores = x / 128 SMs gives the runtime needed to balance at one kernel per SM, 50/48 = x/64 balances at one kernel per two SMs, and so on. I can scale the kernel sizes/SM counts up to eventually balance against the module load, but I’m not sure how to handle the loading in the best way.