Advice for constant ptx->cubin compilation, loading, running

I want to compile and run a constant stream of CUDA kernels that are being written directly as PTX. I plan on using the nvPTXCompiler API with OpenMP to fully utilize all CPU cores, compiling each kernel independently to a cubin and keeping the code and results in CPU RAM. Each kernel will only run once, so: load, run, unload…

I’d like advice on managing the loading, running, and unloading of the modules. I’ve been reading around, and some posts from 2018 say that all kernels currently executing in a GPU context will stall waiting for a cuModuleLoadDataEx. I’ve also read that it’s best to use nvJitLink to merge the cubins from multiple kernels into one cubin before calling cuModuleLoadDataEx, since the load is serialized against the context.

For someone familiar with these systems, what should I try in order to get the most occupancy out of this whole workflow? Would dual contexts in a double-buffer style be ideal? I’m focusing on consumer cards like my 4090. (I’ve read that MIG can mitigate some of these issues, but I won’t have access to it.)

Thanks!


What are typical running times and compilation times for your setup per kernel? Do kernels strictly have to be serialized?

I’d like to minimize both the number of SMs each kernel occupies and the time each kernel runs, while keeping that in balance with the compile and load. I’m using cutlass/cute SIMT f32 kernels as a benchmark; those take 50-70 ms to go from PTX to SASS per CPU core. I’ll have more SMs than CPU cores, so 50 ms / 48 cores = x / 128 SMs gives the runtime needed to balance at one kernel per SM, 50/48 = x/64 balances at one kernel per two SMs, and so on. I can scale the kernel sizes/SM counts up to eventually balance against the module load, but I’m not sure how to handle the loading in the best way.