Can anyone list specifically what is captured in a CUcontext? I can see from examples that CUmodule and CUfunction can be bound to a CUcontext. Is there a complete list somewhere?
Also, what is the cost of performing a cuModuleLoad() and cuModuleGetFunction() inside a new context (created by cuCtxCreate) for every invocation of cuLaunchGrid()?
From what I can gather, CUDA loads the module from the .cubin and finds the kernel entry point. The former is probably quite slow while the latter I assume to be fast once the module is loaded. Would all of this overhead swamp a typical kernel execution initiated by cuLaunchGrid?
I’m only looking for a rough comparison here. I realize it depends on the .cubin size, but do loading a module and looking up the function entry point take on the order of milliseconds?
I’m afraid that performing the module load/unload and the function entry lookup on every kernel invocation will hurt latency. If the kernel itself takes several milliseconds to execute, I can probably afford a few tens of microseconds of setup per launch.
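The trade-off I'm weighing is just the share of each invocation spent on setup. A minimal sketch of that arithmetic follows; the function name is my own, and any numbers plugged in are illustrative guesses, not measurements.

```c
/* Fraction of one invocation's time spent on setup (module load plus
 * entry-point lookup) rather than on the kernel itself.
 * Both arguments are in microseconds; the name is hypothetical. */
double setup_overhead_fraction(double setup_us, double kernel_us)
{
    return setup_us / (setup_us + kernel_us);
}
```

For example, setup_overhead_fraction(50.0, 5000.0) is 50/5050, a little under 1%, which I could live with. If the setup turns out to cost as much as the kernel, say setup_overhead_fraction(5000.0, 5000.0), half of every invocation is overhead.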
Thanks for any insight…