What is captured in a CUcontext? Contexts and the cost of module load

Can anyone list specifically what is captured in a CUcontext? I can see from examples that CUmodule and CUfunction can be bound to a CUcontext. Is there a complete list somewhere?

Also, what is the cost of performing a cuModuleLoad() and cuModuleGetFunction() inside a new context (created by cuCtxCreate) for every invocation of cuLaunchGrid()?

From what I can gather, CUDA loads the module from the .cubin and finds the kernel entry point. The former is probably quite slow, while the latter I assume is fast once the module is loaded. Would all of this overhead swamp a typical kernel execution initiated by cuLaunchGrid()?
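To make the pattern I'm asking about concrete, here is a sketch of the per-invocation sequence (driver API, error checking omitted; the .cubin path and kernel name are just placeholders):

```c
#include <cuda.h>

/* Everything here is repeated on EVERY kernel invocation in my current design. */
void launch_once(CUdevice dev)
{
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuCtxCreate(&ctx, 0, dev);                  /* new context each time        */
    cuModuleLoad(&mod, "my_kernels.cubin");     /* module load cost paid here   */
    cuModuleGetFunction(&fn, mod, "my_kernel"); /* entry-point lookup           */

    cuFuncSetBlockShape(fn, 256, 1, 1);         /* block shape for cuLaunchGrid */
    cuLaunchGrid(fn, 64, 1);                    /* the actual launch            */

    cuCtxSynchronize();
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
```

My question is essentially: how much of the wall-clock time of launch_once() is the cuCtxCreate()/cuModuleLoad()/cuModuleGetFunction() preamble versus the cuLaunchGrid() itself?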

I’m only looking for a rough comparison here (e.g., it obviously depends on the .cubin size, but do loading a module and looking up the function entry point take on the order of milliseconds)?

I’m afraid that performing the module load/unload and function entry lookup on each kernel invocation will hurt latency, although if the kernel itself takes several milliseconds to execute, I can probably afford a few tens of microseconds of overhead.
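In case it helps frame an answer: this is roughly how I'd plan to measure the load + lookup latency myself on my own .cubin (assumes a context is already current; timing via clock_gettime is my choice, not anything CUDA-specific):

```c
#include <cuda.h>
#include <time.h>

/* Returns the wall-clock cost in ms of loading a module and resolving
 * one kernel entry point. The cubin path and kernel name are placeholders. */
double load_latency_ms(const char *cubin, const char *kernel)
{
    CUmodule        mod;
    CUfunction      fn;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    cuModuleLoad(&mod, cubin);              /* the part I suspect is slow */
    cuModuleGetFunction(&fn, mod, kernel);  /* the part I assume is fast  */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    cuModuleUnload(mod);
    return (t1.tv_sec - t0.tv_sec) * 1e3
         + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

(I'd average over many iterations and discard the first call, since the first load presumably also pays driver warm-up costs.)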

Thanks for any insight…