The 1.0 release of CUDA has functions for execution management:
cudaLaunch, cudaConfigureCall, etc.
But there are no examples of how these can be used. Does anyone have a simple example of how this is done? Also, there aren't any module-management functions documented in the runtime API. If you can't load a module, how will the execution-management functions even know about it?
The simpleTextureDrv example uses the low-level API (cuda.h). The documentation for execution/module management in that low-level API is pretty good, and as you point out, the example shows how to do everything.
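For reference, the driver-API sequence that simpleTextureDrv follows boils down to roughly the following. This is a minimal sketch with error checking omitted; the file name "kernel.cubin", the entry name "kernel", and the grid/block sizes are placeholders, not taken from the SDK example.

```cuda
#include <cuda.h>

void launch_via_driver_api(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Explicit module management: load a compiled .cubin and
    // look up the kernel entry point by name.
    cuModuleLoad(&mod, "kernel.cubin");
    cuModuleGetFunction(&fn, mod, "kernel");

    // Set the launch configuration and push the kernel arguments.
    cuFuncSetBlockShape(fn, 256, 1, 1);
    cuParamSeti(fn, 0, 4096);          // one int argument at offset 0
    cuParamSetSize(fn, sizeof(int));

    cuLaunchGrid(fn, 16, 1);           // launch a 16x1 grid of blocks
}
```

The explicit cuModuleLoad/cuModuleGetFunction pair is what the runtime API hides, which is exactly why the module-management question comes up for the high-level API.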
But I am using (and would like to continue using) the high-level API, which has little documentation and no examples. Has someone figured out how to do the equivalent with the high-level API?
At this point, this is not an option for me - I have wrapped the entire high-level API into Python. The goal in my case is to make it as easy as possible to use the CUDA libraries and GPU from Python. The low-level API is counterproductive in that respect. At some point in the future I do plan on wrapping the low-level API as well, but that is a lower priority.
But, the question, still remains: the high-level API does have functions that would seem to do the job, but how do you use them?
Your statement seems to imply that you have been able to use the high-level API for execution management. Could you post a simple example?
I haven't used the high-level API myself without nvcc. But I read an nvcc-generated program's disassembly while translating my entire program to the low-level API.
Basically, you could use nvcc to compile a simple program with --save-temps and read its output to figure out how to use those cudaXxx functions.
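For what it's worth, the sequence nvcc emits for a launch like kernel<<<grid, block>>>(d_data, n) reduces to roughly the sketch below. This is a reconstruction, not verbatim nvcc output; the entry string passed to cudaLaunch must match whatever name the nvcc-generated startup stub registered for the kernel, which is also the answer to the module question: the runtime API only knows about the kernel because that generated registration code runs at program startup.

```cuda
#include <cuda_runtime.h>

// Sketch of the runtime-API call sequence behind <<<...>>>.
// Assumes d_data is a device pointer that was already allocated
// (e.g. with cudaMalloc) and that "kernel" is the registered entry name.
void launch_via_runtime_api(float *d_data, int n)
{
    dim3 grid(16, 1, 1), block(256, 1, 1);

    // Push the launch configuration: grid/block shape,
    // dynamic shared memory size, and stream.
    cudaConfigureCall(grid, block, 0, 0);

    // Push each kernel argument at its byte offset in the
    // parameter area, in declaration order.
    size_t offset = 0;
    cudaSetupArgument(&d_data, sizeof(d_data), offset);
    offset += sizeof(d_data);
    cudaSetupArgument(&n, sizeof(n), offset);

    // Launch by entry name; this only works if the kernel was
    // registered by nvcc-generated code at startup.
    cudaLaunch("kernel");
}
```

If you compile without nvcc, nothing performs that registration step, which is why the runtime API alone gives you no way to bring in a module.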
Besides, if you're not using nvcc, I really doubt whether the high-level API would be any easier. Maybe wrapping the low-level API would save you some trouble later. Your case is pretty much like writing a non-nvcc compiler.