Extending CUDA API Allocating "used" host memory

Hi there,

Imagine someone is missing a certain functionality in CUDA.
Presuming good programming skills, how would one implement, let's say, an extension for CUDA, either by extending the CUDA API itself or by writing an additional, rather small API?

Thanks for dropping me your thoughts and opinions,

This depends on whether you need access to the internal data structures used by the runtime. For example, if you need access to the allocated memory maps on the device, you would have to reimplement either the runtime API or the driver API. If the functionality you need does not require interacting with the internals of the current API, it would be much simpler to just create a library of your own. If you want to see what it would take to implement the CUDA API from scratch, you could take a look at some of our code here:






You might want to take a look at lines 1614-1643. There we added two API calls that allow trace generators to be bound to kernels as they are launched.
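
To make the "library of your own" option a bit more concrete, here is a minimal sketch of a small wrapper layer that sits on top of an allocation API and keeps its own bookkeeping. Everything in it is an assumption made for illustration: the `cudaext` namespace, `TrackedAllocator`, and the `AllocFn`/`FreeFn` signatures are hypothetical names, and the backend defaults to `malloc`/`free` so it compiles without the CUDA toolkit. In real use you would presumably pass in thin functions that forward to `cudaMalloc` and `cudaFree`. Note that such a wrapper only sees allocations that go through it; it does not give access to the runtime's internal memory maps.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>

// Hypothetical extension library layered on top of an allocation API.
// Assumption: in real use the backend functions would forward to
// cudaMalloc/cudaFree; here they default to malloc/free so the sketch
// builds and runs without the CUDA toolkit.
namespace cudaext {

using AllocFn = int (*)(void** ptr, std::size_t bytes);  // 0 means success
using FreeFn  = int (*)(void* ptr);                      // 0 means success

// Stand-in backend based on the C heap.
inline int hostAlloc(void** ptr, std::size_t bytes) {
    *ptr = std::malloc(bytes);
    return *ptr ? 0 : 1;
}

inline int hostFree(void* ptr) {
    std::free(ptr);
    return 0;
}

// Wraps a backend allocator and records every live allocation it makes.
class TrackedAllocator {
public:
    explicit TrackedAllocator(AllocFn a = hostAlloc, FreeFn f = hostFree)
        : alloc_(a), free_(f) {}

    // Allocate through the backend and remember pointer and size on success.
    int allocate(void** ptr, std::size_t bytes) {
        int status = alloc_(ptr, bytes);
        if (status == 0) live_[*ptr] = bytes;
        return status;
    }

    // Free only pointers that were handed out by this wrapper.
    int release(void* ptr) {
        auto it = live_.find(ptr);
        if (it == live_.end()) return 1;  // unknown pointer, not ours
        live_.erase(it);
        return free_(ptr);
    }

    // Total size of all allocations that are currently live.
    std::size_t liveBytes() const {
        std::size_t total = 0;
        for (const auto& entry : live_) total += entry.second;
        return total;
    }

    std::size_t liveCount() const { return live_.size(); }

private:
    AllocFn alloc_;
    FreeFn free_;
    std::map<void*, std::size_t> live_;
};

}  // namespace cudaext
```

Application code would then call `allocate` and `release` on a `cudaext::TrackedAllocator` instead of calling the backend directly, and could query `liveBytes()` at any point. The existing API stays untouched, which is the main appeal of this route over reimplementing the runtime or driver API.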