I’m designing a log and replay system where I need to capture all function calls to the CUDA library and then replay them. For this, I’m explicitly calling internal CUDA APIs, like __cudaRegisterFatBinary and __cudaRegisterFunction.
I have two questions:
1. Is there any documentation available for these internal functions? I checked the CUDA documentation page but couldn’t find anything on them.
2. When replaying, if we call __cudaRegisterFatBinary with the same fatcubinHandle as during the logging phase, does it return the same value as it did during logging, or does it generate a different handle each time, regardless of the FatCubin value?
If NVIDIA had intended these functions to be called by programmers, they would have made them public APIs and documented them in the CUDA documentation. The two leading underscores in the symbol names make it clear that they are for internal use by the system only.
Any mechanism internal to some software should be treated as a programming artifact: Its exact behavior is not documented and not guaranteed by the software provider, and it could change at any time without notice.
Any code relying on artifacts in 3rd party software is therefore brittle, i.e. easily broken. That does not mean it could not work for some unknown amount of time if the behavior of the 3rd party software is characterized sufficiently via reverse engineering. There is a long tradition of tools taking such an approach. They are, however, a royal pain to maintain (been there, done that, got the t-shirt), may never work 100% reliably, and at some point may just stop working entirely.
There isn’t any documentation available for them, and I would be fired for violating company policy if I released material non-public information about them. You’re welcome to ask whatever questions you wish, of course, but I doubt you’ll get any authoritative answers from NVIDIA.
Ok. So here is the modified question. If I am logging and replaying the CUDA function calls, can I use the old registration values like fatCubinHandle at the time of replay, or do I have to re-register?
I don’t know if I will be able to answer or not, and I don’t know if the Runtime or Driver API are documented enough to discuss how to “replay” them, whatever that may mean.
Which calls specifically? Can you give an example?
In general, a handle is an opaque object. If an API such as cudaStreamCreate() returns a handle (i.e. a specific numerical value or bit pattern), I would not assume that running that call again (“replay”) in a different instance of the application, or in a different application, would return the same numerical value or bit pattern. Likewise, I would not assume that a handle value returned in application instance A could be used by a subsequent call in a different application instance B.
cudaStream_t s;
cudaStreamCreate(&s); // stores some numerical value like 12345 in `s`
cudaStreamSynchronize(s); // uses the numerical value 12345 to identify a stream
In the above example, if I ran the application and logged the value 12345 as the actual value stored in the handle s, and I desired to “replay” that sequence of calls in another application based on a recorded log of the sequence of calls, then I would not assume that 12345 had any meaning or relevance anywhere in the other/“replay” application setting.
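One way a replay tool might cope with this is to treat logged handle values purely as keys and translate them to whatever handle the replay process actually receives. The following is only a minimal sketch of that idea, not anything taken from the CUDA runtime: HandleTable, LoggedHandle, LiveHandle and fake_create_stream are hypothetical names, and fake_create_stream merely stands in for a handle-creating call such as cudaStreamCreate() so that the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

// Hypothetical stand-in for an opaque runtime handle such as cudaStream_t.
using LiveHandle = void*;
// Handle value exactly as it was written to the log during the recording run.
using LoggedHandle = std::uint64_t;

// Translates handle values found in the log into handles that are valid
// in the replay process. The logged value is only ever used as a lookup key.
class HandleTable {
public:
    // Call right after replaying a handle-creating call: remember which
    // live handle now plays the role of the logged one.
    void bind(LoggedHandle logged, LiveHandle live) { map_[logged] = live; }

    // Call before replaying any call that consumes a handle.
    LiveHandle resolve(LoggedHandle logged) const {
        auto it = map_.find(logged);
        if (it == map_.end())
            throw std::out_of_range("logged handle has no live counterpart");
        return it->second;
    }

    // Call after replaying a handle-destroying call.
    void release(LoggedHandle logged) { map_.erase(logged); }

private:
    std::unordered_map<LoggedHandle, LiveHandle> map_;
};

// Hypothetical fake of a handle-creating API (assumption: it models the fact
// that the replay process hands out its own, unrelated handle values).
inline LiveHandle fake_create_stream() {
    static int storage[16];
    static int next = 0;
    return &storage[next++ % 16];
}
```

In the stream example, replaying the logged cudaStreamCreate() would call bind(12345, s_new) with the freshly created stream, and replaying the logged cudaStreamSynchronize() would pass resolve(12345) instead of the raw value 12345. Whether the same approach is workable for the values returned by the internal registration functions is not something this sketch can answer.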