Maybe this is a stupid question, but this topic is not very clear in my mind.
Usually code resides in DRAM, which in our specific case is partitioned between device memory and host memory.
In which of these two partitions does the code reside?
What is actually happening when I make a kernel function call?
If a memory copy of the code (from host to device) needs to be done, does it add visible latency like a normal cudaMemcpy?
So generally the kernel code is a binary, a .cubin file, that is loaded onto the GPU upon the first invocation of the kernel.
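You can also make this loading step explicit with the CUDA driver API, which is useful to see what the runtime does for you behind the scenes. Here is a minimal sketch (the file name `kernel.cubin` and kernel name `myKernel` are placeholders, and error handling is reduced to a single check macro); it needs a CUDA-capable GPU and links against `libcuda`:

```c
#include <stdio.h>
#include <cuda.h>   /* CUDA driver API */

#define CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { fprintf(stderr, "error %d at %s\n", r, #call); return 1; } \
} while (0)

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* This is the step the runtime API performs lazily for you:
       the compiled binary is uploaded to device memory here.
       (cuModuleLoad also accepts a PTX file, which is then JIT-compiled.) */
    CHECK(cuModuleLoad(&mod, "kernel.cubin"));        /* placeholder file */
    CHECK(cuModuleGetFunction(&fn, mod, "myKernel")); /* placeholder name */

    /* Launch: 1 block of 32 threads, no shared memory, no parameters. */
    CHECK(cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL));
    CHECK(cuCtxSynchronize());

    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

With the runtime API (`<<<...>>>` launches) this module load happens implicitly, typically on first use, which is why the first kernel call can be noticeably slower than subsequent ones.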
Sometimes this .cubin is JIT-compiled at runtime from a PTX file (which can make your application take a long time to perform its first kernel call).
The upside of PTX is of course that it isn't compiled for one specific GPU architecture.
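For reference, here is roughly how the two forms are produced with nvcc; `kernel.cu` and the sm_70 architecture are just example choices (these commands require the CUDA toolkit):

```shell
# Architecture-specific binary: runs only on matching GPUs, no JIT needed.
nvcc -cubin -arch=sm_70 kernel.cu -o kernel.cubin

# Portable PTX: JIT-compiled by the driver on first load for whatever GPU is present.
nvcc -ptx kernel.cu -o kernel.ptx

# Typical fat binary: embed SASS for sm_70 plus PTX as a forward-compatible fallback.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     kernel.cu -o app
```

The fat-binary approach is the usual compromise: GPUs you compiled for skip the JIT step, while newer GPUs can still JIT the embedded PTX.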
This is the general idea; someone please correct me if I'm wrong :-)