My experiments with renouveau at that time were suggesting that the kernel code was only loaded after the cuLaunchGrid call.
Anyway, does this mean that if I use only one kernel from a module containing a few hundred kernels (typical of template-based libraries like CUBLAS, CUDPP…), all the kernels will be loaded into device memory?
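For reference, here is a minimal driver-API sketch of the sequence I have in mind; the module and kernel names are placeholders and error checking is omitted. The open question is whether the code for every kernel in the module reaches device memory at cuModuleLoad time, or only when cuLaunchGrid fires:

```c
#include <cuda.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fun;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* The module may contain hundreds of kernels (CUBLAS/CUDPP style)... */
    cuModuleLoad(&mod, "module.cubin");          /* placeholder file name */

    /* ...but we only ever ask for one of them. */
    cuModuleGetFunction(&fun, mod, "myKernel");  /* placeholder kernel name */

    cuFuncSetBlockShape(fun, 256, 1, 1);
    cuParamSetSize(fun, 0);
    cuLaunchGrid(fun, 1, 1);   /* the point where the renouveau traces saw the upload */
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```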
Since API calls are asynchronous, it may also make sense to start loading the kernel(s) as soon as possible to overlap the initialization phases.
Even loading more kernels than strictly necessary might not cause performance degradation. Actually I was hoping for an answer from Tim along the lines of: “Benchmarking shows that sending 100K through the PCIe bus is only marginally slower than sending 1K, and always much faster than sending 100 times 1K, so we decided to aggressively prefetch all kernels in advance.”
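If someone wants to test that hypothesis themselves, something like the rough sketch below would do; the 100 KB vs. 100 × 1 KB sizes are just the hypothetical numbers above, and the coarse host-side timer only shows the order of magnitude, not precise PCIe throughput:

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Coarse wall-clock timer in milliseconds (POSIX clock_gettime). */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr dptr;
    const size_t chunk = 1024, chunks = 100, total = chunk * chunks;
    unsigned char *host = malloc(total);
    double t0, t1;
    size_t i;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dptr, total);

    /* One big transfer: 1 x 100 KB. */
    t0 = now_ms();
    cuMemcpyHtoD(dptr, host, total);
    t1 = now_ms();
    printf("1 x 100 KB : %.3f ms\n", t1 - t0);

    /* Many small transfers: 100 x 1 KB. */
    t0 = now_ms();
    for (i = 0; i < chunks; ++i)
        cuMemcpyHtoD(dptr + i * chunk, host + i * chunk, chunk);
    t1 = now_ms();
    printf("100 x 1 KB : %.3f ms\n", t1 - t0);

    cuMemFree(dptr);
    free(host);
    cuCtxDestroy(ctx);
    return 0;
}
```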
I guess I’ll never know. ;)
(Well, it’s certainly more complicated than that, because each kernel probably needs to be aligned on a 4K-page boundary…)
Well, that’s obviously true: there’s some driver overhead associated with memory allocation, etc., so doing it all in one go instead of X times is certainly faster.