free kernel code after execution

Dear all,

I am hunting a (device) memory leak in my application. I use the runtime API and cudaMemGetInfo(&free, &total) to measure device memory usage. I notice a 31 MB drop in free memory after kernel execution. The kernel code itself does not allocate any device memory.

When is the kernel code loaded into device memory? I guess at execution of the host code line that launches the kernel.
Does the code stay in device memory after the call? If so, can I explicitly unload it?

I use CUDA 4.0 on 64-bit Linux (NVIDIA UNIX x86_64 Kernel Module 270.41.19), on a GeForce GTX 480.
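For reference, here is a minimal sketch that reproduces the measurement (the `noop` kernel and the labels are my own; the API call is cudaMemGetInfo, and the exact numbers will vary by driver and device):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel that performs no device allocations of its own.
__global__ void noop() {}

static void report(const char *label) {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);  // note: cudaMemGetInfo, not cudaGetMemInfo
    printf("%s: CUDA reports %zu(%zu) bytes free\n", label, free_b, total_b);
}

int main() {
    report("Before 1st call");   // the context may not even exist yet here
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
    report("After 1st call");    // typically lower: context/module overhead
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
    report("After 2nd call");    // usually unchanged from the 1st call
    return 0;
}
```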



Can you show the piece of code that launches the kernel?

Typically, the call of a kernel looks like:

    // allocate device memory and copy data from the host to it
    cudaMalloc(...);
    cudaMemcpy(..., cudaMemcpyHostToDevice);

    // call the kernel
    ...<<< ... >>>(...);

    // free GPU memory
    // (normally you release everything you allocated just before with cudaMalloc)
    cudaFree(...);

I just reread the post …

It doesn’t seem that code loaded into memory needs to be managed manually, but in any case 31 MB of code is a bit much (or you have a LOT of code) …

I didn’t understand what you mean by code that doesn’t need to be managed manually. Is there a way to free the device code after kernel execution?

Here is the code section:

    extern "C" void function_host(kernel_args_t *kernel_args, kernel_geom_t *kernel_geom)
    {
        dim3 blocksPerGrid(kernel_geom->Nblock_x, kernel_geom->Nblock_y, 1);
        dim3 threadsPerBlock(kernel_geom->threads_per_block, 1, 1);

        kernel<<<blocksPerGrid, threadsPerBlock>>>(kernel_args);

        // check for launch errors
        cudaError_t kernel_call = cudaGetLastError();
        if (kernel_call != cudaSuccess)
        {
            fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(kernel_call));
        }
    }


I meant to say that there is no device-code memory management to do (I think); it’s handled automatically.

If you correctly free your “kernel_args”, I don’t see why there would be a memory leak …

PS: if you launch your kernel several times, does the memory leak grow with each call?

Maybe it’s not a “leak”. All I am saying is that less device memory is available after the kernel call than before it.

I just ran the program again, this time calling the kernel twice. Result: the same amount of device memory is missing:
Before 1st call:
CUDA reports 1382629376(1609760768) bytes free
After 1st call:
CUDA reports 1352073216(1609760768) bytes free
After 2nd call:
CUDA reports 1352073216(1609760768) bytes free

I have a very vague idea of what might be happening:
At the 1st call, the CUDA context enlarges its scratch space so that the kernel code fits in.
This can be seen in what cudaMemGetInfo reports, namely that the available space is smaller afterwards.
The 2nd call does not enlarge the scratch space, since the kernel code is already cached.

Can someone confirm that this is how CUDA works?

If so, this has a dangerous implication: what if I allocate device memory that happens to sit right behind the CUDA context’s scratch space, and CUDA then needs to enlarge the scratch space (because of a call to a not-yet-loaded kernel) but can’t because of my earlier allocation?

That can’t happen in the runtime API. You can’t call cudaMalloc until there is an active context on the device, and all of the operations which allocate heap/scratch/local memory happen during context establishment and module loading. Note that in your example, “before 1st call” is giving you the available memory before context establishment. After the context is established, the available memory will only change if you either allocate memory yourself or call one of the API operations which changes a function’s heap space allocation.
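To make the “before” measurement meaningful, you can force context establishment before measuring. A sketch (`cudaFree(0)` is a common idiom for this; any context-requiring call would do):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaSetDevice(0);
    cudaFree(0);  // forces context establishment without allocating anything

    size_t free_before = 0, free_after = 0, total = 0;
    cudaMemGetInfo(&free_before, &total);

    // ... launch kernels here ...

    cudaMemGetInfo(&free_after, &total);
    // With the context already established, the two readings should differ
    // only by explicit allocations (or API calls that grow per-context heaps).
    printf("delta = %ld bytes\n", (long)free_before - (long)free_after);
    return 0;
}
```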

Good morning,

I am really sorry for digging up such old posts, but I am interested in learning what happens during initialization of the GPU. What actions take place during initialization? Does this initialization take place when you call a kernel from the host for the first time? Does it take place when you call any CUDA function (e.g. cudaMalloc, cudaSetDevice, etc.) for the first time? If, for example, you had to call kernel A and, after kernel A finished, you had to call kernel B, would the context be established twice or just once (during the call of kernel A)?

Last, but not least, what exactly do you mean when you say “module loading”?

Thanks beforehand.

In the runtime API, device context initialization takes place at the first CUDA function that requires a context. If by that point you have not set a device, the default device will be used. Once the context is created, it is used for the remainder of the program unless you specifically do something to change that.

There is additional overhead the first time you call a CUDA kernel. The kernel might need some JIT recompilation (depending on which architecture it was compiled for) before it can be sent to the device, so in general you will notice the first call to a kernel takes longer than subsequent calls.

So, if you had a program that did this:

cudaMalloc() an array called "data"

launch kernelA on "data"

launch kernelB on "data"
The CUDA context will be created when cudaMalloc() is called, and then it will be used by kernelA and kernelB.
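That sequence could be sketched like this (kernel bodies and sizes are placeholders of my own):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float *data) { /* ... */ }
__global__ void kernelB(float *data) { /* ... */ }

int main() {
    float *data = 0;
    cudaMalloc(&data, 1024 * sizeof(float));  // context is created here, at the first CUDA call
    kernelA<<<4, 256>>>(data);                // reuses the same context
    kernelB<<<4, 256>>>(data);                // likewise; no second context is created
    cudaFree(data);
    return 0;
}
```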

The CUDA driver does numerous lazy allocations for local memory, stack memory, constant memory, instruction RAM, the malloc heap, the printf heap, … In addition, the CUDA driver has a block memory allocator that may allocate more memory than is requested by the user. cudaMemGetInfo should not be used to find memory leaks. On some operating systems this value can also be affected by other processes.
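Some of these per-context pools can be inspected and resized through the runtime API. A sketch (the limit names are from the CUDA runtime; the exact accounting is driver-dependent):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaFree(0);  // establish the context first

    size_t v = 0;
    cudaDeviceGetLimit(&v, cudaLimitStackSize);
    printf("stack per thread:   %zu bytes\n", v);
    cudaDeviceGetLimit(&v, cudaLimitPrintfFifoSize);
    printf("printf FIFO:        %zu bytes\n", v);
    cudaDeviceGetLimit(&v, cudaLimitMallocHeapSize);
    printf("device malloc heap: %zu bytes\n", v);

    // Growing a limit makes the context reserve more device memory,
    // which then shows up as less "free" in cudaMemGetInfo.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u * 1024 * 1024);

    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    printf("free after resize:  %zu of %zu bytes\n", free_b, total_b);
    return 0;
}
```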