CUDA memory leak in sin / cos implementation (CUDA 3.0)? local memory not freed after kernel exits

Hi,

I suspect there might be a memory leak in the implementation of the sin / cos slow path (see programming guide p. 94). I noticed that a kernel making heavy use of sin / cos computations (in an inline device function) apparently leaks memory: after the kernel exits, 15 MiB more memory is in use than before. As I understand it, that should never happen: any local memory a kernel allocates should be freed when the kernel exits.

The kernel is automatically generated code and pretty long:

[attachment=23249:kernel.cu]

Compiling with --ptxas-options=-v gives this output:

ptxas info : Used 126 registers, 452+0 bytes lmem, 28+16 bytes smem, 144 bytes cmem[0], 108 bytes cmem[1]

Meaning it uses 452 bytes of local memory. I can't tell whether all of it comes from the sin / cos computations or from registers spilling to local memory (which might well happen, since 126 registers are already in use).
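One way to separate the two (this is just a sketch, not the attached generated code) would be to compile a stripped-down probe kernel with and without the trig calls and compare the lmem figure reported by --ptxas-options=-v:

__global__ void lmem_probe(const double *in, double *out, int n)
{
    // Hypothetical probe kernel: build once with -DUSE_TRIG and once without,
    // and compare the "bytes lmem" line from --ptxas-options=-v. Any lmem that
    // remains in the baseline build can only come from register spilling.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#ifdef USE_TRIG
    out[i] = sin(in[i]) + cos(in[i]);   // double-precision sin/cos, may take the slow path
#else
    out[i] = in[i];                     // baseline without any trig
#endif
}

(compiled e.g. with nvcc -arch=sm_13 --ptxas-options=-v -c probe.cu, once with and once without -DUSE_TRIG)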

When I compare global memory usage (using cuMemGetInfo) immediately before and after launching this kernel I see the following:

before: Free memory 4156686336 Byte (3964 MiB) / 4294770688 Byte (4095 MiB)

after: Free memory 4140957696 Byte (3949 MiB) / 4294770688 Byte (4095 MiB)

That is, 15728640 bytes (15 MiB) are suddenly “missing”. This number is not an exact multiple of 452 (it is roughly 35,000 times that), and it is independent of both the problem size and the number of threads I launch the kernel with.
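For reference, the measurement is done roughly like in this sketch (the kernel name and signature are placeholders; I use the runtime-API counterpart of cuMemGetInfo here):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int ncells);   // placeholder for the real kernel

// Print free/total device memory, analogous to the cuMemGetInfo numbers above.
static void print_free(const char *tag)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    std::printf("%s: Free memory %zu Byte (%zu MiB) / %zu Byte\n",
                tag, freeB, freeB >> 20, totalB);
}

void launch_and_measure(float *d_data, int ncells)
{
    print_free("before");
    my_kernel<<<128, 32>>>(d_data, ncells);
    cudaThreadSynchronize();     // make sure the launch has completed before measuring
    print_free("after");         // about 15 MiB less free here in the failing case
}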

Profiling this kernel with cudaprof reports (128 blocks, 32 threads per block, ncells=131072):

local load 105340

local store 793488

Nor are the 15728640 missing bytes a multiple of either of these counts.

For a similar kernel whose only difference is that those three inline function evaluations are replaced by parameters passed to the kernel, the issue does not appear (i.e. the memory usage immediately before and after invoking the kernel is identical, as expected).

[attachment=23250:kernel_noinline.cu]

Compiling this kernel with --ptxas-options=-v gives this output:

ptxas info : Used 118 registers, 44+16 bytes smem, 52 bytes cmem[1]

Profiling the kernel shows no local loads or local stores (as expected).
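For illustration, the difference between the two variants boils down to something like the following sketch (function and parameter names are made up; the real code is in the attachments):

// Variant 1 (kernel.cu): values are computed per thread through an inline
// device function that calls sin/cos; this is where the 452 bytes of lmem
// show up.
__device__ inline double angle_term(double x)
{
    return sin(x) * cos(2.0 * x);            // stand-in for the generated expression
}

__global__ void kernel_inline(const double *x, double *out, int ncells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ncells)
        out[i] = angle_term(x[i]);
}

// Variant 2 (kernel_noinline.cu): the three values are computed on the host
// and passed in as plain kernel parameters, so no trig code (and no lmem)
// ends up in the kernel.
__global__ void kernel_params(const double *x, double *out, int ncells,
                              double term0, double term1, double term2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ncells)
        out[i] = x[i] * term0 + term1 + term2;   // stand-in use of the parameters
}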

The machine I used has the following specs:

- Intel® Core™2 Duo CPU E8400 @ 3.00GHz
- 2GB RAM
- Tesla C1060
- Ubuntu 8.04, kernel 2.6.24-24-server #1 SMP Wed Apr 15 15:41:09 UTC 2009 x86_64 GNU/Linux
- nvcc 3.0, V0.2.1221
- gcc 4.2.4

I’m curious to hear whether anyone else had a similar experience.

Florian

Does the leak persist after cudaThreadExit(), or when the application has terminated?

As far as I know, kernels are cached so that successive calls have low latency. This cache might have the same lifetime as the CUDA context you're in. Maybe this also includes any local memory that is used by the kernel.
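Something like this sketch should tell the two cases apart (runtime API, called right after the kernel has finished):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: if the "missing" 15 MiB is owned by the context (kernel image plus
// its local-memory backing kept around by the runtime), cudaThreadExit()
// should give it back; a genuine leak would survive the context teardown.
void check_context_owned_memory()
{
    size_t freeB = 0, totalB = 0;

    cudaMemGetInfo(&freeB, &totalB);
    std::printf("after kernel:         %zu MiB free\n", freeB >> 20);

    cudaThreadExit();                    // destroys the current CUDA context

    cudaMemGetInfo(&freeB, &totalB);     // implicitly creates a fresh context
    std::printf("after cudaThreadExit: %zu MiB free\n", freeB >> 20);
}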

The leak does not persist after the application has terminated (i.e. the result is perfectly reproducible with the exact same amount of initial free memory every time).

My findings are consistent with your explanation. If this “missing” memory is cached by the CUDA runtime, it would make sense that it is only freed when the runtime is finalized, i.e. after the last call to cuMemGetInfo() I can make, which explains why the 15 MiB are still “missing” at the end of my program.

I just ran the kernel twice in a row, and, as expected from your explanation, no additional memory is allocated when the kernel is invoked the second time.

This would not be a problem in a normal use case by itself. It becomes a problem, however, if you launch a large number of different kernels with large input data in a benchmark setting. The cached memory then seems to clutter global memory over time, until at some point no further allocations are possible. That is what I saw: my benchmark eventually crashed because cudaMalloc returned a NULL pointer.
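The failure looks roughly like in this sketch; checking the status returned by cudaMalloc (rather than only the pointer) makes the out-of-memory condition explicit:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch of the failure mode in the benchmark: every *new* kernel adds its
// context-level allocation on top of the explicit buffers, so at some point
// the cudaMalloc for the next test case fails with cudaErrorMemoryAllocation.
float *alloc_or_report(size_t bytes)
{
    float *ptr = NULL;
    cudaError_t err = cudaMalloc((void **)&ptr, bytes);
    if (err != cudaSuccess) {
        size_t freeB = 0, totalB = 0;
        cudaMemGetInfo(&freeB, &totalB);
        std::fprintf(stderr, "cudaMalloc(%zu) failed: %s (%zu bytes still free)\n",
                     bytes, cudaGetErrorString(err), freeB);
        return NULL;
    }
    return ptr;
}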

I think the caching should be smart enough in this case to drop the cached allocation in favor of an explicit cudaMalloc.

A comment from a developer on this issue would be great!
