How to find leaks? cuda-gdb runs out of memory, but compute-sanitizer runs without errors


I am running some tests on my code on a cluster. Something peculiar happens.
When I run the code with cuda-gdb the memory usage increases continuously and eventually it runs out of memory and crashes.
My first thought was that I have some GPU variables that are not deallocated, so I ran the code with cuda-memcheck and compute-sanitizer, but no errors were reported. The memory usage stayed below some threshold.
In both cases I compiled the CUDA part with the -G flag.
I also did a manual check: I used cat <code> | grep malloc and matched the results against the output of cat <code> | grep Free.
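
The manual grep check can be made a little more systematic. The following is a minimal Python sketch, not taken from the code discussed in this thread: the helper name count_alloc_free and the regular expressions are assumptions, chosen to match CUDA runtime calls such as cudaMalloc, cudaMallocAsync, cudaFree, and cudaFreeAsync. It only counts call sites in the source text, so equal counts do not prove that every allocation is freed at run time.

```python
import re

# Assumed patterns: any call whose name starts with cudaMalloc or cudaFree
# (e.g. cudaMallocAsync, cudaFreeAsync). Case is ignored so that
# Fortran-style spellings such as "cudamalloc" are counted as well.
ALLOC_RE = re.compile(r"\bcudaMalloc\w*\s*\(", re.IGNORECASE)
FREE_RE = re.compile(r"\bcudaFree\w*\s*\(", re.IGNORECASE)


def count_alloc_free(source: str) -> dict:
    """Count allocation and free call sites in a piece of source text."""
    return {
        "alloc": len(ALLOC_RE.findall(source)),
        "free": len(FREE_RE.findall(source)),
    }
```

A mismatch between the two counts is only a hint about where to look; a loop that allocates repeatedly but frees once would still show one call site for each.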

What else can I do to find the bugs? Is it possible that there is a problem with running cuda-gdb on clusters?

Best regards,


If you are talking about host memory leaks, none of the tools you mention are designed to find those. A host memory leak can cause a program crash (the OS terminating the program for you automatically). A device memory leak could only lead to an eventual runtime error, not a crash as I have defined it. If you are chasing a host memory leak, this is probably the wrong forum to be asking about it.

compute-sanitizer (and cuda-memcheck) both have leak-checking capability for device memory. If you haven’t enabled those features, try that.

If you think there is a problem with cuda-gdb, I would suggest upgrading to the latest version. If the problem still exists, then I suggest reporting it on the cuda-gdb forum, not here. I also suggest asking questions specifically about cuda-gdb there, not here.

As a final suggestion, to chase a difficult to find bug there are many “typical” debugging techniques. A common one is to progressively strip down your code, removing parts. As you run the shrinking code, check for memory leaks. When the memory leaks stop, then study what you most recently removed.
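
The strip-down approach can be sketched as a small loop. This is a hypothetical Python illustration, not a tool from the CUDA toolkit: find_suspect, the part labels, and the leaks callback are all assumptions. In practice leaks would mean rebuilding and rerunning the code with only the listed parts enabled and checking whether memory still grows.

```python
def find_suspect(parts, leaks):
    """Remove parts from the end, one at a time, until the leak stops.

    parts: ordered list of labels for code sections that can be disabled
           (hypothetical labels, chosen by whoever is debugging).
    leaks: callable taking the list of still-enabled parts and returning
           True if a run with only those parts still leaks.
    Returns the most recently removed part when the leak first stops, or
    None if the full program does not leak or the leak never stops.
    """
    enabled = list(parts)
    if not leaks(enabled):
        return None  # nothing to chase: the full program does not leak
    while enabled:
        removed = enabled.pop()
        if not leaks(enabled):
            return removed  # the leak stopped right after removing this part
    return None
```

The returned label only points at the section to study; the actual bug may be an interaction between that section and code that is still enabled.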


Thank you for your reply. Sorry for the misunderstanding. In the original message I was referring only to GPU memory.
The GPU memory usage increases linearly while running my code with cuda-gdb. When the GPU memory is full, the code crashes.
I am running my code on a cluster and upgrading is not an option.

I am using version 11.5.

It is weird. I do not understand why the GPU memory usage constantly increases when I run my code with cuda-gdb, while running normally or inside compute-sanitizer the amount of memory stays below some threshold (going up and down as the code allocates and deallocates the memory for the various variables). If device memory allocated with cudaMalloc() were not being freed, the growth should happen in both cases.

I realized only now (after spending some time digging) that the flag --leak-check full is needed to check for memory leaks caused by cudaMalloc. I got this summary from cuda-memcheck --leak-check full

========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors

So something really weird is happening.
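
When the tool is run many times (for example once per test case), the summary line can be extracted automatically. The following Python sketch is an illustration only: parse_leak_summary is a hypothetical helper, and the pattern is taken from the cuda-memcheck line quoted above. That compute-sanitizer prints the same wording is an assumption to verify against your own output.

```python
import re

# Pattern based on the cuda-memcheck summary line quoted in this thread.
SUMMARY_RE = re.compile(
    r"LEAK SUMMARY:\s*(\d+) bytes leaked in (\d+) allocations")


def parse_leak_summary(tool_output: str):
    """Return (bytes_leaked, allocations), or None if no summary line exists."""
    match = SUMMARY_RE.search(tool_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A None result is worth treating as a failure too, since it may mean that leak checking was not enabled for that run.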

Hi @cristian.v.achim
It’s expected for cuda-gdb to use a limited amount of GPU memory, but what you are seeing (debugger consuming all GPU memory) might be a bug in cuda-gdb.

Could you share additional details about the issue:

  • cuda-gdb output when debugging your application
  • nvidia-smi output


Here is the output from the last run:

This is the cuda-gdb output:

warning: Cuda API error detected: cudaMallocAsync returned (0x2)

GPUassert: out of memory src/ 56 // here I have  an allocation cudaMallocAsync( a_d,  Np ,0); // Np is the size in bytes

At the moment of the crash I managed to record this from nvidia-smi:

| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:44:00.0 Off |                    0 |
| N/A   42C    P0    88W / 400W |  40321MiB / 40960MiB |      1%      Default |
|                               |                      |             Disabled |

When running outside of the debugger, or with cuda-memcheck or compute-sanitizer, it does not go over 5 GB of memory, as reported by nvidia-smi.
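
To compare the two situations with numbers rather than by watching nvidia-smi, the used memory can be sampled periodically. The following is a minimal Python sketch under stated assumptions: it relies on the nvidia-smi query options --query-gpu=memory.used and --format=csv,noheader,nounits, and the helper names (parse_used_mib, sample_used_mib, grows_monotonically) are hypothetical.

```python
import subprocess

# nvidia-smi query that prints one number (MiB in use) per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]


def parse_used_mib(text: str) -> list:
    """Turn the query output into a list of used-memory values in MiB."""
    return [int(line) for line in text.splitlines() if line.strip()]


def sample_used_mib() -> list:
    """Run nvidia-smi once; requires an NVIDIA driver on the node."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_used_mib(out.stdout)


def grows_monotonically(samples: list) -> bool:
    """True if the samples never decrease and the last exceeds the first."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]
```

Calling sample_used_mib() every few seconds from a second shell on the same node and passing the collected values for one GPU to grows_monotonically() separates the steady growth seen under cuda-gdb from the up-and-down pattern of a normal run.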


Could you please try the following (we need more details to properly diagnose the issue):

  • Try running the debugging session with CUDBG_USE_LEGACY_DEBUGGER=1 environment variable. Do you observe the same issue with that variable set?

  • What is the output of cuda-gdb --version command?

  • Collect additional logs:

    • Do not set the CUDBG_USE_LEGACY_DEBUGGER variable.

    • Add the NVLOG_CONFIG_FILE variable pointing to the nvlog.config file (attached). E.g.: NVLOG_CONFIG_FILE=${HOME}/nvlog.config
      nvlog.config (539 Bytes)

    • Run the debugging session.

    • You should see the /tmp/debugger.log file created - could you share it with us?
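
The two runs requested above differ only in their environment variables, so they can be scripted to avoid mixing the settings up. This is a hedged Python sketch: debugger_env is a hypothetical helper, and only the variable names CUDBG_USE_LEGACY_DEBUGGER and NVLOG_CONFIG_FILE come from the steps above.

```python
import os


def debugger_env(base=None, legacy=False, nvlog_config=None):
    """Build the environment for one cuda-gdb run.

    legacy=True sets CUDBG_USE_LEGACY_DEBUGGER=1.
    nvlog_config, if given, is the path assigned to NVLOG_CONFIG_FILE.
    Requesting both raises an error, because the steps above ask for the
    logs to be collected without the legacy variable set.
    """
    if legacy and nvlog_config is not None:
        raise ValueError("collect logs without CUDBG_USE_LEGACY_DEBUGGER set")
    env = dict(os.environ if base is None else base)
    # Start from a clean state so that stale exports cannot leak into the run.
    env.pop("CUDBG_USE_LEGACY_DEBUGGER", None)
    env.pop("NVLOG_CONFIG_FILE", None)
    if legacy:
        env["CUDBG_USE_LEGACY_DEBUGGER"] = "1"
    if nvlog_config is not None:
        env["NVLOG_CONFIG_FILE"] = nvlog_config
    return env
```

The result can be passed as the env argument of subprocess.run when launching cuda-gdb, for example subprocess.run(["cuda-gdb", "--args", "./app"], env=debugger_env(legacy=True)), where ./app is a placeholder for the real executable.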


cuda-gdb --version

$ cuda-gdb --version
NVIDIA (R) CUDA Debugger
11.5 release

I ran the code in debugger as before and got a debugger.log file (debugger00.log). This run did not complete. It ran out of GPU memory and stopped.

Then I used the /nvlog.config file

Then I unset the nvlog variable and did export CUDBG_USE_LEGACY_DEBUGGER=1. This run completed correctly. The GPU memory usage reported by nvidia-smi was within the limits. There was no log file.

debugger00.log (74.5 MB)
debugger01.log (74.4 MB)

Thank you very much for the details.
It looks like there is a bug in the GPU debugging module (which is a part of the GPU driver). To help us investigate and address the issue could you share the details on how we can reproduce the issue on our side? (e.g. machine setup, application…)

I am running it on the Mahti cluster using one GPU

Each node has two AMD Rome 7H12 CPUs with 64 cores, and 4 A100 GPUs. From the above nvidia-smi output one can see the driver version.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0

The jobs are run using slurm:

$ squeue --version
slurm 22.05.8

I was able to reproduce the problem using srun ... cuda-gdb ... and also in an “interactive” session using salloc.....

The application is a code called TurboGAP (the CPU version is public); I am porting some parts of it to the GPU.
I wrote some Fortran interfaces for the memcpy and cudaMalloc calls, while the kernels are called using C wrappers.

There is no problem with sharing the code with NVIDIA, because in the end it will be publicly available, but it is quite large and it is difficult to select specific parts that would let you reproduce the issue yourself. I can tell that the bug appeared after I ported a specific subroutine in which I am allocating and deallocating some big arrays.


Hi @cristian.v.achim
Thank you for the details! We are looking at the issue.

I will update this topic when it’s resolved or if we need to ask for more details about the issue.
