Hello,
I am running some tests on my code on a cluster and something peculiar happens: when I run the code with cuda-gdb, the memory usage increases continuously and eventually it runs out of memory and crashes.
My first thought was that I have some GPU variables which are not deallocated, so I ran the code with cuda-memcheck and compute-sanitizer, but no errors were reported and the memory usage stayed below some threshold. In both cases I compiled the CUDA part with the -G flag.
I also did a manual check: I used cat <code> | grep malloc and matched it against the output of cat <code> | grep Free.
What else can I do to find the bugs? Is it possible there is a problem with running cuda-gdb on clusters?
Best regards,
Cristian
If you are talking about host memory leaks, none of the tools you suggest are designed to find those. A host memory leak can cause a program crash (the OS terminating the program for you automatically). A device memory leak could only lead to an eventual runtime error, not a crash as I have defined it. If you are chasing a host memory leak, this is probably the wrong forum to be asking about it.
compute-sanitizer (and cuda-memcheck) both have leak-checking capability for device memory. If you haven’t enabled those features, try that.
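As an illustration, here is a deliberately leaking toy program (a sketch, not your code) that the leak check only reports when it is enabled, e.g. compute-sanitizer --leak-check full:

// leak_demo.cu -- toy example with one unfreed cudaMalloc.
// Build:  nvcc -G leak_demo.cu -o leak_demo
// Run:    compute-sanitizer --leak-check full ./leak_demo
//   (or:  cuda-memcheck --leak-check full ./leak_demo on older toolkits)
#include <cuda_runtime.h>

int main()
{
    void *leaked = nullptr;
    void *freed  = nullptr;

    cudaMalloc(&leaked, 1 << 20);  // 1 MiB that is never freed -> reported as a leak
    cudaMalloc(&freed,  1 << 20);  // 1 MiB that is freed below  -> not reported

    cudaFree(freed);

    // Leak checking requires the application to tear down the context
    // explicitly, so the tool can see which allocations are still live.
    cudaDeviceReset();
    return 0;
}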
If you think there is a problem with cuda-gdb, I would suggest upgrading to the latest version. If the problem still exists, then I suggest reporting it on the cuda-gdb forum, not here. I also suggest asking questions specifically about cuda-gdb there, not here.
As a final suggestion, to chase a difficult-to-find bug there are many “typical” debugging techniques. A common one is to progressively strip down your code, removing parts. As you run the shrinking code, check for memory leaks. When the memory leaks stop, study what you most recently removed.
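One way to watch device memory while you strip the code down is to query it from inside the program. A minimal sketch (the helper name is hypothetical) using cudaMemGetInfo:

#include <cstdio>
#include <cuda_runtime.h>

// Print free/total device memory; call it between iterations of the
// stripped-down code to see whether the free amount trends downward.
static void report_device_memory(const char *tag)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return;
    }
    printf("[%s] free: %zu MiB, total: %zu MiB\n",
           tag, free_bytes >> 20, total_bytes >> 20);
}

int main()
{
    report_device_memory("start");
    // ... run one iteration of the (stripped-down) workload here ...
    report_device_memory("after iteration");
    return 0;
}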
Hello,
Thank you for your reply. Sorry for the misunderstanding; in the original message I was referring only to GPU memory.
The GPU memory usage increases linearly while running my code with cuda-gdb. When the GPU memory is filled, the code crashes.
I am running my code on a cluster and upgrading is not an option.
I am using version 11.5.
It is weird. I do not understand why the GPU memory usage constantly increases when I run my code with cuda-gdb, while running normally or inside compute-sanitizer the amount of memory stays below some threshold (going up and down as the code allocates and deallocates the memory for the various variables). If there were allocations of device memory with cudaMalloc() that are never freed, this should happen in both cases.
I realized only now (after spending some time digging) that the flag --leak-check full is needed to check for memory leaks caused by cudaMalloc. I got this summary from cuda-memcheck --leak-check full:
========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors
So something really weird is happening.
Hi @cristian.v.achim
It’s expected for cuda-gdb to use a limited amount of GPU memory, but what you are seeing (the debugger consuming all GPU memory) might be a bug in cuda-gdb.
Could you share additional details about the issue:
- cuda-gdb output when debugging your application
- nvidia-smi output
Hello,
Here is the output from the last run:
This is the cuda-gdb output:
warning: Cuda API error detected: cudaMallocAsync returned (0x2)
GPUassert: out of memory src/cuda_wrappers.cu 56 // here I have an allocation cudaMallocAsync( a_d, Np ,0); // Np is the size in bytes
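For context, the GPUassert message comes from an error-checking wrapper around the CUDA calls; a simplified sketch of that pattern (the function name here is hypothetical, not the exact contents of cuda_wrappers.cu) is:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the CUDA error string with file/line and abort on failure.
#define gpuErrchk(call) { gpuAssert((call), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// The failing allocation: Np is the size in bytes, a_d the device pointer.
void device_alloc(void **a_d, size_t Np)
{
    // cudaMallocAsync returns error 0x2 (cudaErrorMemoryAllocation,
    // "out of memory") once device memory is exhausted.
    gpuErrchk(cudaMallocAsync(a_d, Np, 0));
}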
At the moment of the crash I managed to record this from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:44:00.0 Off | 0 |
| N/A 42C P0 88W / 400W | 40321MiB / 40960MiB | 1% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
When running outside of the debugger, or with cuda-memcheck or compute-sanitizer, it does not go over 5 GB of memory, as reported by nvidia-smi.
Cristian
Hi,
Could you please try the following (we need more details to properly diagnose the issue):
Hello,
cuda-gdb --version
$ cuda-gdb --version
NVIDIA (R) CUDA Debugger
11.5 release
I ran the code in the debugger as before and got a debugger.log file (debugger00.log). This run did not complete; it ran out of GPU memory and stopped.
Then I used the /nvlog.config file. After that, I unset the nvlog variable and did export CUDBG_USE_LEGACY_DEBUGGER=1.
This run completed correctly. The GPU memory usage reported by nvidia-smi was within the limits. There was no log file.
debugger00.log (74.5 MB)
debugger01.log (74.4 MB)
Thank you very much for the details.
It looks like there is a bug in the GPU debugging module (which is part of the GPU driver). To help us investigate and address the issue, could you share details on how we can reproduce it on our side (e.g. machine setup, application…)?
I am running it on the Mahti cluster using one GPU.
Each node has two AMD Rome 7H12 CPUs with 64 cores each, and 4 A100 GPUs. The driver version can be seen in the nvidia-smi output above.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0
The jobs are run using slurm:
$ squeue --version
slurm 22.05.8
I was able to reproduce the problem using srun ... cuda-gdb ... and also in an “interactive” session using salloc ....
The application is a code called TurboGAP (the CPU version is public); I am porting some parts of it to GPU.
I wrote some Fortran interfaces for the memcpy and cudaMalloc calls, while the kernels are called through C wrappers.
There is no problem in sharing code with NVIDIA, because in the end it will be publicly available, but it is quite large and it is difficult to select specific parts so that you can reproduce the issue yourselves. I can tell that the bug appeared after I ported a specific subroutine in which I allocate and deallocate some big arrays.
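To give an idea of the structure, the C wrappers look roughly like this (a simplified sketch with hypothetical names, not the actual TurboGAP source); on the Fortran side they are bound through ISO_C_BINDING interfaces:

// cuda_wrappers.cu (sketch) -- thin extern "C" wrappers so the Fortran
// code can call the CUDA runtime through ISO_C_BINDING interfaces.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static void check(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(code);
    }
}

extern "C" {

// Allocate n bytes of device memory on the default stream.
void gpu_malloc_async(void **ptr, size_t n)
{
    check(cudaMallocAsync(ptr, n, 0), __FILE__, __LINE__);
}

// Free device memory allocated with gpu_malloc_async.
void gpu_free_async(void *ptr)
{
    check(cudaFreeAsync(ptr, 0), __FILE__, __LINE__);
}

// Copy n bytes from host to device.
void gpu_memcpy_h2d(void *dst, const void *src, size_t n)
{
    check(cudaMemcpy(dst, src, n, cudaMemcpyHostToDevice), __FILE__, __LINE__);
}

} // extern "C"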
Cristian
Hi @cristian.v.achim
Thank you for the details! We are looking at the issue.
I will update this topic when it’s resolved or if we need to ask for more details about the issue.