Hello,
I am running some tests on my code on a cluster and something peculiar happens: when I run the code with cuda-gdb, the memory usage increases continuously and eventually it runs out of memory and crashes.
My first thought was that I have some GPU variables which are not deallocated, so I ran the code with cuda-memcheck and compute-sanitizer, but no errors were reported and the memory usage stayed below some threshold. In both cases I compiled the CUDA part with the -G flag.
I also did a manual check: I used cat <code> | grep malloc and matched it against the output of cat <code> | grep Free.
What else can I do to find the bugs? Is it possible there is a problem with running cuda-gdb on clusters?
Best regards,
Cristian
If you are talking about host memory leaks, none of the tools you suggest are designed to find those. A host memory leak can cause a program crash (the OS terminating the program for you automatically). A device memory leak could only lead to an eventual runtime error, not a crash as I have defined it. If you are chasing a host memory leak, this is probably the wrong forum to be asking about it.
compute-sanitizer (and cuda-memcheck) both have leak-checking capability for device memory. If you haven’t enabled those features, try that.
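As an illustration, here is a deliberately leaking toy program (a sketch, not your code) that the leak check only reports when it is enabled, e.g. compute-sanitizer --leak-check full:

// leak_demo.cu -- toy example with one unfreed cudaMalloc.
// Build:  nvcc -G leak_demo.cu -o leak_demo
// Run:    compute-sanitizer --leak-check full ./leak_demo
//   (or:  cuda-memcheck --leak-check full ./leak_demo on older toolkits)
#include <cuda_runtime.h>

int main()
{
    void *leaked = nullptr;
    void *freed  = nullptr;

    cudaMalloc(&leaked, 1 << 20);  // 1 MiB that is never freed -> reported as a leak
    cudaMalloc(&freed,  1 << 20);  // 1 MiB that is freed below  -> not reported

    cudaFree(freed);

    // Leak checking requires the application to tear down the context
    // explicitly, so the tool can see which allocations are still live.
    cudaDeviceReset();
    return 0;
}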
If you think there is a problem with cuda-gdb, I would suggest upgrading to the latest version. If the problem still exists, then I suggest reporting it on the cuda-gdb forum, not here. I also suggest asking questions specifically about cuda-gdb there, not here.
As a final suggestion, to chase a difficult-to-find bug there are many “typical” debugging techniques. A common one is to progressively strip down your code, removing parts. As you run the shrinking code, check for memory leaks. When the memory leaks stop, study what you most recently removed.
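One way to watch device memory while you strip the code down is to query it from inside the program. A minimal sketch (the helper name is hypothetical) using cudaMemGetInfo:

#include <cstdio>
#include <cuda_runtime.h>

// Print free/total device memory; call it between iterations of the
// stripped-down code to see whether the free amount trends downward.
static void report_device_memory(const char *tag)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return;
    }
    printf("[%s] free: %zu MiB, total: %zu MiB\n",
           tag, free_bytes >> 20, total_bytes >> 20);
}

int main()
{
    report_device_memory("start");
    // ... run one iteration of the (stripped-down) workload here ...
    report_device_memory("after iteration");
    return 0;
}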
Hello,
Thank you for your reply. Sorry for the misunderstanding; in the original message I was referring only to GPU memory.
The GPU memory usage increases linearly while running my code with cuda-gdb. When the GPU memory is filled, the code crashes.
I am running my code on a cluster and upgrading is not an option.
I am using version 11.5.
It is weird. I do not understand why the GPU memory usage constantly increases when I run my code with cuda-gdb, while running normally or inside compute-sanitizer the amount of memory stays below some threshold (going up and down as the code allocates and deallocates the memory for the various variables). If there were allocations of device memory with cudaMalloc() that are never freed, this should happen in both cases.
I realized only now (after spending some time digging) that the flag --leak-check full is needed to check for memory leaks caused by cudaMalloc. I got this summary from cuda-memcheck --leak-check full:
========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 0 errors
So something really weird is happening.
Hi @cristian.v.achim
It’s expected for cuda-gdb to use a limited amount of GPU memory, but what you are seeing (the debugger consuming all GPU memory) might be a bug in cuda-gdb.
Could you share additional details about the issue:
- cuda-gdb output when debugging your application
- nvidia-smi output
Hello,
Here is the output from the last run:
This is the cuda-gdb output:
warning: Cuda API error detected: cudaMallocAsync returned (0x2)
GPUassert: out of memory src/cuda_wrappers.cu 56 // here I have an allocation cudaMallocAsync( a_d, Np ,0); // Np is the size in bytes
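For context, the GPUassert message comes from an error-checking wrapper around the CUDA calls; a simplified sketch of that pattern (the function name here is hypothetical, not the exact contents of cuda_wrappers.cu) is:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the CUDA error string with file/line and abort on failure.
#define gpuErrchk(call) { gpuAssert((call), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// The failing allocation: Np is the size in bytes, a_d the device pointer.
void device_alloc(void **a_d, size_t Np)
{
    // cudaMallocAsync returns error 0x2 (cudaErrorMemoryAllocation,
    // "out of memory") once device memory is exhausted.
    gpuErrchk(cudaMallocAsync(a_d, Np, 0));
}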
At the moment of the crash I managed to record this from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:44:00.0 Off | 0 |
| N/A 42C P0 88W / 400W | 40321MiB / 40960MiB | 1% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
When running outside of the debugger, or with cuda-memcheck or compute-sanitizer, it does not go over 5 GB of memory, as reported by nvidia-smi.
Cristian
Hi,
Could you please try the following (we need more details to properly diagnose the issue):
Hello,
cuda-gdb --version
$ cuda-gdb --version
NVIDIA (R) CUDA Debugger
11.5 release
I ran the code in the debugger as before and got a debugger.log file (debugger00.log). This run did not complete; it ran out of GPU memory and stopped.
Then I used the /nvlog.config file. After that, I unset the nvlog variable and did export CUDBG_USE_LEGACY_DEBUGGER=1.
This run completed correctly. The GPU memory usage reported by nvidia-smi was within the limits. There was no log file.
debugger00.log (74.5 MB)
debugger01.log (74.4 MB)
Thank you very much for the details.
It looks like there is a bug in the GPU debugging module (which is part of the GPU driver). To help us investigate and address the issue, could you share details on how we can reproduce it on our side (e.g. machine setup, application…)?
I am running it on the Mahti cluster using one GPU.
Each node has two AMD Rome 7H12 CPUs with 64 cores each, and 4 A100 GPUs. The driver version can be seen in the nvidia-smi output above.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0
The jobs are run using slurm:
$ squeue --version
slurm 22.05.8
I was able to reproduce the problem using srun ... cuda-gdb ... and also in an “interactive” session using salloc ....
The application is a code called TurboGAP (the CPU version is public); I am porting some parts of it to GPU.
I wrote some Fortran interfaces for the memcpy and cudaMalloc calls, while the kernels are called through C wrappers.
There is no problem in sharing code with NVIDIA, because in the end it will be publicly available, but it is quite large and it is difficult to select specific parts so that you can reproduce the issue yourselves. I can tell that the bug appeared after I ported a specific subroutine in which I allocate and deallocate some big arrays.
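To give an idea of the structure, the C wrappers look roughly like this (a simplified sketch with hypothetical names, not the actual TurboGAP source); on the Fortran side they are bound through ISO_C_BINDING interfaces:

// cuda_wrappers.cu (sketch) -- thin extern "C" wrappers so the Fortran
// code can call the CUDA runtime through ISO_C_BINDING interfaces.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static void check(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(code);
    }
}

extern "C" {

// Allocate n bytes of device memory on the default stream.
void gpu_malloc_async(void **ptr, size_t n)
{
    check(cudaMallocAsync(ptr, n, 0), __FILE__, __LINE__);
}

// Free device memory allocated with gpu_malloc_async.
void gpu_free_async(void *ptr)
{
    check(cudaFreeAsync(ptr, 0), __FILE__, __LINE__);
}

// Copy n bytes from host to device.
void gpu_memcpy_h2d(void *dst, const void *src, size_t n)
{
    check(cudaMemcpy(dst, src, n, cudaMemcpyHostToDevice), __FILE__, __LINE__);
}

} // extern "C"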
Cristian
Hi @cristian.v.achim
Thank you for the details! We are looking at the issue.
I will update this topic when it’s resolved or if we need to ask for more details about the issue.