How to detect that a CUDA code is failing due to memory constraints?

I have a CUDA code that, when run on multiple GPUs, times out after a while. This happens seemingly at random and the issue is not reproducible. To test whether the issue was due to memory constraints, I ran a different problem to keep the GPUs slightly occupied. Upon then running the original problem, I noticed that the original code failed consistently. In all cases, nvidia-smi showed that the memory use on each GPU was far below capacity (less than 1/5 used). How do I detect memory issues in CUDA codes?

Check the return status of every CUDA API call, in particular those that allocate memory.
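A common pattern is to wrap every runtime API call in a checking macro so a failure aborts immediately with its location, instead of silently corrupting later calls. This is a minimal sketch; the macro name `CUDA_CHECK` is just an illustration, not a library-provided helper:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with file/line info on any runtime API failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",               \
                    cudaGetErrorName(err_), __FILE__, __LINE__,           \
                    cudaGetErrorString(err_));                            \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main() {
    float *d_buf = nullptr;
    // An allocation failure (cudaErrorMemoryAllocation) is reported here
    // rather than surfacing later as a mysterious kernel failure.
    CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

Note that kernel launches themselves return no status; check them separately with `cudaGetLastError()` right after the launch and, for asynchronous errors, after a `cudaDeviceSynchronize()`.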

compute-sanitizer provides convenient checking for various kinds of violations, including failed API calls. It can find many (but is not guaranteed to find all) issues in CUDA programs. Random failures are most often the result of (1) use of uninitialized data, (2) out-of-bounds memory accesses, or (3) race conditions.
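For example, the classic out-of-bounds bug below is exactly what compute-sanitizer's memcheck tool flags at the faulting thread and address (the kernel here is just an illustration):

```cuda
#include <cuda_runtime.h>

// Each thread handles one element. When the grid is rounded up to a
// multiple of the block size, the last block runs threads past n; the
// i < n guard is what keeps them from writing out of bounds.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // omitting this guard is the classic OOB write
        data[i] *= 2.0f;
}

// compute-sanitizer runs the unmodified binary, e.g.:
//   compute-sanitizer --tool memcheck  ./my_app   (invalid accesses)
//   compute-sanitizer --tool racecheck ./my_app   (shared-memory races)
//   compute-sanitizer --tool initcheck ./my_app   (uninitialized device memory)
```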


Thanks! By failure, I mean the program waiting indefinitely. Do the reasons you mentioned hold for the timeout cases?

By and large, yes. For example, you could have a loop whose exit condition depends on an uninitialized variable, or a deadlock caused by a race condition or by incorrect use of synchronization primitives. Note that the root cause could be in host code; it doesn’t have to be inside the device code.
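One well-known way incorrect synchronization turns into a hang is a conditionally executed `__syncthreads()`: the barrier must be reached by every thread of the block, so if some threads skip it, the rest can wait forever. A contrived sketch (the kernel is hypothetical, just to show the shape of the bug):

```cuda
#include <cuda_runtime.h>

// BUG: __syncthreads() sits inside a divergent branch. For a partial
// last block, threads with i >= n never reach the barrier, so the
// threads that did reach it may wait indefinitely (behavior is undefined).
__global__ void bad_reduce(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();   // barrier must be unconditional for the block
        // ... reduction over tile ...
    }
}
```

compute-sanitizer's synccheck tool (`compute-sanitizer --tool synccheck ./my_app`) is designed to catch this class of divergent-barrier error.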

Before coming up with hypotheses for specific failure scenarios, it is easiest, and a best practice, to use readily available tools to eliminate as many issues as possible, both statically (e.g. compiling with -Wall -Werror) and dynamically (e.g. running under valgrind for host code and compute-sanitizer for device code).
