How to detect that a CUDA code is failing due to memory constraints?

I have a CUDA code that, when run on multiple GPUs, times out after a while. This happens seemingly at random and the issue is not reproducible. To test whether the issue was due to memory constraints, I ran a different problem to keep the GPUs slightly occupied. Upon then running the original problem, I noticed that the original code failed consistently. In all cases, nvidia-smi showed that the memory use on each GPU was far below capacity (less than 1/5 used). How do I detect memory issues in CUDA codes?

Check the return status of every CUDA API call, in particular those that allocate memory.
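A common pattern is to wrap every runtime API call in a checking macro so a failure aborts immediately with its location, instead of silently corrupting later calls. This is a minimal sketch; the macro name `CUDA_CHECK` is just an illustration, not a library-provided helper:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with file/line info on any runtime API failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",               \
                    cudaGetErrorName(err_), __FILE__, __LINE__,           \
                    cudaGetErrorString(err_));                            \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main() {
    float *d_buf = nullptr;
    // An allocation failure (cudaErrorMemoryAllocation) is reported here
    // rather than surfacing later as a mysterious kernel failure.
    CUDA_CHECK(cudaMalloc(&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

Note that kernel launches themselves return no status; check them separately with `cudaGetLastError()` right after the launch and, for asynchronous errors, after a `cudaDeviceSynchronize()`.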

compute-sanitizer provides convenient checking for various kinds of violations, including failed API calls. It can find many (but is not guaranteed to find all) issues in CUDA programs. Random failures are most often the result of (1) use of uninitialized data, (2) out-of-bounds memory accesses, or (3) race conditions.
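For example, the classic out-of-bounds bug below is exactly what compute-sanitizer's memcheck tool flags at the faulting thread and address (the kernel here is just an illustration):

```cuda
#include <cuda_runtime.h>

// Each thread handles one element. When the grid is rounded up to a
// multiple of the block size, the last block runs threads past n; the
// i < n guard is what keeps them from writing out of bounds.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // omitting this guard is the classic OOB write
        data[i] *= 2.0f;
}

// compute-sanitizer runs the unmodified binary, e.g.:
//   compute-sanitizer --tool memcheck  ./my_app   (invalid accesses)
//   compute-sanitizer --tool racecheck ./my_app   (shared-memory races)
//   compute-sanitizer --tool initcheck ./my_app   (uninitialized device memory)
```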


Thanks! By failure, I mean the program waiting indefinitely. Do the reasons you mentioned hold for the timeout cases?

By and large, yes. For example, you could have a loop whose exit condition depends on an uninitialized variable, or a deadlock caused by a race condition or by incorrect use of synchronization primitives. Note that the root cause could be in host code; it doesn’t have to be inside the device code.
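One well-known way incorrect synchronization turns into a hang is a conditionally executed `__syncthreads()`: the barrier must be reached by every thread of the block, so if some threads skip it, the rest can wait forever. A contrived sketch (the kernel is hypothetical, just to show the shape of the bug):

```cuda
#include <cuda_runtime.h>

// BUG: __syncthreads() sits inside a divergent branch. For a partial
// last block, threads with i >= n never reach the barrier, so the
// threads that did reach it may wait indefinitely (behavior is undefined).
__global__ void bad_reduce(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();   // barrier must be unconditional for the block
        // ... reduction over tile ...
    }
}
```

compute-sanitizer's synccheck tool (`compute-sanitizer --tool synccheck ./my_app`) is designed to catch this class of divergent-barrier error.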

Before coming up with hypotheses for specific failure scenarios, it is easiest, and a best practice, to use readily available tools to eliminate as many issues as possible, both statically (e.g. compiling with -Wall -Werror) and dynamically (e.g. running under valgrind for host code and compute-sanitizer for device code).
