Track RAM usage

Hello all, I have to process a 2 TB dataset on a node with 300 GB of RAM and four Tesla GPUs, each with 32 GB of global memory. The first step is to batch disk->RAM, then RAM->GPUs.

Using a 10 GB dataset, I tested the batch functions (disk->RAM and RAM->GPUs) by forcing the code to use only 1% of the host RAM and 1% of the global memory of each GPU.

When running the code for the big dataset, however, it gives a “Bus error”. That probably means I am exceeding the total 300 GB RAM of that node, right?

Any suggestions on how to track that leak?

“Bus error” is a clear indication that the problem occurs in host code. It means that your program accesses an invalid memory address. Alternatively, it can be caused by a memory access that is not naturally aligned on a platform that does not support unaligned access.

Since most host platforms supported by CUDA support unaligned access to memory, the problem is very likely an access to an invalid memory address. So you would want to check those accesses, in particular the values of the pointers used in the code. Possible error scenarios: (1) the pointer used may be uninitialized; (2) a pointer containing a device address was dereferenced in host code; (3) incorrect address arithmetic was performed on a pointer; (4) a pointer was corrupted by an out-of-bounds write elsewhere in the code.
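To illustrate scenario (3), here is a minimal sketch of a checked offset computation for a batch loader. It assumes (this is not something known about the code in question) that byte offsets are computed as batch index times batch size, where arithmetic done in a 32-bit `int` would wrap once an offset passes roughly 2 GiB. The function name `batch_offset_bytes` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compute the byte offset of a batch inside a buffer of
 * total_bytes, doing all arithmetic in size_t and rejecting anything that
 * would overflow or land outside the buffer.
 * Returns 0 and stores the offset in *out on success, -1 otherwise. */
int batch_offset_bytes(size_t batch_index, size_t batch_elems,
                       size_t elem_size, size_t total_bytes, size_t *out)
{
    assert(out != NULL);

    if (batch_elems == 0 || elem_size == 0)
        return -1;

    /* batch_elems * elem_size must fit in size_t. */
    if (batch_elems > SIZE_MAX / elem_size)
        return -1;
    size_t batch_bytes = batch_elems * elem_size;

    /* batch_index * batch_bytes must fit in size_t. */
    if (batch_index > SIZE_MAX / batch_bytes)
        return -1;
    size_t offset = batch_index * batch_bytes;

    /* The batch must start inside the buffer. */
    if (offset >= total_bytes)
        return -1;

    *out = offset;
    return 0;
}
```

Calling something like this wherever a host pointer is advanced, and printing the arguments when it returns -1, may make a bad offset visible before it is dereferenced.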

I checked (1), (2), (3), and (4) (although (4) is hard to check) and could not find any errors myself.
Running on the small 10 GB example, cuda-memcheck shows no errors. Valgrind reports errors in NCCL functions and in some other host functions that are worth investigating. The thing is, when there is no “Bus error”, the diff of the results against a previous version of the code is zero. Also, are there alternatives to Valgrind for checking for host-side memory leaks in CUDA C code?
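As a lightweight complement to Valgrind, one option is to log the process's peak resident set size between batches and watch whether it keeps growing. The sketch below uses the POSIX `getrusage` call and assumes Linux, where `ru_maxrss` is reported in kilobytes. The helper names `peak_rss_kib` and `log_peak_rss` are made up for illustration.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of the calling process.
 * On Linux, ru_maxrss is in kilobytes. Returns -1 if getrusage fails. */
long peak_rss_kib(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}

/* Hypothetical logging helper: print the peak RSS with a label, for example
 * after each disk->RAM batch and after each RAM->GPU transfer. */
void log_peak_rss(const char *label)
{
    fprintf(stderr, "[mem] %s: peak RSS = %ld KiB\n", label, peak_rss_kib());
}
```

If the value logged after each batch climbs steadily toward the node's 300 GB, a host-side leak or an oversized batch buffer becomes a more likely suspect. If it stays flat, the bus error probably has another cause.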

In any case, thanks for the reply, even though it is a broad question.

A bus error is akin to a seg fault. That means the bus error can be isolated to a single line of host code. Furthermore, if it is repeatable, you should be able to stop before that line gets executed and print all relevant data, such as the numerical values of the pointers involved.

That information (I suspect) could be used to reduce the problem to a trivial test case, which would likely increase your understanding of what is happening.
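If running under a debugger is awkward on the cluster, a rough alternative sketch is to install a SIGBUS handler that prints the faulting address before exiting. This assumes a POSIX system. `install_sigbus_handler` and `on_sigbus` are hypothetical names, and the handler only reports the address, so you would still have to match it against the pointer values your code logs.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Print the faulting address using only async-signal-safe calls, then exit. */
static void on_sigbus(int sig, siginfo_t *info, void *context)
{
    static const char hex[] = "0123456789abcdef";
    char msg[] = "SIGBUS at address 0x0000000000000000\n";
    size_t last = sizeof msg - 3;           /* index of the last hex digit */
    uintptr_t addr = (uintptr_t)info->si_addr;

    (void)sig;
    (void)context;

    /* Fill the 16 hex digits from right to left. */
    for (size_t i = 0; i < 16; i++) {
        msg[last - i] = hex[addr & 0xf];
        addr >>= 4;
    }
    (void)write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1);
}

/* Install the handler; returns 0 on success, -1 on failure (as sigaction). */
int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Call `install_sigbus_handler()` once at the start of `main`. The printed address can then be compared with the host buffers' base addresses and sizes to see which buffer, if any, the access fell outside of.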

If you run the faulting application from inside gdb, it should report the program counter (PC) / instruction pointer (IP) associated with the bus error, for example:

Program received signal SIGBUS, Bus error.
0x080483ba in main ()

If the host code was compiled with the appropriate debug symbol settings, gdb should then be able to point you directly at the relevant line in the source code associated with that PC / IP. You can then trace backwards from there to identify the root cause.