Track RAM usage

Hello all, I have to process a 2 TB dataset on a node with 300 GB of RAM and 4 Tesla GPUs with 32 GB of global memory each. The first step is to batch disk->RAM, then RAM->GPUs.

Using a 10 GB dataset, I tested the batch functions (disk->RAM and RAM->GPUs) by forcing the code to use only 1% of the host RAM and 1% of the global memory of each GPU.

When running the code for the big dataset, however, it gives a “Bus error”. That probably means I am exceeding the total 300 GB RAM of that node, right?

Any suggestions on how to track that leak?
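One way to test the out-of-memory hypothesis directly is to log the resident memory of the process around each batch step. The following is a minimal sketch, assuming a Linux host: it reads the peak resident set size via the POSIX call getrusage and the current one from /proc/self/status. The function names (peak_rss_kib, current_rss_kib, log_memory) are hypothetical helpers, not part of any library.

```cpp
#include <sys/resource.h>  // getrusage (POSIX)

#include <cstdio>
#include <fstream>
#include <string>

// Peak resident set size of this process in KiB.
// Assumption: Linux, where ru_maxrss is reported in kilobytes.
long peak_rss_kib() {
    struct rusage usage{};
    if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
    return usage.ru_maxrss;
}

// Current resident set size in KiB, parsed from the "VmRSS:" line
// of /proc/self/status (Linux only). Returns -1 if the line is missing.
long current_rss_kib() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {
            // The rest of the line looks like "   123456 kB";
            // std::stol skips the leading whitespace and stops at the unit.
            return std::stol(line.substr(6));
        }
    }
    return -1;
}

// Hypothetical helper: print memory usage at a labeled point in the pipeline.
void log_memory(const char* label) {
    std::fprintf(stderr, "[mem] %s: current=%ld KiB, peak=%ld KiB\n",
                 label, current_rss_kib(), peak_rss_kib());
}
```

Calling log_memory("after disk->RAM batch") and log_memory("after RAM->GPU copy") inside the batch loop would show whether usage keeps growing from one batch to the next, which is what a leak would look like.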

“Bus error” is a clear indication that the problem occurs in host code. It means that your program accessed an invalid memory address. Alternatively, it can be caused by a memory access that is not naturally aligned on a platform that does not support unaligned access.

Since most host platforms supported by CUDA support unaligned access to memory, the problem is very likely an access to an invalid memory address. So you would want to check for those, in particular the values of pointers used in the code. Possible error scenarios: (1) the pointer used may be uninitialized; (2) a pointer containing a device address was dereferenced in host code; (3) incorrect address arithmetic is performed on a pointer; (4) a pointer was corrupted by an out-of-bounds write elsewhere in the code.
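Scenario (3) is worth checking first when code that works on a small dataset fails on a much larger one, since offsets that fit in a 32-bit int at 10 GB may no longer fit at 2 TB. This is only a guess about the code in question, not a diagnosis. The sketch below shows one way to compute a batch offset in size_t and validate it against the size of the host buffer before it is added to a pointer. The names checked_mul and batch_offset are hypothetical, and a 64-bit host is assumed.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Multiply two sizes, throwing instead of silently wrapping around.
std::size_t checked_mul(std::size_t a, std::size_t b) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        throw std::overflow_error("size computation overflows size_t");
    }
    return a * b;
}

// Byte offset of batch `batch_index` inside a host buffer of `buffer_bytes`.
// Throws if the whole batch does not lie inside the buffer.
std::size_t batch_offset(std::size_t batch_index,
                         std::size_t batch_bytes,
                         std::size_t buffer_bytes) {
    const std::size_t offset = checked_mul(batch_index, batch_bytes);
    if (offset > buffer_bytes || batch_bytes > buffer_bytes - offset) {
        throw std::out_of_range("batch lies outside the host buffer");
    }
    return offset;
}
```

With a check like this in the disk->RAM and RAM->GPU batch functions, a bad index shows up as an exception at the point of the faulty computation rather than as a bus error at some later access.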