host malloc segmentation fault


i am debugging a project, and the project can not yet stand on its feet; i run it via the debugger

some way into the program, i get a segmentation fault on a (host) memory allocation - malloc calls _int_malloc, and the sigsegv is raised inside the latter

if i add another malloc of smaller size (300 * sizeof(int) vs 3000 * sizeof(int)) just prior to the malloc line yielding the segmentation fault, that particular malloc succeeds

htop shows memory prior to launching the program (via the debugger) at 2107/3907, and at the time of the sigsegv signal at 2540/3907

when i call getrlimit on RLIMIT_STACK and RLIMIT_AS, the returned values suggest to me that these limits are in order (and cannot be the cause)
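for reference, a minimal sketch of such a limit check, assuming a Linux/POSIX host; the helper name print_limit is made up for illustration and is not part of the project:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: print the soft and hard value of one resource limit.
 * Returns 0 on success, -1 if getrlimit() fails. */
int print_limit(const char *name, int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0) {
        perror(name);
        return -1;
    }

    if (rl.rlim_cur == RLIM_INFINITY)
        printf("%s soft: unlimited\n", name);
    else
        printf("%s soft: %llu bytes\n", name, (unsigned long long)rl.rlim_cur);

    if (rl.rlim_max == RLIM_INFINITY)
        printf("%s hard: unlimited\n", name);
    else
        printf("%s hard: %llu bytes\n", name, (unsigned long long)rl.rlim_max);

    return 0;
}
```

calling print_limit("RLIMIT_STACK", RLIMIT_STACK) and print_limit("RLIMIT_AS", RLIMIT_AS) just before the failing malloc shows whether either limit is anywhere near the requested 3000 * sizeof(int) bytes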

why would the malloc be denied then?

The best working hypothesis I can think of is that prior activity in the application damages the internal data structures that control memory allocation (like the links that chain free blocks). That would also explain why a smaller malloc at the same spot succeeds: it may be served from a different, undamaged part of the heap. Check for pointer use on freed allocations, writes out of bounds on previous allocations, and multiple calls to free() on the same pointer.
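To make the out-of-bounds case concrete, here is a minimal sketch of a guarded allocator that places a canary pattern behind each block so that an overrun can be detected before it reaches the allocator's own bookkeeping. The names guarded_malloc, guarded_check, and guarded_free, as well as the guard size and pattern, are invented for this illustration; it only catches writes just past the end of a block, not underruns or use after free.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTES   16    /* header size and canary size (assumed values) */
#define GUARD_PATTERN 0xA5  /* byte value written into the canary */

/* Layout of the underlying block:
 * [header: requested size, padded to GUARD_BYTES][user data][canary] */
void *guarded_malloc(size_t size)
{
    unsigned char *raw;

    if (size > SIZE_MAX - 2 * GUARD_BYTES)
        return NULL;                      /* total size would overflow */

    raw = malloc(GUARD_BYTES + size + GUARD_BYTES);
    if (raw == NULL)
        return NULL;

    memcpy(raw, &size, sizeof size);      /* remember the requested size */
    memset(raw + GUARD_BYTES + size, GUARD_PATTERN, GUARD_BYTES);
    return raw + GUARD_BYTES;             /* pointer handed to the caller */
}

/* Returns 1 if the trailing canary is intact, 0 if something wrote
 * past the end of the user data. */
int guarded_check(const void *ptr)
{
    const unsigned char *user = ptr;
    size_t size;
    size_t i;

    memcpy(&size, user - GUARD_BYTES, sizeof size);
    for (i = 0; i < GUARD_BYTES; i++) {
        if (user[size + i] != GUARD_PATTERN)
            return 0;
    }
    return 1;
}

/* Release a block obtained from guarded_malloc(). */
void guarded_free(void *ptr)
{
    if (ptr != NULL)
        free((unsigned char *)ptr - GUARD_BYTES);
}
```

Calling guarded_check() on the live allocations after each loop in the suspect code narrows down which loop writes past the end of its buffer.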

I assume you are on Linux. If so there are a couple of tools that can help you pinpoint the problem:

(1) Export the environment variable MALLOC_CHECK_=1 to have glibc print a diagnostic message to stderr when it detects heap corruption in malloc/free/realloc. You can also try MALLOC_CHECK_=2 to abort at the first sign that things are out of whack.

(2) Run the code under valgrind. Note that this can cause massive slowdown, so it may or may not be an option. At least with older versions of valgrind there were also some issues with false positives reported on host data structures into which the GPU transfers data by DMA.

sound hypothesis; excellent tools; major logic flaw

you suggested valgrind previously, but, as i was still debugging, i was hesitant to forward to valgrind a program that can barely crawl, let alone stand upright

but malloc_check_ is rather a gem in its own right
i struggle to read the output, i just follow the gestures
and it quickly pointed out flaws in my iteration logic, that i could not pick up through stepping the program with the debugger

it will likely take a lot more time to finish debugging the program, but at least i now get past the previous sigsegv line

thanks, cuda oracle njuffa