Seemingly random Segmentation Faults

I am running an application which I developed in Python. It consists of a few custom codelets and interfaces with Isaac Sim. Every once in a while, the application crashes with the message provided below. There doesn’t seem to be any pattern to when it crashes. With so little useful information in the crash report and a bug that is not easily reproducible, I am at a loss as to how to debug this. The crashes are a problem for me because I am training a machine learning model in simulation, so the application needs to be able to run for a long time.

Here is the crash message:

====================================================================================================
|                            Isaac application terminated unexpectedly                             |
====================================================================================================
#01 /home/autonomy/.cache/bazel/_bazel_autonomy/6b24af2a69ccd73917ff5525a5d0a656/execroot/isaac/bazel-out/k8-opt/bin/apps/charlotte_sim/training.runfiles/isaac/engine/pyalice/bindings.so(+0x12e59a) [0x7f7eb3fbd59a]
#02 google_breakpad::ExceptionHandler::GenerateDump(google_breakpad::ExceptionHandler::CrashContext*) /home/autonomy/.cache/bazel/_bazel_autonomy/6b24af2a69ccd73917ff5525a5d0a656/execroot/isaac/bazel-out/k8-opt/bin/apps/charlotte_sim/training.runfiles/isaac/engine/pyalice/bindings.so(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3f0) [0x7f7eb41220e0]
#03 google_breakpad::ExceptionHandler::SignalHandler(int, siginfo_t*, void*) /home/autonomy/.cache/bazel/_bazel_autonomy/6b24af2a69ccd73917ff5525a5d0a656/execroot/isaac/bazel-out/k8-opt/bin/apps/charlotte_sim/training.runfiles/isaac/engine/pyalice/bindings.so(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xc0) [0x7f7eb4122450]
#04 /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f7eb5b19f20]
#05 /usr/bin/python3(PyBytes_FromStringAndSize+0xa5) [0x5a1a95]
#06 /home/autonomy/.cache/bazel/_bazel_autonomy/6b24af2a69ccd73917ff5525a5d0a656/execroot/isaac/bazel-out/k8-opt/bin/apps/charlotte_sim/training.runfiles/isaac/engine/pyalice/bindings.so(+0xd0751) [0x7f7eb3f5f751]
#07 /home/autonomy/.cache/bazel/_bazel_autonomy/6b24af2a69ccd73917ff5525a5d0a656/execroot/isaac/bazel-out/k8-opt/bin/apps/charlotte_sim/training.runfiles/isaac/engine/pyalice/bindings.so(+0xe679d) [0x7f7eb3f7579d]
#08 /home/autonomy/.cache/bazel/_bazel_autonomy/6b24af2a69ccd73917ff5525a5d0a656/execroot/isaac/bazel-out/k8-opt/bin/apps/charlotte_sim/training.runfiles/isaac/engine/pyalice/bindings.so(+0xf1292) [0x7f7eb3f80292]
#09 /home/autonomy/.cache/bazel/_bazel_autonomy/6b24af2a69ccd73917ff5525a5d0a656/execroot/isaac/bazel-out/k8-opt/bin/apps/charlotte_sim/training.runfiles/isaac/engine/pyalice/bindings.so(+0xd866c) [0x7f7eb3f6766c]
#10 /usr/bin/python3(_PyCFunction_FastCallDict+0x35c) [0x56204c]
#11 /usr/bin/python3() [0x4f88ba]
#12 /usr/bin/python3(_PyEval_EvalFrameDefault+0x467) [0x4f98c7]
#13 /usr/bin/python3() [0x4f7a28]
#14 /usr/bin/python3() [0x4f876d]
#15 /usr/bin/python3(_PyEval_EvalFrameDefault+0x467) [0x4f98c7]
#16 /usr/bin/python3() [0x4f7a28]
#17 /usr/bin/python3() [0x4f876d]
#18 /usr/bin/python3(_PyEval_EvalFrameDefault+0x467) [0x4f98c7]
#19 /usr/bin/python3() [0x4f7a28]
#20 /usr/bin/python3() [0x4f876d]
#21 /usr/bin/python3(_PyEval_EvalFrameDefault+0x467) [0x4f98c7]
#22 /usr/bin/python3() [0x4f7a28]
#23 /usr/bin/python3() [0x4f876d]
#24 /usr/bin/python3(_PyEval_EvalFrameDefault+0x467) [0x4f98c7]
#25 /usr/bin/python3() [0x4f7a28]
#26 /usr/bin/python3() [0x4f876d]
#27 /usr/bin/python3(_PyEval_EvalFrameDefault+0x467) [0x4f98c7]
#28 /usr/bin/python3() [0x4f7a28]
#29 /usr/bin/python3() [0x4f876d]
#30 /usr/bin/python3(_PyEval_EvalFrameDefault+0x467) [0x4f98c7]
#31 /usr/bin/python3(_PyFunction_FastCallDict+0xf5) [0x4f4065]
#32 /usr/bin/python3() [0x5a1481]
#33 /usr/bin/python3(PyObject_Call+0x3e) [0x57c2fe]
#34 /usr/bin/python3() [0x5e5ea2]
#35 /usr/bin/python3() [0x638084]
#36 /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f7eb58c36db]
#37 /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f7eb5bfc88f]
====================================================================================================
Minidump written to: /tmp/57d678d3-42c2-483f-d89037a7-8cc161fd.dmp
./charlotte_training_app.sh: line 5:  6587 Segmentation fault      (core dumped) bazel run apps/charlotte_sim/training

Some developments for anybody who is interested. The application seems to run for a long time when limited to a single CPU core (the longest I’ve let it run is about 25 minutes), but it always eventually segfaults with a stack trace similar to the one above. Increasing the stack size had no effect. The only explanation I can think of is that memory is being exchanged somewhere between Python and C++: Python gives C++ access to some memory, but then releases that memory right afterwards.
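To make that suspicion concrete, here is a minimal sketch of the kind of lifetime problem I have in mind. It does not use pyalice at all: PinnedBuffer and read_back are hypothetical names, and ctypes only stands in for whatever the bindings really do. Frame #05 in the trace is PyBytes_FromStringAndSize, which copies bytes out of a raw pointer, so ctypes.string_at plays that role here.

```python
import ctypes


class PinnedBuffer:
    """Hypothetical helper: owns a payload and keeps it alive while native
    code holds its raw address."""

    def __init__(self, payload: bytes):
        # create_string_buffer copies the payload into memory owned by _buf.
        self._buf = ctypes.create_string_buffer(payload, len(payload))
        self.size = len(payload)

    @property
    def address(self) -> int:
        # Only valid for as long as this PinnedBuffer object is referenced.
        return ctypes.addressof(self._buf)


def read_back(address: int, size: int) -> bytes:
    """Stand-in for the native side: copy `size` bytes from a raw address."""
    return ctypes.string_at(address, size)


pinned = PinnedBuffer(b"sensor-frame")

# Safe: `pinned` is still referenced, so the memory behind the address is valid.
result = read_back(pinned.address, pinned.size)

# Unsafe pattern (deliberately NOT executed here): take the address, drop the
# last reference to the owner, and read later. The read may work, return
# garbage, or segfault, depending on when the memory is reused.
#   addr = PinnedBuffer(b"sensor-frame").address   # owner freed immediately
#   read_back(addr, 12)                            # dangling read
```

If something like the unsafe pattern is happening across threads, it would fit with the crashes being timing dependent and much rarer on a single core, but that is only a guess on my part.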

The most frustrating thing about this bug is that it is wildly inconsistent. You can run the same code 10 times and the application never seems to crash in the same place. I’ve been at this problem for over a week now, so any assistance would be greatly appreciated.
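One thing that may help narrow down where on the Python side the crash happens is the standard library's faulthandler module, which dumps the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. Below is a minimal sketch; enable_crash_tracebacks and the log file name are my own placeholders, and I have not verified how this interacts with the Breakpad handler in bindings.so, which may replace the handler if it is installed later.

```python
import faulthandler
import os
import tempfile


def enable_crash_tracebacks(log_path):
    """Ask CPython to dump the Python traceback of every thread on a fatal signal."""
    # The file must stay open for the life of the process: faulthandler
    # writes to its file descriptor from inside the signal handler.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; call this before the application is started.
crash_log_path = os.path.join(tempfile.gettempdir(), "training_python_crash.log")
crash_log = enable_crash_tracebacks(crash_log_path)
```

Setting the environment variable PYTHONFAULTHANDLER=1 enables the same handler without code changes, writing to stderr instead of a file.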