I’m encountering an unknown error when the program calls cudaStreamSynchronize.
Instead of returning a cudaError_t, it crashes the program with a signal 11 (Segmentation Fault).
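For context, the call site is nothing unusual; a stripped-down sketch of the expected return-code path (stream setup simplified, not our real code) looks like this:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Normally an asynchronous failure would be reported here through the
    // cudaError_t return value; in our case the call itself dies with
    // SIGSEGV inside libcuda.so before it can return.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        std::printf("cudaStreamSynchronize failed: %s\n",
                    cudaGetErrorString(err));
    }

    cudaStreamDestroy(stream);
    return 0;
}
```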
The backtrace frames land at the same relative offsets across different crashes.
After subtracting the base address of libcuda.so, the offsets are as follows:
A segfault on an innocuous call into a CUDA library can sometimes be evidence of host-code environment corruption, such as stack corruption.
If that is the case, it can be extraordinarily difficult to spot the error without some help. The usual suggestion I have is to divide and conquer: remove pieces of your program until the indicated error no longer occurs, then study the last piece you removed.
One possible option for debugging would be to use a time-travel debugger to find out where the nullptr was set. That would only work if the nullptr is the direct original error and not, e.g., the result of following a pointer that was itself corrupted in memory.
Also, try to trigger the error as early as possible, e.g. by calling cudaStreamSynchronize() more often.
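For example, a debug-only helper that synchronizes and checks right after every kernel launch could look like this (the CHECK_CUDA and AFTER_LAUNCH macro names are purely illustrative, not from any real codebase):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: check a CUDA call and report where it failed.
#define CHECK_CUDA(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::abort();                                                     \
        }                                                                     \
    } while (0)

// In a debug build, force a checked synchronization after every launch so an
// asynchronous failure surfaces at the launch that caused it, not much later.
#ifdef DEBUG_SYNC
#define AFTER_LAUNCH(stream)                         \
    do {                                             \
        CHECK_CUDA(cudaGetLastError());              \
        CHECK_CUDA(cudaStreamSynchronize(stream));   \
    } while (0)
#else
#define AFTER_LAUNCH(stream) CHECK_CUDA(cudaGetLastError())
#endif

__global__ void kernel(float* out) { out[threadIdx.x] = 1.0f; }

int main() {
    float* d = nullptr;
    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));
    CHECK_CUDA(cudaMalloc(&d, 256 * sizeof(float)));

    kernel<<<1, 256, 0, stream>>>(d);
    AFTER_LAUNCH(stream);  // error is reported here instead of at a later sync

    CHECK_CUDA(cudaFree(d));
    CHECK_CUDA(cudaStreamDestroy(stream));
    return 0;
}
```

Built with -DDEBUG_SYNC the failure point moves right next to the offending launch; without the define, the extra synchronization overhead disappears.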
Use a debug memory heap and run its check functions often. The bug could be destroying more memory structures than just the ones in CUDA library regions. The sooner you catch the corruption, the easier it is to narrow down where it happened.
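As one concrete, glibc-specific example of such a debug heap, the mcheck facility can walk all live allocations on demand. This is only a sketch; on newer glibc versions the malloc debugging hooks live in a separate library, so you may need to preload libc_malloc_debug.so or link with -lmcheck for it to take effect:

```cpp
#include <mcheck.h>   // glibc heap-consistency checks (availability varies by glibc version)
#include <cstdio>
#include <cstdlib>

int main() {
    // Must be enabled before the first allocation. Depending on the glibc
    // version, preloading libc_malloc_debug.so or linking with -lmcheck may
    // also be required for this to have any effect.
    if (mcheck(nullptr) != 0) {
        std::fprintf(stderr, "mcheck could not be enabled\n");
    }

    char* buf = static_cast<char*>(std::malloc(64));

    // ... application work that might overrun buf ...

    // Call this periodically (e.g. once per iteration): it walks every live
    // allocation and aborts as soon as a clobbered block header/trailer is
    // found, which moves the failure much closer to the code that caused it.
    mcheck_check_all();

    std::free(buf);
    return 0;
}
```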
I greatly appreciate the replies from Robert and Curefab.
We are currently in the process of reverting some questionable PRs and monitoring the resulting behaviors.
TBH, it is exceptionally challenging to attach debugging tools such as ASAN or compute-sanitizer to our enormous system, which comprises dozens of modules running across multiple threads.
Furthermore, setting CUDA_LAUNCH_BLOCKING=1 or calling cudaStreamSynchronize() more frequently changes the system's behavior in unexpected ways, preventing us from accurately reproducing the crash; it may even lead to system instability.
Your insights and suggestions are highly valued as we navigate these complexities.