Unknown error when calling cudaStreamSynchronize

Hi Community,

I’m encountering an unexplained crash when the program calls cudaStreamSynchronize().
Instead of returning a cudaError_t, the call kills the program with signal 11 (segmentation fault).
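
For context, here is a minimal sketch of the call pattern (our real code goes through utils::cuda::Stream::Sync(), which is just a thin wrapper; this stand-in is not our actual source):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal sketch of the call pattern in question. The expectation is
    // that cudaStreamSynchronize reports failures through its cudaError_t
    // return value rather than terminating the process.
    int main() {
        cudaStream_t stream;
        cudaError_t err = cudaStreamCreate(&stream);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaStreamCreate: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // ... enqueue kernels / copies on the stream here ...

        err = cudaStreamSynchronize(stream);  // crashes with SIGSEGV instead
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaStreamSynchronize: %s\n", cudaGetErrorString(err));
        }

        cudaStreamDestroy(stream);
        return 0;
    }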

Here is an example backtrace:

    @     0x7f9f8a6f11b5  (unknown)
    @     0x7f9f8a58c644  (unknown)
    @     0x7f9f8a7bff66  (unknown)
    @     0x7f9f8a61b989  (unknown)
    @     0x7f9fa0121c90  (unknown)
    @     0x7f9fa01792e8  cudaStreamSynchronize
    @         0x117d75d0  utils::cuda::Stream::Sync()
    ...

The absolute addresses differ from run to run, but the module-relative offsets of the backtrace frames are identical across different crashes.
Subtracting the load base of libcuda.so, the offsets are as follows:

    0x7f9f8a6f11b5 (offset 0x3bf1b5)
      <- 0x7f9f8a58c644 (offset 0x25a644)
      <- 0x7f9f8a7bff66 (offset 0x48df66)
      <- 0x7f9f8a61b989 (offset 0x2e9989)
      <- 0x7f9fa0121c90 (offset 0x15defc90)
      <- 0x7f9fa01792e8 (offset 0x15e472e8)

The crash consistently occurs at offset 0x3bf1b5 in libcuda.so.
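
For anyone who wants to reproduce the offset computation: here is a small sketch using dladdr(), which maps an address from the backtrace to the load base of the shared object containing it (link with -ldl; the function name is my own):

    #include <dlfcn.h>
    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    // Resolve a backtrace address to "module + offset": dladdr() reports
    // the load base (dli_fbase) of the shared object containing the
    // address, and the module-relative offset is the difference.
    void print_module_offset(const void* addr) {
        Dl_info info;
        if (dladdr(addr, &info) != 0 && info.dli_fbase != nullptr) {
            std::uintptr_t offset = reinterpret_cast<std::uintptr_t>(addr) -
                                    reinterpret_cast<std::uintptr_t>(info.dli_fbase);
            std::printf("%p -> %s + 0x%" PRIxPTR "\n", addr, info.dli_fname, offset);
        }
    }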

After examining the disassembly of libcuda.so, I believe the faulting code dereferences a nullptr while traversing a mutex-guarded array of structures.
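
Purely as a hypothetical reconstruction (invented names, not libcuda’s actual code), the faulting pattern would look something like this:

    #include <mutex>

    // Hypothetical sketch of the suspected pattern: a mutex-guarded array
    // of structures whose entries are assumed to be non-null.
    struct Record { int state; };

    std::mutex table_mutex;
    Record* table[64];
    int table_size = 0;

    int sum_states() {
        std::lock_guard<std::mutex> lock(table_mutex);
        int total = 0;
        for (int i = 0; i < table_size; ++i) {
            // If unrelated buggy code overwrote table[i] with nullptr, this
            // dereference segfaults even though the mutex is held: the lock
            // protects against data races, not against memory corruption.
            total += table[i]->state;
        }
        return total;
    }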

Unfortunately, I can’t share the program.
The issue also does not reproduce consistently; it occurs intermittently.

Does anyone have insight into the cause of this crash, or suggestions on how to debug it?

System Info:

  • OS: Ubuntu 20.04
  • Kernel: 5.8.0.63-preempt
  • Driver: R515.86.01
  • CUDA Version: 11.7

Thanks in advance for any help!

A segfault on an innocuous call into a CUDA library can sometimes be evidence of host-code environment corruption, such as stack corruption.
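
A contrived illustration of the kind of host-code defect I mean (nothing to do with CUDA itself):

    #include <cstring>

    // Classic stack corruption: for a long enough input, strcpy writes past
    // the end of buf, clobbering whatever the compiler placed next to it
    // (saved registers, the return address, other locals). The damage often
    // surfaces much later, in a call that is itself perfectly innocent.
    void parse_name(const char* input) {
        char buf[16];
        std::strcpy(buf, input);  // no bounds check
    }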

If that is the case, it can be extraordinarily difficult to spot the error without some help. The usual suggestion I have is to divide and conquer: remove pieces of your program until the indicated error no longer occurs, then study the last section removed.


Besides Robert’s good suggestion:

One possible option for debugging would be to use a time-travel debugger to find out where the nullptr was set. That would only work if writing the nullptr was the direct, original error, and not, e.g., if some other pointer was corrupted in memory and now merely leads to a nullptr.

Also, try to trigger the error as early as possible, e.g. by calling cudaStreamSynchronize() more often.
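
For example, here is a sketch of a debug-only check-and-sync macro that can be dropped in after each enqueue to surface the failure closer to its cause (the macro name is mine):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Debug helper: synchronize the stream and check the result after every
    // enqueue, so a failure is reported near the operation that caused it.
    #define DEBUG_SYNC(stream)                                             \
        do {                                                               \
            cudaError_t e_ = cudaStreamSynchronize(stream);                \
            if (e_ != cudaSuccess) {                                       \
                std::fprintf(stderr, "%s:%d: cudaStreamSynchronize: %s\n", \
                             __FILE__, __LINE__, cudaGetErrorString(e_));  \
                std::abort();                                              \
            }                                                              \
        } while (0)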

Use a debug memory heap and run its consistency-check functions often. The bug could corrupt more memory structures than just the ones in the CUDA library’s regions. The earlier you catch the corruption, the easier it is to narrow down where it happened.
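
On glibc, one sketch of this approach uses mcheck()/mcheck_check_all(); note that mcheck() must be installed before the first allocation, or you can link with -lmcheck instead:

    #include <mcheck.h>
    #include <cstdio>
    #include <cstdlib>

    // Install glibc's debug heap and scan it on demand. Calling
    // mcheck_check_all() frequently (e.g. around each Stream::Sync()) helps
    // catch heap corruption close to the code that caused it.
    static void on_heap_error(enum mcheck_status status) {
        std::fprintf(stderr, "heap inconsistency detected (status %d)\n",
                     static_cast<int>(status));
        std::abort();
    }

    int main() {
        mcheck(on_heap_error);  // must run before the first malloc/new

        // ... application work ...

        mcheck_check_all();  // walk every allocation and verify its guards
        return 0;
    }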


I greatly appreciate the replies from Robert and Curefab.

We are currently in the process of reverting some questionable PRs and monitoring the resulting behaviors.

TBH, it is exceptionally challenging to attach debugging tools such as ASan or compute-sanitizer to our enormous system, which comprises dozens of modules running across multiple threads.
Furthermore, setting CUDA_LAUNCH_BLOCKING=1 or calling cudaStreamSynchronize() more frequently changes timing enough to cause unexpected system behavior, preventing us from accurately reproducing the crash; it may even destabilize the system.

Your insights and suggestions are highly valued as we navigate these complexities.
