I’m encountering an unknown error when the program calls cudaStreamSynchronize.
Instead of returning a cudaError_t, it crashes the program with a signal 11 (Segmentation Fault).
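For context, the call site is nothing unusual; a stripped-down sketch of the expected return-code path (stream setup simplified, not our real code) looks like this:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Normally an asynchronous failure would be reported here through the
    // cudaError_t return value; in our case the call itself dies with
    // SIGSEGV inside libcuda.so before it can return.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        std::printf("cudaStreamSynchronize failed: %s\n",
                    cudaGetErrorString(err));
    }

    cudaStreamDestroy(stream);
    return 0;
}
```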
The backtrace frames land at the same relative offsets across different crashes.
After subtracting the base address of libcuda.so, the offsets are as follows:
A segfault on an innocuous call into a CUDA library can sometimes be evidence of host-code environment corruption, such as stack corruption.
If that is the case, it can be extraordinarily difficult to spot the error without some help. The usual suggestion I have is to divide and conquer: remove pieces of your program until the indicated error no longer occurs, then study the last piece you removed.
One possible option for debugging would be to use a time-travel debugger to find out where the nullptr was set. That would only work if the nullptr is the direct original error and not, e.g., the result of following a pointer that was itself corrupted in memory.
Also, try to trigger the error as early as possible, e.g. by calling cudaStreamSynchronize() more often.
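For example, a debug-only helper that synchronizes and checks right after every kernel launch could look like this (the CHECK_CUDA and AFTER_LAUNCH macro names are purely illustrative, not from any real codebase):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: check a CUDA call and report where it failed.
#define CHECK_CUDA(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::abort();                                                     \
        }                                                                     \
    } while (0)

// In a debug build, force a checked synchronization after every launch so an
// asynchronous failure surfaces at the launch that caused it, not much later.
#ifdef DEBUG_SYNC
#define AFTER_LAUNCH(stream)                         \
    do {                                             \
        CHECK_CUDA(cudaGetLastError());              \
        CHECK_CUDA(cudaStreamSynchronize(stream));   \
    } while (0)
#else
#define AFTER_LAUNCH(stream) CHECK_CUDA(cudaGetLastError())
#endif

__global__ void kernel(float* out) { out[threadIdx.x] = 1.0f; }

int main() {
    float* d = nullptr;
    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));
    CHECK_CUDA(cudaMalloc(&d, 256 * sizeof(float)));

    kernel<<<1, 256, 0, stream>>>(d);
    AFTER_LAUNCH(stream);  // error is reported here instead of at a later sync

    CHECK_CUDA(cudaFree(d));
    CHECK_CUDA(cudaStreamDestroy(stream));
    return 0;
}
```

Built with -DDEBUG_SYNC the failure point moves right next to the offending launch; without the define, the extra synchronization overhead disappears.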
Use a debug memory heap and run its check functions often. The bug could be destroying more memory structures than just the ones in CUDA library regions. The sooner you catch the corruption, the easier it is to narrow down where it happened.
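As one concrete, glibc-specific example of such a debug heap, the mcheck facility can walk all live allocations on demand. This is only a sketch; on newer glibc versions the malloc debugging hooks live in a separate library, so you may need to preload libc_malloc_debug.so or link with -lmcheck for it to take effect:

```cpp
#include <mcheck.h>   // glibc heap-consistency checks (availability varies by glibc version)
#include <cstdio>
#include <cstdlib>

int main() {
    // Must be enabled before the first allocation. Depending on the glibc
    // version, preloading libc_malloc_debug.so or linking with -lmcheck may
    // also be required for this to have any effect.
    if (mcheck(nullptr) != 0) {
        std::fprintf(stderr, "mcheck could not be enabled\n");
    }

    char* buf = static_cast<char*>(std::malloc(64));

    // ... application work that might overrun buf ...

    // Call this periodically (e.g. once per iteration): it walks every live
    // allocation and aborts as soon as a clobbered block header/trailer is
    // found, which moves the failure much closer to the code that caused it.
    mcheck_check_all();

    std::free(buf);
    return 0;
}
```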
I greatly appreciate the replies from Robert and Curefab.
We are currently in the process of reverting some questionable PRs and monitoring the resulting behaviors.
TBH, it is exceptionally challenging to attach debugging tools such as ASAN or compute-sanitizer to our enormous system, which comprises dozens of modules running across multiple threads.
Furthermore, setting CUDA_LAUNCH_BLOCKING=1 or calling cudaStreamSynchronize() more frequently changes the system's behavior in unexpected ways, preventing us from accurately reproducing the crash; it may even lead to system instability.
Your insights and suggestions are highly valued as we navigate these complexities.