Segmentation fault on the simplest example

I am training PyTorch models on the GPU and get segfaults after a while (~20 minutes). The GPU gets into a bad state where even the simplest CUDA sample (vectorAdd) can't run. If I reboot the machine, the sample runs again, but it's only a matter of time before the GPU gets stuck and triggers the segfault again. Any ideas?

Platform: GCP
OS: Ubuntu 18.04
GPU: Tesla T4
Nvidia driver version: 440.33.01
CUDA version: 10.2
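
For reference, this is the kind of minimal runtime-API check I use to poke the GPU without the samples (my own reduction; the file name check_cuda.cu is just for illustration). The first runtime call lazily initializes the driver, which is where vectorAdd dies below, so I'd expect this to hit the same path once the GPU is in the bad state:

// check_cuda.cu -- minimal probe of the CUDA runtime (illustrative, not from the samples)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *d = nullptr;
    // The first runtime API call triggers the lazy driver load (cuInit).
    cudaError_t err = cudaMalloc(&d, 1 << 20);
    if (err != cudaSuccess) {
        std::printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("cudaMalloc succeeded\n");
    cudaFree(d);
    return 0;
}

Built with: nvcc check_cuda.cu -o check_cuda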

The backtrace from the vectorAdd sample is:

/usr/local/cuda/samples/0_Simple/vectorAdd$ gdb vectorAdd
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
Find the GDB manual and other documentation resources online at:
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vectorAdd...(no debugging symbols found)...done.
(gdb) r
Starting program: /usr/local/cuda-10.2/samples/0_Simple/vectorAdd/vectorAdd
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Vector addition of 50000 elements]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5abe94b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0 0x00007ffff5abe94b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007ffff5a840b5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffff5a85db5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff5afad14 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x000055555556f354 in cudart::globalState::loadDriverInternal() ()
#5 0x0000555555570323 in cudart::__loadDriverInternalUtil() ()
#6 0x00007ffff79bd827 in __pthread_once_slow (
once_control=0x5555557d03d8 <cudart::globalState::loadDriver()::loadDriverControl>,
init_routine=0x555555570300 <cudart::__loadDriverInternalUtil()>) at pthread_once.c:116
#7 0x00005555555ad7f9 in cudart::cuosOnce(int*, void (*)()) ()
#8 0x0000555555571753 in cudart::globalState::initializeDriver() ()
#9 0x00005555555959e2 in cudaMalloc ()
#10 0x000055555555ae8e in main ()
(gdb)
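
So the segfault happens inside cuInit(), reached from the very first runtime call (cudaMalloc) through the runtime's lazy driver load. For completeness, the driver can also be probed directly through the driver API; a minimal sketch (my own, the file name cuinit_check.c is illustrative; it links against libcuda) would be:

/* cuinit_check.c -- call the driver API entry point seen in frame #3 of the backtrace */
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUresult rc = cuInit(0);            /* same function that crashes above */
    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);     /* driver API error string */
        printf("cuInit failed: %s\n", msg ? msg : "unknown error");
        return 1;
    }
    printf("cuInit succeeded\n");
    return 0;
}

Built with: gcc cuinit_check.c -o cuinit_check -I/usr/local/cuda/include -lcuda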

The output from nvidia-smi is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8     9W /  70W |     36MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2887      C   nvidia-cuda-mps-server                        25MiB |
+-----------------------------------------------------------------------------+