Segmentation fault on the simplest example

I am training PyTorch models on the GPU and get segfaults after a while (~20 minutes). The GPU gets into a bad state where even the simplest CUDA sample (vectorAdd) can't run. If I reboot the machine, the sample runs again, but it's only a matter of time before the GPU gets stuck and triggers the segfault again. Any ideas?

Platform: GCP
OS: Ubuntu 18.04
GPU: Tesla T4
Nvidia driver version: 440.33.01
CUDA version: 10.2
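
For reference, this is the kind of minimal runtime-API check I use to poke the GPU without the samples (my own reduction; the file name check_cuda.cu is just for illustration). The first runtime call lazily initializes the driver, which is where vectorAdd dies below, so I'd expect this to hit the same path once the GPU is in the bad state:

// check_cuda.cu -- minimal probe of the CUDA runtime (illustrative, not from the samples)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *d = nullptr;
    // The first runtime API call triggers the lazy driver load (cuInit).
    cudaError_t err = cudaMalloc(&d, 1 << 20);
    if (err != cudaSuccess) {
        std::printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("cudaMalloc succeeded\n");
    cudaFree(d);
    return 0;
}

Built with: nvcc check_cuda.cu -o check_cuda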

The backtrace from the vectorAdd sample is:

/usr/local/cuda/samples/0_Simple/vectorAdd$ gdb vectorAdd
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
Find the GDB manual and other documentation resources online at:
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vectorAdd...(no debugging symbols found)...done.
(gdb) r
Starting program: /usr/local/cuda-10.2/samples/0_Simple/vectorAdd/vectorAdd
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Vector addition of 50000 elements]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5abe94b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0 0x00007ffff5abe94b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007ffff5a840b5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffff5a85db5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff5afad14 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x000055555556f354 in cudart::globalState::loadDriverInternal() ()
#5 0x0000555555570323 in cudart::__loadDriverInternalUtil() ()
#6 0x00007ffff79bd827 in __pthread_once_slow (
once_control=0x5555557d03d8 <cudart::globalState::loadDriver()::loadDriverControl>,
init_routine=0x555555570300 <cudart::__loadDriverInternalUtil()>) at pthread_once.c:116
#7 0x00005555555ad7f9 in cudart::cuosOnce(int*, void (*)()) ()
#8 0x0000555555571753 in cudart::globalState::initializeDriver() ()
#9 0x00005555555959e2 in cudaMalloc ()
#10 0x000055555555ae8e in main ()
(gdb)
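
So the segfault happens inside cuInit(), reached from the very first runtime call (cudaMalloc) through the runtime's lazy driver load. For completeness, the driver can also be probed directly through the driver API; a minimal sketch (my own, the file name cuinit_check.c is illustrative; it links against libcuda) would be:

/* cuinit_check.c -- call the driver API entry point seen in frame #3 of the backtrace */
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUresult rc = cuInit(0);            /* same function that crashes above */
    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);     /* driver API error string */
        printf("cuInit failed: %s\n", msg ? msg : "unknown error");
        return 1;
    }
    printf("cuInit succeeded\n");
    return 0;
}

Built with: gcc cuinit_check.c -o cuinit_check -I/usr/local/cuda/include -lcuda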

The output from nvidia-smi is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8     9W /  70W |     36MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2887      C   nvidia-cuda-mps-server                        25MiB |
+-----------------------------------------------------------------------------+