RTX3060 LHR falls off bus - Ubuntu 20.04 running pytorch and numpy inference code

OK, here is the general gist, with the key details up front to fast-track things.
I have modprobe blacklisting nouveau and nvidiafb.


Ubuntu 20.04
Zotac RTX3060 LHR
Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz 12 core

Memory block size:       128M
Total online memory:      16G
Total offline memory:      0B

dkms status

nvidia, 470.82.01, 5.4.0-91-generic, x86_64: installed
SierraLinuxQMIdrivers, S2.42N2.64, 5.4.0-90-generic, x86_64: installed
SierraLinuxQMIdrivers, S2.42N2.64, 5.4.0-91-generic, x86_64: installed

uname -r

In the attached report file you will see many attempts to load newer drivers. Only 460 would work for me, due to a conflict at the kernel API version. The problem persisted with 460, and when I attempted to upgrade to 495 the driver failed to initialize during boot. After cleaning the entire system of libnvidia*, cuda* and nvidia* packages, I got it upgraded to 470 (after a failed attempt at driver 495), and the problem persisted. I will try upgrading to driver 495 again if you think it will fix the issue, but it will take a lot of time to figure out why it is failing to install and initialize.
nvidia-bug-report.log.gz (156.7 KB)

OK, so here is the kicker: this seems to happen when I am running time-of-flight inference on lidar data using a Python program I wrote and am not able to share publicly. I initially thought that tensorflow, numpy, scipy and scikit-learn were using too many threads or too much virtual memory, but after reducing the thread count from 12 to 1 the problem still persisted.

It seems the driver or the CUDA runtime crashes after running for an indeterminate amount of time. Right now what I need help with is figuring out why it is crashing, and whether the logs shared here give you any insight into the problem.
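For anyone digging through similar logs, here is a minimal sketch of how the kernel log could be scanned for NVIDIA Xid events, assuming the usual `NVRM: Xid (PCI:...): <code>` prefix. The function name `find_xid_events` and the sample log text are made up for illustration and are not taken from the attached bug report; the tail of the message can differ between driver versions.

```python
import re

# Matches the "NVRM: Xid (PCI:<bus id>): <code>" prefix that the NVIDIA kernel
# module writes to the kernel log. Only the prefix is matched because the rest
# of the line varies.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_bus_id, xid_code) tuples found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


# Illustrative sample only (hypothetical timestamps and pid).
sample = (
    "[ 1234.5] NVRM: Xid (PCI:0000:01:00): 79, pid=4242, GPU has fallen off the bus.\n"
    "[ 1234.6] some unrelated kernel message\n"
)
events = find_xid_events(sample)
print(events)
```

The text to scan can come from `dmesg` or `journalctl -k`. Xid 79 is the code NVIDIA documents for "GPU has fallen off the bus", so the timestamps around those entries are a reasonable place to start correlating with what the script was doing.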

This problem does not occur when I am not running my inference script, so something in it is kicking CUDA or the driver off the bus, and I would like some help finding out what is causing this. By process of elimination, there is some relationship to my code running PyTorch and numpy. I am using the KITTI dataset for testing, with the PointPillars three-class model. I believe the current version of PyTorch uses the CUDA toolkit API for 11.1, which could be an issue if there are incompatibilities between API versions. Torch is not available for CUDA 11.4 or 11.5; the latest supported version is CUDA 11.3. I am not sure how to install the driver separately from the toolkit to test this hypothesis, as 470 comes with 11.4 and 460 comes with 11.2 (also not supported).
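To make the version question concrete, here is a rough sketch of a comparison between the CUDA runtime a framework was built against and the CUDA version the driver reports. It assumes the version shown in the `nvidia-smi` header is the newest runtime that driver supports, and that `torch.version.cuda` gives the runtime PyTorch was built with. The helper names `parse_version` and `driver_covers_runtime` are hypothetical, and the check is deliberately conservative: it ignores any minor-version compatibility within a CUDA major release.

```python
def parse_version(text):
    """Turn a string such as '11.4' into (11, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_covers_runtime(runtime_cuda, driver_max_cuda):
    """Return True if the driver's reported CUDA version is at least the runtime's.

    runtime_cuda:    version the framework was built with (e.g. torch.version.cuda)
    driver_max_cuda: version shown in the nvidia-smi header for the installed driver
    """
    return parse_version(driver_max_cuda) >= parse_version(runtime_cuda)


# Version pairs mentioned in this thread: PyTorch on 11.1 or 11.3,
# driver 470 reporting 11.4, driver 460 reporting 11.2.
for runtime, driver in [("11.1", "11.4"), ("11.3", "11.4"), ("11.3", "11.2")]:
    print(runtime, driver, driver_covers_runtime(runtime, driver))
```

Under that assumption, a mismatch where the driver reports a newer CUDA version than the framework was built with is not, by itself, expected to be a problem; the reverse case is the one worth checking first.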

Below is the data for each version:





Their respective driver and threading info:

[{'architecture': 'Haswell',
  'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-09e95953.3.13.so',
  'internal_api': 'openblas',
  'num_threads': 1,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.13'},
 {'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'architecture': 'Haswell',
  'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so',
  'internal_api': 'openblas',
  'num_threads': 1,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.9'},
 {'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/scikit_learn.libs/libgomp-f7e03b3e.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None}]
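The dump above has the shape of what threadpoolctl's `threadpool_info()` reports, with every pool at `num_threads: 1`. As a sketch of one common way to get there without extra packages, the environment variables below can be set before numpy, scipy or torch are imported. This is an assumption about how the limit could be applied, not necessarily how it was done here; `limit_native_threads` is a hypothetical helper name.

```python
import os

# Environment variables read by the OpenMP and OpenBLAS runtimes when their
# shared libraries are first loaded.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_native_threads(n=1):
    """Set the thread-count variables and return the values that were applied."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


applied = limit_native_threads(1)
print(applied)

# Import numpy / scipy / torch only after this point; libraries that are
# already loaded will not re-read these variables.
```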

Thank you in advance for all your help. I hope that by front-loading this question with all the data, you can help me see something I am missing.

Likely falling off the bus due to insufficient power on power spikes. Try limiting the clocks using nvidia-smi -lgc
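A small sketch of what the clock-limiting suggestion could look like when scripted. `-lgc` takes a `min,max` pair in MHz and `-rgc` resets it; the 210 and 1400 MHz values and the helper names are placeholders for illustration, not recommended settings for this card. The commands are only built here, not executed, since running them normally needs root.

```python
import shlex


def build_lock_clocks_cmd(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi command that locks GPU clocks to a min,max range."""
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def build_reset_clocks_cmd(gpu_index=0):
    """Build the nvidia-smi command that removes a previously set clock lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


# Placeholder clock values; pick a range suited to the actual card.
cmd = build_lock_clocks_cmd(210, 1400)
print(shlex.join(cmd))
print(shlex.join(build_reset_clocks_cmd()))

# To apply it, pass the list to subprocess.run(cmd, check=True) with root rights.
```

If the crashes stop with a lowered maximum clock, that would point toward power delivery; if they continue, the software stack becomes the more likely suspect.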

I decided to nuke the NVIDIA drivers and the CUDA drivers. I bumped it up to CUDA 11.3 and NVIDIA driver 470 using runfiles (I discovered that even though driver 495 is suggested by Ubuntu and elsewhere, it is not what the driver recommendation tool suggests). Then I rebuilt PyTorch for cu113 and rebuilt mmdetection against that PyTorch, and it has been running stable for close to 24 hours now. I will let it run all through the week; if it stays up, that will be my sign that it is probably not the power but a driver incompatibility with a library.

I honestly hope that is all it is.