OK, here is the general gist, with the relevant system info front-loaded:
I have modprobe blacklisting `nouveau` and `nvidiafb`.
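For reference, the blacklist is a standard modprobe conf along these lines (the exact filename on my machine is from memory, so treat the path as approximate):

```
# /etc/modprobe.d/blacklist-nouveau.conf  (typical path; mine may differ)
blacklist nouveau
blacklist nvidiafb
```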
Specs:
- Ubuntu 20.04
- Zotac RTX 3060 LHR
- Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (6 cores / 12 threads)
*lsmem:*
```
Memory block size:       128M
Total online memory:     16G
Total offline memory:    0B
```
*dkms status:*
```
nvidia, 470.82.01, 5.4.0-91-generic, x86_64: installed
SierraLinuxQMIdrivers, S2.42N2.64, 5.4.0-90-generic, x86_64: installed
SierraLinuxQMIdrivers, S2.42N2.64, 5.4.0-91-generic, x86_64: installed
```
*uname -r:*
```
5.4.0-91-generic
```
In the attached report file you will see many attempts to load newer drivers. Only 460 would work for me, due to a conflict with the kernel API version. The problem persisted on 460, and when I attempted to upgrade to 495 the drivers failed to initialize during boot. After cleaning the entire system of libnvidia*, cuda*, and nvidia* packages, I got it upgraded to 470 (after another failed attempt at 495), and the problem still persisted. I will try upgrading to 495 again if you think it will fix the issue, but it will take a lot of time to figure out why it is failing to install and initialize.
nvidia-bug-report.log.gz (156.7 KB)
OK, so here is the kicker: this seems to happen when I am running time-of-flight inference on lidar data with a Python program I wrote (which I am not able to share publicly). I initially thought that TensorFlow, NumPy, SciPy, and scikit-learn were using too many threads or too much virtual memory, but after dropping the thread count from 12 to 1 the problem still persisted.
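For completeness, this is roughly how the thread pinning was done (a sketch of one common way; the exact mechanism in my script differs only in detail):

```python
import os

# Pin the common BLAS/OpenMP threading backends to a single thread.
# These must be set before numpy/scipy/sklearn/torch are first imported,
# otherwise the thread pools are already sized.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"
```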
It seems the driver or the CUDA runtime is crashing after running for an indeterminate amount of time. What I need help with right now is figuring out why it is crashing, and whether the logs shared here give you any insight into the problem.
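To at least bracket *when* the driver dies relative to my inference run, I can poll `nvidia-smi` from a sidecar script and timestamp the last successful query (a sketch; it assumes `nvidia-smi` is on PATH and answers cleanly while the driver is healthy):

```python
import shutil
import subprocess
import time


def gpu_alive():
    """Return True if nvidia-smi answers a trivial query; False if the
    binary is missing or the query fails (e.g. the GPU fell off the bus)."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=timestamp", "--format=csv,noheader"],
        capture_output=True, timeout=10,
    )
    return result.returncode == 0


def watch(interval_s=30):
    """Print a timestamp each poll; the last 'responding' line written
    before the crash brackets the failure time against the dmesg log."""
    while gpu_alive():
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "GPU responding")
        time.sleep(interval_s)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "GPU NOT responding")
```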
The problem does not occur when my inference script is not running, so something is kicking CUDA or the driver off the bus, and I would like help finding out what is causing it. By process of elimination there is some relationship with my code running PyTorch and NumPy. I am using the KITTI dataset for testing, with the PointPillars three-class model. I believe the current version of PyTorch uses the CUDA toolkit API for 11.1, which could be an issue if there are incompatibilities between API versions. Torch builds do not exist for CUDA 11.4 or 11.5; its last supported version is CUDA 11.3. I am not sure how to install the driver separately from the toolkit to test this hypothesis, since 470 ships with 11.4 and 460 ships with 11.2 (also not supported).
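This is how I am reading the toolkit version from Python, in case it helps someone spot the mismatch (guarded so it degrades gracefully when torch is absent; note this reports the CUDA toolkit the wheel was *built* against, not the driver's supported CUDA version shown by nvidia-smi):

```python
def torch_cuda_build():
    """Return the CUDA toolkit version the installed torch wheel was built
    against (e.g. '11.1' for a 1.8.2+cu111 wheel), or None if torch is not
    importable in this environment."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


build = torch_cuda_build()
```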
Below is the data for each version (the forum ate the double underscores; these are the `__version__` attributes):
```
python3.8
>>> torch.__version__
'1.8.2+cu111'
>>> numpy.__version__
'1.19.5'
>>> sklearn.__version__
'0.24.2'
>>> scipy.__version__
'1.7.1'
```
Their respective threading info:
```
[{'architecture': 'Haswell',
  'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-09e95953.3.13.so',
  'internal_api': 'openblas',
  'num_threads': 1,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.13'},
 {'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'architecture': 'Haswell',
  'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so',
  'internal_api': 'openblas',
  'num_threads': 1,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.9'},
 {'filepath': '/home/metrolla/.local/lib/python3.8/site-packages/scikit_learn.libs/libgomp-f7e03b3e.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None}]
```
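For reproducibility: that listing is in the format produced by `threadpoolctl.threadpool_info()`, and can be regenerated like this (guarded so it returns an empty list where threadpoolctl is not installed):

```python
def dump_pools():
    """Return the per-library BLAS/OpenMP pool info, mirroring the listing
    above; returns [] if threadpoolctl is not importable."""
    try:
        from threadpoolctl import threadpool_info
    except ImportError:
        return []
    return threadpool_info()


pools = dump_pools()
# When the pinning worked, every pool reports 'num_threads': 1.
```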
Thank you in advance for all your help. I hope that by front-loading this question with all the data, you can help me see something I am missing.