After five support agents told me they are not responsible for my problem, I'm trying again here:
In a nutshell:
CUDA reports “unknown error” when running the samples to test the installation.
The console output of my problem, plus the nvcc, nvidia-smi & lsmod output, is shown here:
The log from nvidia-bug-report.sh & the installation log are attached to support reference 220710-000470.
I can post them here too if you like, but they're quite long, so that would probably just clutter the post.
My intention is to use the cuDNN backend for OpenCV. OpenCV compiles without any problems and works up to the point where I try to use the CUDA backend.
OpenCV throws the following error at me, which is in my opinion a symptom of a lower-level issue (also visible in the issues stated above):
terminate called after throwing an instance of 'cv::dnn::cuda4dnn::csl::CUDAException'
what(): OpenCV(4.6.0-dev) /home/lrmts/Downloads/OpenCV/opencv-4.x/modules/dnn/src/cuda4dnn/csl/memory.hpp:54: error: (-217:Gpu API call) unknown error in function 'ManagedPtr'
Any input on where the problem might be is highly appreciated.
nvidia-uvm isn’t loaded. Please put it in the list of modules to load on boot, or install nvidia-modprobe so normal users can load it, or run deviceQuery once as root to load it.
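A minimal sketch of those three options, assuming an Ubuntu layout (the drop-in file name is illustrative):

```shell
# Option 1: load nvidia-uvm at every boot via systemd-modules-load:
#   echo nvidia-uvm | sudo tee /etc/modules-load.d/nvidia-uvm.conf
# Option 2: install nvidia-modprobe so unprivileged CUDA programs can load it on demand:
#   sudo apt-get install nvidia-modprobe
# Option 3: load it once by hand (root only):
#   sudo modprobe nvidia-uvm
# Either way, verify afterwards (module names use underscores in /proc/modules):
if grep -q '^nvidia_uvm ' /proc/modules 2>/dev/null; then
  echo "nvidia-uvm loaded"
else
  echo "nvidia-uvm NOT loaded"
fi
```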
Thanks for the quick response. So, if I understand correctly, I’d need to add nvidia-uvm to /etc/modules or install nvidia-modprobe, correct?
I have already run deviceQuery as root, with the same result. I have also tried running nvidia-modprobe, but deviceQuery still fails the same way.
Please check whether the nvidia-uvm module gets loaded. If it still isn’t loaded, please create an nvidia-bug-report.log and attach it.
Hi, sorry for the very long delay… I was assigned to another project for the last two weeks…
Based on lsmod, it really looks like nvidia-uvm is not loaded:
$ lsmod|grep nvidia
nvidia_drm 69632 2
nvidia_modeset 1142784 4 nvidia_drm
nvidia 40804352 130 nvidia_modeset
drm_kms_helper 307200 2 nvidia_drm,i915
nvidia_wmi_ec_backlight 16384 0
drm 606208 12 drm_kms_helper,nvidia,nvidia_drm,i915,ttm
wmi 32768 3 hp_wmi,nvidia_wmi_ec_backlight,wmi_bmof
Running nvidia-modprobe does not change anything.
The output from nvidia-bug-report.sh (run with sudo) is attached.
nvidia-bug-report.log (2.7 MB)
Thanks in advance for your input.
It’s also visible in the logs; the module can’t be loaded:
systemd-udevd: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Please post the outputs of
sudo modinfo nvidia-uvm
sudo modprobe nvidia-uvm
$ sudo modinfo nvidia-uvm
[sudo] password for lrmts:
license: Dual MIT/GPL
vermagic: 5.15.0-41-generic SMP mod_unload modversions
signer: ubuntu Secure Boot Module Signature key
parm: uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm: uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (i.e, HMM is potentially enabled). Ignored if HMM is not supported in the driver, or if ATS settings conflict with HMM. (bool)
parm: uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm: uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm: uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm: uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm: uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm: uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm: uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations. Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm: uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations. Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm: uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values as of Volta: 64k, 2m, 16m, 16g (charp)
parm: uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification. Valid values: [1, 65535] (uint)
parm: uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm: uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm: uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm: uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm: uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm: uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm: uvm_debug_prints:Enable uvm debug prints. (int)
parm: uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
$ sudo modprobe nvidia-uvm
modprobe: ERROR: could not insert 'nvidia_uvm': Operation not permitted
Looks like a permission issue to me. I could understand it if I were running this as a non-root user, but as root I don’t really understand it…
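For what it’s worth, EPERM from modprobe run as root usually means the kernel itself refused the module rather than a filesystem permission problem; with Secure Boot active, the kernel rejects modules whose signature it cannot verify. A hedged sketch of checks that might narrow this down (commands assume an Ubuntu system; the script operates on a made-up sample log line, not a live kernel):

```shell
# On a live system one would run:
#   mokutil --sb-state                           # is Secure Boot enabled in firmware?
#   sudo dmesg | grep -iE 'lockdown|pkcs|uvm'    # why did the kernel refuse the module?
# The tell-tale line looks roughly like the (made-up) sample below; grepping for
# "Lockdown" or a signature error distinguishes a signing problem from anything else.
sample='Lockdown: modprobe: unsigned module loading is restricted; see man kernel_lockdown.7'
case "$sample" in
  *Lockdown*|*signature*) echo "kernel refused module: signing/lockdown issue" ;;
  *)                      echo "no lockdown marker found" ;;
esac
```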
You have Secure Boot enabled, but for some reason the nvidia-uvm module doesn’t get signed on install.
On second look, it seems to be signed.
I tried to follow the installation procedure and there is a step where an outdated signing key is replaced (see Network repo installation).
But, for debugging purposes: do you think disabling Secure Boot should do the trick?
I can confirm that after disabling Secure Boot, mnistCUDNN & deviceQuery return PASS / Test passed.
I still have an error with OpenCV, but it’s a different message & is OpenCV-related → a different story that doesn’t belong in this forum.
I’d still be interested in the cause of the issue, since I can’t see what I did wrong, but disabling Secure Boot solves the problem for the moment.
Thanks for your help!
I’m also a bit puzzled; if the signing key were invalid, the other nvidia modules shouldn’t load either. Please check modinfo nvidia and compare the key fingerprints to make sure the same key was used.
sig_id, signer & sig_key shown by modinfo are the same for nvidia & nvidia-uvm. Is there another parameter I should check?
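For reference, a tiny sketch of how the two fingerprints can be diffed (the modinfo fields below are made-up samples standing in for the real keys):

```shell
# On a live system the comparison is a one-liner:
#   diff <(modinfo nvidia | grep '^sig_key') <(modinfo nvidia-uvm | grep '^sig_key')
# Helper: extract and normalize the sig_key line from modinfo output.
sig_key() { grep '^sig_key:' | tr -d ' \t'; }

# Made-up sample outputs standing in for `modinfo nvidia` / `modinfo nvidia-uvm`:
nvidia_info='license:        Dual MIT/GPL
sig_key:        AA:BB:CC:DD'
uvm_info='license:        Dual MIT/GPL
sig_key:        AA:BB:CC:DD'

a=$(printf '%s\n' "$nvidia_info" | sig_key)
b=$(printf '%s\n' "$uvm_info" | sig_key)
[ "$a" = "$b" ] && echo "same signing key" || echo "different signing keys"
```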
The modules are auto-signed by dkms with the key created when Ubuntu was initially installed, so there’s nothing you could have done wrong. Maybe report this to the Ubuntu bug tracker instead; I can’t really think of a reason for the uvm module being considered invalid. I’d expect that if modinfo displays the key and the keys are the same, it should work.
Thanks for the additional feedback. I’ll try to bring this to the attention of the Ubuntu community then.
Looks like there have been similar issues with previous versions:
I have added a comment with a link to this discussion as an attachment to this bug report.
Thanks again for your support!