"unknown error" from CUDA 11.7 (Ubuntu 22.04 64bit)

CSommer · July 11, 2022, 5:08pm

Hello everybody,
After five support agents told me they are not responsible for my problem, I am trying again here:

In a nutshell:
CUDA reports “unknown error” when running the samples to test the installation.

Console output of my problem, nvcc, nvidia-smi & lsmod are stated here:
https://forums.developer.nvidia.com/t/cuda-samples-not-working-possible-installation-mistakes-help-please/214615/2?u=csommer

The log from the nvidia-bug-report.sh & the installation log is attached to support reference 220710-000470.
I can post them here too if you like but it’s quite some log so… probably just cluttering the post.

My intention is to use the CUDNN backend for OpenCV. OpenCV compiles without any problems and works up to the point where I try to use the CUDA backend.
OpenCV throws the following error, which in my opinion is a symptom of a lower-level issue (also visible in the output linked above):

terminate called after throwing an instance of 'cv::dnn::cuda4dnn::csl::CUDAException'
  what():  OpenCV(4.6.0-dev) /home/lrmts/Downloads/OpenCV/opencv-4.x/modules/dnn/src/cuda4dnn/csl/memory.hpp:54: error: (-217:Gpu API call) unknown error in function 'ManagedPtr'

Any input on where the problem might be is highly appreciated.

Thanks!

generix · July 11, 2022, 5:20pm

nvidia-uvm isn’t loaded. Please either add it to the list of modules to load on boot, install nvidia-modprobe so normal users can load it, or run deviceQuery once as root to load it.
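
The first option above (loading the module at boot) can be sketched as a small helper that appends a module name to a modules list file only if it is not already listed. This is a minimal sketch, not a tested recipe for this system: the function name `ensure_module_line` is hypothetical, and `/etc/modules` is assumed to be the list file, as mentioned later in the thread.

```shell
#!/bin/sh
# Sketch: add a module name to a boot-time modules list exactly once.
# ensure_module_line is a hypothetical helper name, not part of any package.
ensure_module_line() {
    # $1 = modules list file (assumed: /etc/modules), $2 = module name
    # grep -qxF matches the whole line literally; append only when absent.
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage on a live system (as root):
#   ensure_module_line /etc/modules nvidia-uvm
```

Calling the helper twice leaves a single entry in the file, so it is safe to re-run.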

CSommer · July 11, 2022, 5:25pm

Thanks for the quick response. So, if I understand correctly, I’d need to add nvidia-uvm to /etc/modules or install nvidia-modprobe, correct?

I have already run deviceQuery as root with the same result. I have also tried running nvidia-modprobe, but deviceQuery still shows the same issue.

generix · July 11, 2022, 5:52pm

Please check if the nvidia-uvm module gets loaded. If it’s there, please create an nvidia-bug-report.log and attach it.
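
One way to run that check is to look for the module in `lsmod` output. The sketch below assumes the usual `lsmod` column layout (module name in the first column) and uses a hypothetical helper name, `module_listed`. Note that `lsmod` shows the name with an underscore (`nvidia_uvm`), not a hyphen.

```shell
#!/bin/sh
# Sketch: succeed if a module name appears in lsmod-style output on stdin.
# module_listed is a hypothetical helper name.
module_listed() {
    # $1 = module name as lsmod prints it, e.g. nvidia_uvm
    # Compare the first column exactly so nvidia does not match nvidia_uvm.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   lsmod | module_listed nvidia_uvm && echo "loaded" || echo "not loaded"
```

Matching the first column exactly avoids false positives from the "Used by" column, where other modules may mention the name.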

CSommer · July 22, 2022, 7:20am

Hi, sorry for the very long delay… I was assigned to another project for the last two weeks…

Based on lsmod, it really looks like nvidia-uvm is not loaded:

$ lsmod|grep nvidia
nvidia_drm             69632  2
nvidia_modeset       1142784  4 nvidia_drm
nvidia              40804352  130 nvidia_modeset
drm_kms_helper        307200  2 nvidia_drm,i915
nvidia_wmi_ec_backlight    16384  0
drm                   606208  12 drm_kms_helper,nvidia,nvidia_drm,i915,ttm
wmi                    32768  3 hp_wmi,nvidia_wmi_ec_backlight,wmi_bmof

Running nvidia-modprobe does not change anything.

The output from nvidia-bug-report.sh (run with sudo) is attached.

nvidia-bug-report.log (2.7 MB)

Thanks in advance for your input.

generix · July 22, 2022, 9:22am

It’s also visible in the logs that the module can’t be loaded:

systemd-udevd[473]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.

Please post the outputs of
sudo modinfo nvidia-uvm
sudo modprobe nvidia-uvm

CSommer · July 22, 2022, 9:31am

$ sudo modinfo nvidia-uvm
[sudo] password for lrmts: 
filename:       /lib/modules/5.15.0-41-generic/updates/dkms/nvidia-uvm.ko
supported:      external
license:        Dual MIT/GPL
srcversion:     47ABA39EF6732B7F0C672A2
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       5.15.0-41-generic SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         ubuntu Secure Boot Module Signature key
sig_key:        7B:90:F6:84:8E:3F:B4:11:FA:44:80:25:D8:10:52:9C:D3:46:4A:1A
sig_hashalgo:   sha512
signature:      19:03:54:BD:61:2A:66:5A:DD:05:0B:07:83:F8:E4:9D:A0:78:F3:C6:
		6E:AE:B3:23:8C:37:BA:3A:AE:D0:02:C1:A7:40:53:B4:F3:F7:A1:50:
		E4:6B:A0:FC:EE:21:80:65:82:90:6B:B9:DE:08:0F:F0:57:B4:E1:A2:
		B8:A7:CE:83:E9:57:DF:F8:5E:CB:D9:B8:7D:18:2F:45:99:FF:B3:F2:
		40:E4:80:F5:F9:55:E6:A6:44:44:13:1F:CC:27:E3:3C:8E:A3:3A:11:
		76:39:FC:4F:CB:F8:BC:EC:12:61:3F:5F:9A:F8:29:B5:62:E4:91:C6:
		9E:8A:58:30:C4:D5:AE:FE:E5:71:3C:7F:3B:8C:A1:9D:A5:6C:1E:D6:
		AA:35:08:10:B7:4F:D1:3F:E6:0A:DC:B9:27:F9:23:86:5C:93:FD:45:
		C8:6E:6D:5C:8E:8D:67:61:BA:FA:F9:93:6D:2D:EA:DD:DA:15:B6:0C:
		2C:75:28:F3:57:94:87:32:B0:43:D0:9A:0B:71:63:6C:94:62:38:D6:
		7B:0B:88:69:9B:DE:79:41:1C:EC:B8:B1:27:52:2B:AB:7B:41:7D:FF:
		EA:EF:34:68:22:32:CF:49:CF:F8:70:11:70:FE:2B:58:26:AA:49:21:
		F7:08:21:A5:37:DE:7B:D8:D2:31:0A:9E:7B:4C:3E:EE
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (i.e, HMM is potentially enabled). Ignored if HMM is not supported in the driver, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm:           uvm_debug_prints:Enable uvm debug prints. (int)
parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)

and

$ sudo modprobe nvidia-uvm
modprobe: ERROR: could not insert 'nvidia_uvm': Operation not permitted

That looks like a permission issue to me. I could understand it if I were running this as a non-root user, but with sudo I don’t really understand it…

generix · July 22, 2022, 9:35am

You have secure boot enabled but for some reason, the nvidia-uvm module doesn’t get signed on install.

generix · July 22, 2022, 9:36am

On second look, it seems to be signed.

CSommer · July 22, 2022, 9:42am

I tried to follow the installation procedure and there is a step where an outdated signing key is replaced (see Network repo installation).

But, for debugging purposes: do you think disabling Secure Boot would do the trick?

generix · July 22, 2022, 9:43am

Yes.
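
Before changing firmware settings, the current Secure Boot state can usually be read from the running system with `mokutil --sb-state`. The sketch below only classifies that command's text output. It assumes the output contains the string "SecureBoot enabled" or "SecureBoot disabled", and the helper name `secure_boot_state` is hypothetical.

```shell
#!/bin/sh
# Sketch: classify the text printed by `mokutil --sb-state` (read from stdin).
# Assumption: the output contains "SecureBoot enabled" or "SecureBoot disabled".
# secure_boot_state is a hypothetical helper name.
secure_boot_state() {
    out=$(cat)
    case "$out" in
        *"SecureBoot enabled"*)  echo "enabled" ;;
        *"SecureBoot disabled"*) echo "disabled" ;;
        *)                       echo "unknown" ;;
    esac
}

# Usage on a live system:
#   mokutil --sb-state | secure_boot_state
```

Anything other than the two expected strings is reported as "unknown" rather than guessed.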

CSommer · July 22, 2022, 10:02am

I can confirm that after disabling secure boot, mnistCUDNN & deviceQuery return PASS / Test passed.

I still get an error with OpenCV, but it’s a different message and is OpenCV-related, so it’s a different story and does not belong in this forum.

I’d still be interested in the cause of the issue, since I can’t see what I did wrong, but disabling Secure Boot solves the problem for the moment.

Thanks for your help!

generix · July 22, 2022, 10:07am

I’m also a bit puzzled: if the signing key were invalid, the other nvidia modules shouldn’t load either. Please check modinfo nvidia and compare the key fingerprints to make sure the same key was used.
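
The fingerprint comparison can be sketched by extracting the `sig_key:` field from each `modinfo` output and comparing the two values. The helper names `sig_key_of` and `same_sig_key` are hypothetical. The sketch assumes the field appears on a single line in the `sig_key:        <fingerprint>` form shown in the output above.

```shell
#!/bin/sh
# Sketch: compare the signing key fingerprints of two modules.
# sig_key_of and same_sig_key are hypothetical helper names.

# Print the sig_key value from modinfo-style text on stdin.
sig_key_of() {
    awk '$1 == "sig_key:" { print $2; exit }'
}

# Succeed only if both texts carry a sig_key and the two values are equal.
same_sig_key() {
    # $1, $2 = full modinfo output of the two modules
    a=$(printf '%s\n' "$1" | sig_key_of)
    b=$(printf '%s\n' "$2" | sig_key_of)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage on a live system:
#   same_sig_key "$(modinfo nvidia)" "$(modinfo nvidia-uvm)" \
#       && echo "same key" || echo "different or missing key"
```

An unsigned module has no `sig_key:` line, so the comparison fails in that case instead of treating two empty values as equal.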

CSommer · July 22, 2022, 10:11am

sig_id, signer & sig_key shown by modinfo are the same for nvidia & nvidia-uvm. Is there another parameter I should check?

generix · July 22, 2022, 10:44am

The modules are auto-signed by dkms with the key created when Ubuntu was initially installed, so there is nothing you could have done wrong. It may be better to report this to the Ubuntu bug tracker; I can’t really think of a reason for the uvm module being invalid. I’d expect that if modinfo displays the key and the keys are the same, it should work.

CSommer · July 22, 2022, 11:17am

Thanks for the additional feedback. I’ll try to bring this to the attention of the Ubuntu community then.

Looks like there have been similar issues with previous versions:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-470/+bug/1946312
I have added a comment with a link to this discussion as an attachment to this bug report.

Thanks again for your support!

system · August 5, 2022, 11:18am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
