"unknown error" from CUDA 11.7 (Ubuntu 22.04 64-bit)

Hello everybody,
After five support contacts told me they are not responsible for my problem, I'm trying again here:

In a nutshell:
CUDA reports “unknown error” when running the samples to test the installation.

The console output showing the problem, along with the output of nvcc, nvidia-smi and lsmod, is posted here:
https://forums.developer.nvidia.com/t/cuda-samples-not-working-possible-installation-mistakes-help-please/214615/2?u=csommer

The log from nvidia-bug-report.sh and the installation log are attached to support reference 220710-000470.
I can post them here too if you like, but they are quite long, so they would probably just clutter the post.

My intention is to use the cuDNN backend for OpenCV. OpenCV compiles without any problems and works up to the point where I try to use the CUDA backend.
OpenCV then throws the following error, which in my opinion is a symptom of a lower-level issue (also visible in the output linked above):

terminate called after throwing an instance of 'cv::dnn::cuda4dnn::csl::CUDAException'
  what():  OpenCV(4.6.0-dev) /home/lrmts/Downloads/OpenCV/opencv-4.x/modules/dnn/src/cuda4dnn/csl/memory.hpp:54: error: (-217:Gpu API call) unknown error in function 'ManagedPtr'

Any input on where the problem might be is highly appreciated.

Thanks!

nvidia-uvm isn’t loaded. Please either put it in the list of modules to load on boot, install nvidia-modprobe so normal users can load it, or run deviceQuery once as root to load it.
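
For anyone following along, here is a minimal sketch of how to check whether the module is loaded. It assumes a Linux system that exposes /proc/modules; the helper name module_loaded is made up for this example and is not part of any NVIDIA tool.

```shell
#!/bin/sh
# module_loaded NAME [FILE]: succeed if NAME is listed in FILE
# (default: /proc/modules). Hypothetical helper for illustration.
module_loaded() {
    # The kernel lists modules with underscores, so nvidia-uvm
    # shows up as nvidia_uvm.
    name=$(printf '%s' "$1" | sed 's/-/_/g')
    file=${2:-/proc/modules}
    # The first field of each line is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

if module_loaded nvidia-uvm; then
    echo "nvidia_uvm is loaded"
else
    echo "nvidia_uvm is NOT loaded"
fi
```

If the module is reported as not loaded, running sudo modprobe nvidia-uvm by hand shows the actual error message.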

Thanks for the quick response. So, if I understand correctly, I’d need to add nvidia-uvm to /etc/modules or install nvidia-modprobe, correct?

I have already run deviceQuery as root with the same result. I have also tried running nvidia-modprobe, but deviceQuery still has the same issue.

Please check if the nvidia-uvm module gets loaded. If it’s there, please create an nvidia-bug-report.log and attach it.

Hi, sorry for the very long delay… I was assigned to another project for the last two weeks…

Based on lsmod, it really looks like nvidia-uvm is not loaded:

$ lsmod|grep nvidia
nvidia_drm             69632  2
nvidia_modeset       1142784  4 nvidia_drm
nvidia              40804352  130 nvidia_modeset
drm_kms_helper        307200  2 nvidia_drm,i915
nvidia_wmi_ec_backlight    16384  0
drm                   606208  12 drm_kms_helper,nvidia,nvidia_drm,i915,ttm
wmi                    32768  3 hp_wmi,nvidia_wmi_ec_backlight,wmi_bmof

Running nvidia-modprobe does not change anything.

The output from nvidia-bug-report.sh (run with sudo) is attached.

nvidia-bug-report.log (2.7 MB)

Thanks in advance for your input.

It’s also visible in the logs that the module can’t be loaded:

systemd-udevd[473]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
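
A rough sketch of how a line like this can be located on your own machine. It assumes a systemd-based system with journalctl available; the helper name failed_modprobes is made up for this example.

```shell
#!/bin/sh
# failed_modprobes: print the lines on stdin that report a failed
# modprobe call, in the format systemd-udevd uses above.
# Hypothetical helper for illustration.
failed_modprobes() {
    grep -E "modprobe .*failed with exit code"
}

# Search the current boot's journal, if journalctl is installed.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -b --no-pager 2>/dev/null | failed_modprobes \
        || echo "no failed modprobe calls found in this boot's journal"
fi
```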

Please post the output of
sudo modinfo nvidia-uvm
sudo modprobe nvidia-uvm

$ sudo modinfo nvidia-uvm
[sudo] password for lrmts: 
filename:       /lib/modules/5.15.0-41-generic/updates/dkms/nvidia-uvm.ko
supported:      external
license:        Dual MIT/GPL
srcversion:     47ABA39EF6732B7F0C672A2
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       5.15.0-41-generic SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         ubuntu Secure Boot Module Signature key
sig_key:        7B:90:F6:84:8E:3F:B4:11:FA:44:80:25:D8:10:52:9C:D3:46:4A:1A
sig_hashalgo:   sha512
signature:      19:03:54:BD:61:2A:66:5A:DD:05:0B:07:83:F8:E4:9D:A0:78:F3:C6:
		6E:AE:B3:23:8C:37:BA:3A:AE:D0:02:C1:A7:40:53:B4:F3:F7:A1:50:
		E4:6B:A0:FC:EE:21:80:65:82:90:6B:B9:DE:08:0F:F0:57:B4:E1:A2:
		B8:A7:CE:83:E9:57:DF:F8:5E:CB:D9:B8:7D:18:2F:45:99:FF:B3:F2:
		40:E4:80:F5:F9:55:E6:A6:44:44:13:1F:CC:27:E3:3C:8E:A3:3A:11:
		76:39:FC:4F:CB:F8:BC:EC:12:61:3F:5F:9A:F8:29:B5:62:E4:91:C6:
		9E:8A:58:30:C4:D5:AE:FE:E5:71:3C:7F:3B:8C:A1:9D:A5:6C:1E:D6:
		AA:35:08:10:B7:4F:D1:3F:E6:0A:DC:B9:27:F9:23:86:5C:93:FD:45:
		C8:6E:6D:5C:8E:8D:67:61:BA:FA:F9:93:6D:2D:EA:DD:DA:15:B6:0C:
		2C:75:28:F3:57:94:87:32:B0:43:D0:9A:0B:71:63:6C:94:62:38:D6:
		7B:0B:88:69:9B:DE:79:41:1C:EC:B8:B1:27:52:2B:AB:7B:41:7D:FF:
		EA:EF:34:68:22:32:CF:49:CF:F8:70:11:70:FE:2B:58:26:AA:49:21:
		F7:08:21:A5:37:DE:7B:D8:D2:31:0A:9E:7B:4C:3E:EE
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (i.e, HMM is potentially enabled). Ignored if HMM is not supported in the driver, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm:           uvm_debug_prints:Enable uvm debug prints. (int)
parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)

and

$ sudo modprobe nvidia-uvm
modprobe: ERROR: could not insert 'nvidia_uvm': Operation not permitted

So it looks like a permission issue to me. I could understand it if I had run this as a non-root user, but since I ran it with sudo I don’t really understand…

You have secure boot enabled but for some reason, the nvidia-uvm module doesn’t get signed on install.
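
A minimal sketch for checking the Secure Boot state from a running system. It assumes the mokutil package is installed; the helper name secure_boot_enabled is made up for this example, and it only parses the "SecureBoot enabled" line that mokutil --sb-state prints.

```shell
#!/bin/sh
# secure_boot_enabled: read `mokutil --sb-state` output on stdin and
# succeed if it reports Secure Boot as enabled.
# Hypothetical helper for illustration.
secure_boot_enabled() {
    grep -q '^SecureBoot enabled'
}

# Only query the firmware state if mokutil is installed.
if command -v mokutil >/dev/null 2>&1; then
    if mokutil --sb-state 2>/dev/null | secure_boot_enabled; then
        echo "Secure Boot is enabled: modules need a signature the kernel trusts"
    else
        echo "Secure Boot is not reported as enabled"
    fi
fi
```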

On second look, it seems to be signed.

I tried to follow the installation procedure and there is a step where an outdated signing key is replaced (see Network repo installation).

But for debugging purposes: do you think disabling Secure Boot would do the trick?

Yes.

I can confirm that after disabling secure boot, mnistCUDNN & deviceQuery return PASS / Test passed.

I still have an error with OpenCV, but it’s a different message and is OpenCV-related, so it’s a different story and does not belong in this forum.

I’d still be interested in the cause of the issue, since I can’t see what I did wrong, but disabling Secure Boot solves the problem for the moment.

Thanks for your help!

I’m also a bit puzzled: if the signing key were invalid, the other nvidia modules shouldn’t load either. Please check modinfo nvidia and compare the key fingerprints to make sure the same key was used.

sig_id, signer & sig_key shown by modinfo are the same for nvidia & nvidia-uvm. Is there another parameter I should check?
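
The comparison can be scripted roughly as follows. This is only a sketch: the helper names sig_key and same_sig_key are made up for this example, and it relies on the sig_key: line that modinfo prints for signed modules, as in the output above.

```shell
#!/bin/sh
# sig_key: print the value of the sig_key field from `modinfo`
# output read on stdin. Hypothetical helper for illustration.
sig_key() {
    awk '$1 == "sig_key:" { print $2 }'
}

# same_sig_key A B: succeed if modules A and B both report the same,
# non-empty sig_key.
same_sig_key() {
    a=$(modinfo "$1" 2>/dev/null | sig_key)
    b=$(modinfo "$2" 2>/dev/null | sig_key)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

if same_sig_key nvidia nvidia-uvm; then
    echo "nvidia and nvidia-uvm report the same sig_key"
else
    echo "sig_key differs or could not be read"
fi
```

A matching sig_key only shows that both modules were signed with the same key. It does not show whether the kernel accepts the signature itself.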

The modules are auto-signed by dkms with the key created when Ubuntu was initially installed, so there was nothing you could have done wrong. It may be better to report this to the Ubuntu bug tracker; I can’t really think of a reason for the uvm module being invalid. I’d expect that if modinfo displays the key and the keys are the same, it should work.

Thanks for the additional feedback. I’ll try to bring this to the attention of the Ubuntu community then.

Looks like there have been similar issues with previous versions:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-470/+bug/1946312
I have added a comment with a link to this discussion as an attachment to this bug report.

Thanks again for your support!
