Hi @Robert_Crovella, thanks for your quick answer.
I followed the link you mentioned and tried mmap(2) with MAP_LOCKED (in addition to MAP_SHARED). Unfortunately, this does not change the situation.
However, for whatever reason, today I’m seeing a massive amount of dump/BUG messages via dmesg, which I must have missed before (or which did not appear the other day, I can’t say). Here are the first few lines (same with and without MAP_LOCKED):
[ 125.924304] BUG: unable to handle page fault for address: 00000000e0000000
[ 125.924307] #PF: supervisor read access in kernel mode
[ 125.924308] #PF: error_code(0x0000) - not-present page
[ 125.924308] PGD 449101067 P4D 449101067 PUD 0
[ 125.924310] Oops: 0000 [#1] SMP NOPTI
[ 125.924312] CPU: 0 PID: 2410 Comm: bandwidthTest Tainted: P OE 5.4.0-74-generic #83~18.04.1-Ubuntu
[ 125.924313] Hardware name: System manufacturer System Product Name/PRIME A320I-K, BIOS 1820 09/12/2019
[ 125.924442] RIP: 0010:os_lookup_user_io_memory+0x3a5/0x430 [nvidia]
[ 125.924446] Code: 0f 88 8f fd ff ff 48 8b 4d a0 48 8b 45 c8 4c 8b 45 98 48 c1 e0 0c 48 85 c9 48 8d 34 cd 00 00 00 00 49 89 04 c8 74 18 49 8b 10 <48> 8b 44 32 f8 48 05 00 10 00 00 48 39 04 ca 0f 85 56 fd ff ff 48
So I guess the whole chain of calls originating at cudaHostRegister() ends up hitting a problem in the nvidia kernel module, in the function os_lookup_user_io_memory().
Note that the address mentioned in the first line, 0x00000000e0000000, is the physical address of the BAR of the custom PCIe device in question.
A search for os_lookup_user_io_memory() reveals this recent thread, which seems to report an error in this very function. The description sounds quite similar to my problem here. So I took the patch provided in said thread and recompiled the nvidia kernel module (version 465.19.01, from CUDA Toolkit 11.3.1) … and: success! cudaHostRegister() no longer hangs the machine.
So it would be nice for the “internal bug number 3280454” to be resolved in a future driver version. :-)
Just FYI: in its chardev’s mmap() routine, the PCI driver for the custom PCIe device is basically doing:
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); /* uncached access to the MMIO region */
vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;    /* mark as an I/O mapping */
io_remap_pfn_range(.....)                                /* map the BAR pages into userspace */
which does not look totally wrong, I think. As I said, access from the CPU/userspace through such a mapping does work.
Another side note: I’m not sure about the hint (in your link) that the CUDA context has to be created with the cudaMapHost flag set. As far as I understand, the CUDA runtime API hides context creation from the developer. So, if I understand correctly, “host mapping” is not possible out of the box with the runtime API? Then why does the runtime API offer these functions (cudaHostRegister()) without saying that the user has to do something special (manual context creation) to actually be able to use them?
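For what it’s worth, my current understanding (untested, and the device path and BAR size below are made up) is that with the runtime API the equivalent of the context flag would be cudaSetDeviceFlags(cudaDeviceMapHost), called before the first runtime call that creates the implicit context, roughly:

```cuda
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Must happen before the first call that creates the implicit
     * context; presumably the runtime-API counterpart of creating
     * the context with the host-mapping flag. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Hypothetical device node and BAR size -- adjust for the real device. */
    int fd = open("/dev/my_pcie_dev", O_RDWR);
    size_t barSize = 1 << 20;
    void *bar = mmap(NULL, barSize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* cudaHostRegisterIoMemory is the flag intended for MMIO regions
     * such as a BAR. */
    cudaError_t err = cudaHostRegister(bar, barSize,
                                       cudaHostRegisterIoMemory);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));

    void *devPtr = NULL;
    cudaHostGetDevicePointer(&devPtr, bar, 0);
    /* devPtr could now be passed to kernels. */

    cudaHostUnregister(bar);
    munmap(bar, barSize);
    close(fd);
    return 0;
}
```

If that’s the intended way, it would be nice if the cudaHostRegister() documentation pointed to it.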