cudaHostRegister() and cudaHostRegisterIoMemory for accessing a 3rd-party PCIe device

On Linux with CUDA 11.3.1, I would like to let the GPU access some memory-mapped I/O space belonging to a custom PCIe device. The documentation for the CUDA Runtime API seems to suggest using cudaHostRegister() with the cudaHostRegisterIoMemory flag to make the space known to the GPU.

However, the call to cudaHostRegister() just blocks, and a few seconds later the whole machine hosting the GPU and the custom PCIe device freezes (a hard reset/power cycle is required).

The memory-mapped I/O space is obtained via mmap(2) from a char device offered by the PCI driver for the PCIe device in question. As far as I can tell, the driver builds a proper VMA: accessing such a mmap(2)'ed I/O space from the CPU is not a problem and works just fine.
The machine is running with IOMMU turned off and P2P PCIe should work on the system in question (AMD Zen+, Ryzen 3 3200G) if I’m not mistaken.
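
Roughly, the sequence looks like this (device path, BAR size, and names are placeholders, not the actual code; error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

#define BAR_SIZE (1UL << 20)   /* placeholder: the real BAR size differs */

int main(void)
{
    /* char device exposed by the PCI driver (placeholder path) */
    int fd = open("/dev/my_pcie_dev", O_RDWR);

    /* CPU reads/writes through this mapping work just fine */
    void *bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* this is the call that blocks and eventually freezes the machine */
    cudaError_t err = cudaHostRegister(bar, BAR_SIZE, cudaHostRegisterIoMemory);

    return err == cudaSuccess ? 0 : 1;
}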

What’s the best way to debug this? Is there anything I’m missing?

cudaHostRegister is mostly a thin wrapper around OS calls. You can use strace and similar tools to watch what is going on. If I had to guess, I would guess that this issue is a bad choice of flags passed to your initial mmap call, or of flags passed to the cudaHostRegister call. But I can't give you a recipe. There may be other things I don't know about your I/O device that are relevant. This may be of interest, although it's not identical to your case (and is a bit murky, also).


Hi @Robert_Crovella, thanks for your quick answer.

I followed the link you mentioned and tried mmap(2) with MAP_LOCKED (in addition to MAP_SHARED). Unfortunately, this does not change the situation.
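
I.e., the mapping line from the sketch above becomes:

void *bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_LOCKED, fd, 0);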

However, for whatever reason, I'm seeing a massive amount of dump/BUG messages via dmesg today, which I must have missed before (or which did not appear the other day, I can't say). Here are the first few lines (the same with and without MAP_LOCKED):

[  125.924304] BUG: unable to handle page fault for address: 00000000e0000000
[  125.924307] #PF: supervisor read access in kernel mode
[  125.924308] #PF: error_code(0x0000) - not-present page
[  125.924308] PGD 449101067 P4D 449101067 PUD 0 
[  125.924310] Oops: 0000 [#1] SMP NOPTI
[  125.924312] CPU: 0 PID: 2410 Comm: bandwidthTest Tainted: P           OE     5.4.0-74-generic #83~18.04.1-Ubuntu
[  125.924313] Hardware name: System manufacturer System Product Name/PRIME A320I-K, BIOS 1820 09/12/2019
[  125.924442] RIP: 0010:os_lookup_user_io_memory+0x3a5/0x430 [nvidia]
[  125.924446] Code: 0f 88 8f fd ff ff 48 8b 4d a0 48 8b 45 c8 4c 8b 45 98 48 c1 e0 0c 48 85 c9 48 8d 34 cd 00 00 00 00 49 89 04 c8 74 18 49 8b 10 <48> 8b 44 32 f8 48 05 00 10 00 00 48 39 04 ca 0f 85 56 fd ff ff 48

So I guess the whole chain of calls originating at cudaHostRegister() ends up hitting a problem in the nvidia kernel module, in the function os_lookup_user_io_memory().
Note that the address mentioned in the first line, 0xe0000000, is the physical address of the BAR of the custom PCIe device in question.

A search for os_lookup_user_io_memory() reveals this recent thread, which seems to report an error in this very function. The description sounds quite similar to my problem here. So I took the patch provided in that thread and recompiled the nvidia kernel module (version 465.19.01 from the CUDA Toolkit 11.3.1) … and: success! cudaHostRegister() no longer hangs the machine.

So it would be nice for the “internal bug number 3280454” to be resolved in a future driver version. :-)

Just FYI: The PCI driver for the custom PCIe device is basically doing:

vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
io_remap_pfn_range(.....)

in it’s chardev’s mmap() routine. Which does not look total wrong, I think, … As I said, access from the CPU/userspace to such a mmap does work.

Another side note: I'm not sure about the hint (in your link) of having to create the CUDA context with the CU_CTX_MAP_HOST / cudaDeviceMapHost flag set. As far as I understand, the CUDA Runtime API hides context creation from the developer. So, if I understand correctly, “host mapping” is not possible out of the box with the Runtime API? Then why does the Runtime API offer these functions (cudaHostRegister()) without saying that the user has to do something special (manual context creation) to really be able to use them?


That is in progress; it will happen. I expect weeks or months, not years.

Of course, I have not tested a fixed driver against your use-case, but the work you’ve done certainly suggests some confidence.

I mentioned that things were a bit “murky” there. I would ignore that particular comment, w.r.t. the runtime API, for the reasons you’ve already stated. Of course you can do host mapping with the runtime API.
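
For reference, roughly the runtime-API-only pattern (not tested against your device; the exact flag combination and names here are just a sketch):

#include <stddef.h>
#include <cuda_runtime.h>

void *map_io_for_device(void *hptr, size_t size)
{
    void *dptr = NULL;

    /* must run before anything else initializes the runtime's context */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostRegister(hptr, size, cudaHostRegisterIoMemory | cudaHostRegisterMapped);
    cudaHostGetDevicePointer(&dptr, hptr, 0);   /* device-visible alias of hptr */

    return dptr;
}

No explicit context creation is needed; the flag just has to be set before the runtime initializes its primary context.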