Strange behaviour of cudaHostRegister

Hello. I am trying to register some memory with CUDA.

The memory is initialised in our Linux driver with multiple calls to alloc_pages. These memory pages are then marked for use with DMA transactions with the dma_map_page function. Our FPGA will fill this memory with data when an IOCTL of the driver is called. In userspace, we use mmap to access the allocated memory.
In practice, we allocate 8 of these memories and use them as a ring buffer.

We are processing this data with an RTX 3090Ti. I want to realise better performance by registering the driver-allocated memory with the CUDA runtime API function cudaHostRegister. Unfortunately, it seems that using this function works only for the first invocation of my userspace application. I’m using it something like this:

// mmap to access kernel memory buffers
dma_buffers_ptr = (char *)mmap(NULL, dma_buffer_length_, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

// Eventually, we have a list of ptrs to the kernel's 8 buffers from the mmap interface
std::vector<char *> buf_ptrs = ...

// We can also derive the size of each buffer
std::size_t buf_siz = ...

// Register the memory
for (int i=0; i < 8; i++)
    cudaHostRegister((void *)buf_ptrs[i], buf_siz, cudaHostRegisterIoMemory);

I unregister and deallocate the memory properly.

As mentioned above, my full program will work once as expected, and only with sudo privileges (trying without leads to cudaHostRegister failing with the error string “operation not permitted”). Attempting to re-run the program with sudo leads to cudaHostRegister failing with the error string “invalid argument”.

  1. How can I make cudaHostRegister work for multiple invocations of my program? Currently, I have to reboot my system in order to allow the program to run again.
  2. Is it possible to make cudaHostRegister work without sudo privileges?

Any help regarding this issue would be most appreciated. Thanks in advance.

The typical use case for cudaHostRegister() is to work on an allocation made in user space in the user’s own process, via e.g. new or malloc. In those situations, sudo privileges are not needed.

Beyond that I don’t have any comments about the rest.

No errors are returned with unregister?
Have you confirmed that it is the register calls? And not mmap?

No errors are returned by any call to cudaHostUnregister. Same with mmap. The application is failing when I try to call cudaHostRegister.

Would you say that using cudaHostRegister in the manner I’ve described is not an expected use? Is it explicitly not recommended to use this function with kernel memory?

Does it also fail within a single run of the program, if you mmap, later munmap, and then mmap again, with optional cudaHostRegister and cudaHostUnregister calls in between? How does that compare to restarting the program?

@Curefab Yes, if I mmap, then munmap, then mmap again within the same run of the program, the same failure mode is observed. I tried including the PROT_EXEC flag within the mmap call but it didn’t change anything.

Does it make a difference whether you call cudaHostRegister and cudaHostUnregister after the first mmap?

@Curefab I just rewrote the user-space application to focus on allocating the memory in kernel space and pinning it to the GPU using cudaHostRegister. This is an update of what I observe including things you have suggested to try:
Scenario 1) Allocate buffers in driver, mmap buffers in userspace, cudaHostRegister each buffer: The first time the program runs, everything works as expected. The next time, everything fails. The third time, the first 6 of the 8 buffers can be cudaHostRegistered. After that, each time the program runs, it alternates between nothing getting cudaHostRegistered and 6/8 buffers working.

Scenario 2) Allocate buffers in driver, mmap to userspace, cudaHostRegister the buffers, then munmap and mmap again, then try cudaHostRegister again on the buffers: Same sort of scenario as above. The first try works, the next one doesn’t, on the third one the first 6 out of 8 buffers get registered, on the fourth nothing, and then it alternates between 6/8 buffers working and none working.

If I repeat just one of the two steps (only the mmap, or only the cudaHostRegister), I see the same behaviour of alternating between 6/8 buffers and none getting registered.

I’m not sure what insight can be drawn from these tests. It’s interesting that 6 buffers can be registered on every other attempt at using the program. It’s also interesting that every other run of the program will not have any buffers getting registered. Any insight you might have would be really valuable at this stage.

What are the sizes and physical addresses of the buffers (in relation to small 4K and huge 4M pages)? Which six work sometimes? Do different ones work if you change the order in which the buffers are registered and unregistered? How does the program react if you only register and unregister the first buffer? Is it still alternating?

I would try to:

Scenario 1b) Allocate buffers in driver, mmap buffers in userspace, cudaHostRegister each buffer, cudaHostUnregister all, munmap, and loop the program without restarting the process.

Scenario 2b) Allocate buffers in driver, mmap to userspace, then munmap and mmap again, then try cudaHostRegister on the buffers. Does the first try work?

Firstly, a correction. If the program can register any buffer, it will fail on the first and last buffer of the 8, but succeed on the rest. So, the 1st and 8th buffers are never getting registered.

  1. The physical memory pages are all 4k-aligned.
  2. Trying to register different buffers, or just one buffer, doesn’t change the behaviour I described. If the program can register any of the buffers, it always fails on the first and last one, and is successful with the rest.

Second, I tried your suggested scenarios:
Scenario 1b) The first time I ran the program, nothing worked (I repeated the main program loop twice). The second time I ran the program, both loops of the program managed to register buffers 2 to 7, as I observed before. So, the same overall pattern is there, where each program invocation alternates between buffers 2 through 7 getting registered, and nothing getting registered.
Scenario 2b) The same pattern is occurring, where each program invocation alternates between buffers 2 through 7 getting registered, and nothing getting registered.
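
For anyone wanting to double-check the alignment point from user space, here is a minimal sketch. It only looks at the virtual address returned by mmap and at the length that would be passed to cudaHostRegister; it says nothing about physical addresses, which would have to be inspected in the driver. The helper name is_page_aligned is made up for illustration, and the page size is passed in by the caller (on Linux it could come from sysconf(_SC_PAGESIZE)).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of CUDA or of the original program):
// returns true if both the start address and the length of a mapped
// region are multiples of page_size.
inline bool is_page_aligned(const void* ptr, std::size_t length,
                            std::size_t page_size) {
    assert(page_size > 0);
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    return addr % page_size == 0 && length % page_size == 0;
}
```

Calling this on each entry of buf_ptrs with buf_siz before registering would at least rule out a misaligned pointer or a length that is not a whole number of pages.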

I had an idea and followed it up, and it seems I’ve made a breakthrough. For each of the 8 buffers I’m referring to in user-space, there are 16 pages allocated in the driver using alloc_pages. So, I’ve tried running cudaHostRegister on each of the 128 pages. The kernel pages are contiguous when mmaped into userspace. It seems that this works: I can register all 128 pages with CUDA consistently, on each invocation of the program. I’ll report back when I’ve confirmed this works in the original program.
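
The per-page approach described above could be sketched roughly as follows. This is an illustration under assumptions, not the code from the original program: register_per_page is a made-up helper, the chunk size is supplied by the caller, and the actual registration is passed in as a callback so the loop can be exercised without a GPU. In a real program the callback would be expected to wrap cudaHostRegister(ptr, size, flags) and return whether the result was cudaSuccess.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: walk the mapped region [base, base + length) in
// chunks of page_size bytes and call register_fn(chunk_ptr, page_size)
// on each chunk. Returns the byte offsets of the chunks for which
// register_fn returned false, so the caller can see which ones failed.
template <typename RegisterFn>
std::vector<std::size_t> register_per_page(char* base, std::size_t length,
                                           std::size_t page_size,
                                           RegisterFn register_fn) {
    // The region is assumed to be a whole number of pages.
    assert(page_size > 0 && length % page_size == 0);

    std::vector<std::size_t> failed_offsets;
    for (std::size_t off = 0; off < length; off += page_size) {
        // In real use, register_fn would call something like
        //   cudaHostRegister(ptr, size, cudaHostRegisterIoMemory)
        // and compare the returned cudaError_t against cudaSuccess.
        if (!register_fn(static_cast<void*>(base + off), page_size)) {
            failed_offsets.push_back(off);
        }
    }
    return failed_offsets;
}
```

With 8 buffers of 16 pages each mapped contiguously, one call over the whole mapping would invoke the callback 128 times. Unregistering would presumably follow the same loop, calling cudaHostUnregister on each chunk pointer.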

@spez_1998 I am happy that you seem to have found a way. Your results looked really strange, and it was difficult to come up with a theory that would explain them. Looking forward to your report on how it works in the original program.

I can confirm that calling cudaHostRegister on each kernel page that’s mmaped into user-space code works successfully. I can cudaMemcpy all 128 buffers over to my GPU, so I assume they may also be processed and cudaMemcpyed back.
Thanks @Curefab for your tips and @Robert_Crovella also!

That is good news!

For other people with the same issue: Can you now copy a whole buffer (with 16 pages) with one call of cudaMemcpy, or do you have to copy each of the 16 individual 4 KB pages separately?