Issues porting desktop RDMA app to Tegra: mmap hangs kernel

Hi,

I’m having some trouble porting my GPU RDMA application to Tegra (on a Xavier AGX). With a discrete GPU, I was successfully allocating GPU memory using cuMemAlloc and passing it to the kernel, where I pin the memory (nvidia_p2p_get_pages) and make it accessible to the device I want to DMA to/from (nvidia_p2p_dma_map_pages). I then use those DMA addresses (dma_mapping->dma_addresses[page]) with my device’s DMA engine, while also using the GPU memory’s physical addresses (page_table->pages[page]->physical_address) to set up a userspace mapping for application compatibility reasons.
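
Roughly, the desktop-side kernel loop looks like this (a simplified sketch rather than my exact code; the array names are placeholders for my driver’s own bookkeeping):

#include <linux/types.h>
#include "nv-p2p.h"

/* Desktop sketch: one entry per GPU page in both structures. */
static void collect_addresses(struct nvidia_p2p_page_table *page_table,
                              struct nvidia_p2p_dma_mapping *dma_mapping,
                              dma_addr_t *dma_addrs, u64 *mmap_phys)
{
    u32 i;

    for (i = 0; i < page_table->entries; i++) {
        /* bus address programmed into the device's DMA engine */
        dma_addrs[i] = dma_mapping->dma_addresses[i];
        /* physical address later handed to remap_pfn_range() in mmap */
        mmap_phys[i] = page_table->pages[i]->physical_address;
    }
}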

To port this to Tegra, I followed the docs in changing the allocation to a cuMemAllocHost and adapting my use of the RDMA APIs. I couldn’t seem to access the pages’ physical addresses anymore, so I’m using dma_to_phys to convert the handles (now in dma_mapping->hw_address[page]) to physical addresses; these addresses look OK (identical to the DMA handles, but I assume that’s to be expected). However, passing these physical addresses to remap_pfn_range when setting up the mmap instantly hangs the kernel without any debug message. The documentation doesn’t mention any incompatibility, only that vm_page_prot should be adjusted when using cudaHostAllocWriteCombined, which I’m not using.
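
The mmap side on Tegra then boils down to something like this (a simplified sketch, not my exact code; struct my_buf is a hypothetical stand-in for my driver’s per-buffer state):

#include <linux/fs.h>
#include <linux/mm.h>

struct my_buf {
    phys_addr_t phys;   /* result of dma_to_phys(dev, dma_mapping->hw_address[i]) */
};

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct my_buf *buf = filp->private_data;
    unsigned long size = vma->vm_end - vma->vm_start;

    /* handing the dma_to_phys() result to remap_pfn_range() is what hangs the kernel */
    return remap_pfn_range(vma, vma->vm_start, buf->phys >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}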

Any thoughts on what might be the issue here?

Hi,

Below is an example for RDMA on Jetson.
Would you mind checking it to see if it can meet your requirement?

Thanks.

Yes, I’m already looking at that example code (which was very handy in porting my use of the RDMA APIs). However, my problem isn’t with the RDMA APIs, it’s with the mmap I do on the physical addresses I get out of it (see xtrx_julia/main.c at 07e7985d0c8a3a5ceecd7d794e8fd075941f87e5 · JuliaComputing/xtrx_julia · GitHub). The example code doesn’t do any userspace mapping.

Hi,

Would you mind sharing the source to reproduce this issue in our environment?
We want to check this further to see if any update is required, since the APIs on the host and on Jetson are slightly different.

Thanks.

It’s not trivial to isolate the code to something that can be executed without our target device. I can try though.

In the meantime, I’ve stumbled on an issue that might explain the crashes. For a given GPU allocation from user space, I need both the DMA bus addresses for use with my device and the physical addresses for use with mmap. On non-Tegra hardware, I iterate the entries from nvidia_p2p_get_pages and nvidia_p2p_dma_map_pages together, since I always seem to get the same number of entries. I can then easily get the DMA address by looking at dma_mapping->dma_addresses[...] and the physical address from page_table->pages[...]->physical_address.

On Tegra, I’m not getting the same number of entries in the page table and DMA mapping (e.g., for a 4MB allocation I get 1024 page table entries and 2 dma mapping entries). However, I’m not sure how to get the correct DMA handles and physical addresses from either. I’ve tried two approaches:

  • iterate the page table, get phys_addr using page_to_phys, get dma_addr using phys_to_dma
  • iterate the dma mapping, get dma_addr directly from hw_address, get phys_addr using dma_to_phys
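
In code, the two approaches look roughly like this (a simplified sketch; dev is my device’s struct device, and the pr_info lines are just for illustration):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/dma-direct.h>   /* phys_to_dma()/dma_to_phys(); header location varies by kernel version */
#include "nv-p2p.h"             /* Jetson version of the header */

static void dump_addresses(struct device *dev,
                           struct nvidia_p2p_page_table *page_table,
                           struct nvidia_p2p_dma_mapping *dma_mapping)
{
    u32 i;

    /* approach 1: iterate the page table (1024 entries for a 4MB allocation) */
    for (i = 0; i < page_table->entries; i++) {
        phys_addr_t phys = page_to_phys(page_table->pages[i]);
        dma_addr_t dma = phys_to_dma(dev, phys);

        pr_info("pt[%u]: phys=%pa dma=%pad\n", i, &phys, &dma);
    }

    /* approach 2: iterate the DMA mapping (only 2 entries for the same allocation) */
    for (i = 0; i < dma_mapping->entries; i++) {
        dma_addr_t dma = dma_mapping->hw_address[i];
        phys_addr_t phys = dma_to_phys(dev, dma);

        pr_info("dm[%u]: phys=%pa dma=%pad\n", i, &phys, &dma);
    }
}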

Both approaches yield different addresses, although within each approach the DMA addresses and physical addresses are always equal to each other. The fact that there is a difference means that I’m probably doing something wrong, and using the wrong addresses with mmap is likely what crashes the system.

Any thoughts? What’s the correct way of getting valid DMA bus addresses as well as physical addresses for use with mmap out of the NVIDIA P2P APIs?

@AastaLLL Any thoughts on my latest post?

Hi,

Sorry for the late update.
We are checking this internally and will share more information with you soon.

Thanks.

Hi,

Thanks for your patience.
Here are some of the suggestions:

1.

On Jetson, the input physical address for io_remap_pfn_range should be derived from the struct page.
Please ensure to compile the sources with the Jetson version of nv-p2p.h.

Something like:

#if (defined(CONFIG_ARM64) && defined(CONFIG_ARCH_TEGRA))
    /* Jetson: page_table->pages[] holds plain struct page pointers */
    dmachan->reader_addr[j] = page_to_pfn((struct page*)nvp);
#else
    /* desktop: page_table->pages[] holds nvidia_p2p_page with a physical_address field */
    dmachan->reader_addr[j] = (uint32_t*)(nvp->physical_address + offset);
#endif

It looks like the nv-p2p.h header has different page_table struct fields for desktop and Jetson:

Desktop:
typedef
struct nvidia_p2p_page {
    uint64_t physical_address;
    union nvidia_p2p_request_registers {
        struct {
            uint32_t wreqmb_h;
            uint32_t rreqmb_h;
            uint32_t rreqmb_0;
            uint32_t reserved[3];
        } fermi;
    } registers;
} nvidia_p2p_page_t;

typedef
struct nvidia_p2p_page_table {
    uint32_t version;
    uint32_t page_size; /* enum nvidia_p2p_page_size_type */
    struct nvidia_p2p_page **pages;
    uint32_t entries;
    uint8_t *gpu_uuid;
} nvidia_p2p_page_table_t;

Jetson:
typedef struct nvidia_p2p_page_table {
    u32 version;
    u32 page_size;
    u64 size;
    u32 entries;
    struct page **pages;
  
    u64 vaddr;
    u32 mapped;
  
    struct mm_struct *mm;
    struct mmu_notifier mn;
    struct mutex lock;
    void (*free_callback)(void *data);
    void *data;
} nvidia_p2p_page_table_t;

2.

Ensure to follow “Modification to Kernel API” from https://developer.nvidia.com/blog/gpudirect-rdma-nvidia-jetson-agx-xavier/.
The mapping size should be a multiple of 4K, and there is a write-combine requirement while remapping.
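
For example, something like this in the mmap path (a rough sketch; write_combined is a placeholder flag for whether the buffer was allocated with cudaHostAllocWriteCombined):

#include <linux/mm.h>

static int my_remap(struct vm_area_struct *vma, unsigned long pfn,
                    size_t size, bool write_combined)
{
    size = PAGE_ALIGN(size);    /* mapping size must be a multiple of 4K */

    if (write_combined)         /* only needed for write-combined allocations */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
}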

Thanks.

Thanks for your reply.

I’ve read that blog post and have adapted my code accordingly (using cuMemAllocHost, and not setting write-combined, so I don’t have to do anything special while remapping).

I need the physical address, not the PFN, since I correct for that when mapping (shifting addresses by PAGE_SHIFT). But since you mention page_to_pfn here, I understand that my use of page_to_phys is correct, and that I shouldn’t be doing the inverse (recovering the physical address from the hw_address in the dma mapping). I’m left wondering then if and how I should use the dma mapping from nvidia_p2p_dma_map_pages, which gives me far fewer entries than the page table contains (see previous post). Is this API not supported on Tegra, and in fact, why should I ever use it instead of just doing phys_to_dma on the physical addresses from the page table?

Thanks for your reply.

We are checking this with our internal team.
Will share more information with you later.

Hi,

Below is some advice for you.

Please do this to get the physical address:

#if (defined(CONFIG_ARM64) && defined(CONFIG_ARCH_TEGRA))
    /* Jetson: pages[] holds struct page pointers; shift the PFN to get a physical address */
    dmachan->reader_addr[j] = (uint32_t*)((page_to_pfn((struct page*)nvp) << PAGE_SHIFT) + offset);
#else
    /* desktop: pages[] holds nvidia_p2p_page with a physical_address field */
    dmachan->reader_addr[j] = (uint32_t*)(nvp->physical_address + offset);
#endif

Note the differences in the nvidia_p2p_page_table_t for desktop and Jetson.
Check nv-p2p.h for both. On Jetson, there is no struct nvidia_p2p_page.

Physical addresses and DMA addresses may not map one-to-one.
It may be that the IOMMU is mapping multiple physical address ranges to a single contiguous DMA range.
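
On the device side, that means each entry in the DMA mapping describes one contiguous segment, so program the DMA engine from the (address, length) pairs instead of assuming one entry per page. Something like (a rough sketch; queue_dma_segment is just a placeholder for your DMA engine setup, and this assumes your Jetson nv-p2p.h has the hw_len array next to hw_address):

#include <linux/types.h>
#include "nv-p2p.h"   /* Jetson version */

void queue_dma_segment(dma_addr_t addr, u32 len);   /* placeholder */

static void program_dma(struct nvidia_p2p_dma_mapping *dma_mapping)
{
    u32 i;

    for (i = 0; i < dma_mapping->entries; i++) {
        dma_addr_t addr = dma_mapping->hw_address[i];   /* IOVA as seen by the device */
        u32 len = dma_mapping->hw_len[i];               /* length of this segment */

        queue_dma_segment(addr, len);
    }
}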

Check here for how dma_map gets used in the picoevb-rdma sample:

Thanks.

