Slow remote DMA write and read

Hi all,

I am working on remote DMA write and read between two TX2 boards, each connected to an IDT PCIe-to-SRIO carrier board.

DMA write and read worked fine when tested between two PCs.
However, with the same IDT PCIe-to-SRIO carrier boards connected to the TX2 boards, DMA write and read are much slower:
each write / read only succeeds after several hundred milliseconds.

Please note that SMMU is disabled.
Could you please let me know whether disabling SMMU degrades TX2 performance?

From one of your discussion threads, I learned that the jetson_clocks.sh script improves system performance.
Is it recommended to apply this script to speed up DMA operations?

Thanks in advance !!

Hi PBang,

Yes, running the jetson_clocks.sh script is recommended to maximize system performance.
Could you share the before and after results?

Thanks

Hi kayccc,

Thank you for your prompt response.

Here are before and after results:

Before applying jetson_clocks.sh:

nvidia@tegra-ubuntu:~$ sudo ~ubuntu/jetson_clocks.sh --show
SOC family:tegra186 Machine:quill
Online CPUs: 0,3-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu2: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu3: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=140250000 MaxFreq=1122000000 CurrentFreq=140250000
EMC MinFreq=40800000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=0
Fan: speed=0

After applying sudo ./jetson_clocks.sh:

sudo ~ubuntu/jetson_clocks.sh --show
SOC family:tegra186 Machine:quill
Online CPUs: 0,3-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu2: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu3: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=1122000000 MaxFreq=1122000000 CurrentFreq=1122000000
EMC MinFreq=40800000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=1
Fan: speed=255

There was a slight improvement in the DMA operations, but they are still not as fast as on the Ubuntu 16.04 PCs.
Please let me know whether disabling SMMU degrades TX2 performance.
Are there any other factors that would improve system performance?

Thanks in advance!

Disabling SMMU helps improve performance only by a slight margin, and only under certain circumstances:

    • If there are a lot of mappings/un-mappings of small buffers through the streaming DMA APIs (as is typical of network cards), perf takes a small hit because the SMMU framework takes time to update its page tables.

Having said that, if the data accessed (read/write) by the PCIe endpoint device is also accessed by the CPU, then keeping SMMU disabled would decrease overall perf. The reason is that with SMMU enabled, any allocation/mapping to be used by the PCIe endpoint device is presented to the CPU as a cached region, and coherency is maintained at the hardware level. With SMMU disabled, allocations/mappings to be used by the PCIe endpoint device are presented as uncached regions. So, if the method you are using to calculate perf involves the CPU reading/writing the memory regions that are also used by the PCIe endpoint device, then a dip in perf is expected.
Does your perf calculation method fall into the latter category?
Also, what is the difference in perf between SMMU enabled and disabled cases?
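
For reference, a minimal sketch of the streaming-DMA pattern mentioned above (the device pointer and packet buffer are placeholders for a hypothetical endpoint driver, nothing TX2-specific):

#include <linux/dma-mapping.h>

static int send_one_packet(struct device *dev, void *pkt, size_t len)
{
        dma_addr_t iova;

        /* Map the CPU buffer for device access; with SMMU enabled this is
         * where its page tables get updated, which is the per-buffer cost. */
        iova = dma_map_single(dev, pkt, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, iova))
                return -ENOMEM;

        /* ... program 'iova' into the endpoint and start the transfer ... */

        /* Unmap once the transfer completes (page tables updated again) */
        dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
        return 0;
}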

Hi Vidyas,

Thank you for the detailed explanation.

Our test application involves remote DMA read / write to memory.
With SMMU enabled, the application could not be tested because virt_to_phys() returned a wrong address.

Therefore, we disabled SMMU so that the physical address and the bus address are the same.
Please note that we also increased the size of the memory pool for atomic allocations made by dma_alloc_coherent(..., GFP_ATOMIC) by adding CONFIG_DEFAULT_DMA_COHERENT_POOL_SIZE=33554432 to the defconfig file.

Is it advisable to replace virt_to_phys() with dma_map_single() or another API to improve performance or speed up DMA operations?

Thanks in advance!

Yes.
Please keep SMMU enabled and use the dma_alloc_* or dma_map_* DMA APIs.

With SMMU enabled, the application could not be tested because virt_to_phys() returned a wrong address.
If you use the dma_alloc_coherent() API, it returns both the bus address (which can be given to the endpoint so it can dump data into system memory) and the CPU virtual address, which lets the CPU access the same memory.
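
A minimal sketch of that usage (the device pointer and size are placeholders; GFP_ATOMIC also works if the call is made from atomic context):

#include <linux/dma-mapping.h>

static void *alloc_shared_buffer(struct device *dev, size_t size,
                                 dma_addr_t *bus_addr)
{
        /* The CPU accesses the memory through the returned virtual address,
         * the PCIe endpoint through '*bus_addr' (the IOVA). No
         * virt_to_phys() is needed anywhere. */
        return dma_alloc_coherent(dev, size, bus_addr, GFP_KERNEL);
}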

Hi vidyas,

I enabled SMMU and used dma_alloc_coherent() to get the kernel virtual address and the DMA address.

Here is the snippet for your reference:

res->KernelAddress = (u64)dma_alloc_coherent( &DevExt->pdev->dev, size, &res->BusAddress, GFP_ATOMIC );
if (res->KernelAddress == 0) {
        return -ENOMEM;        /* allocation failed */
}
res->PhysicalAddress = (u64)virt_to_phys( (void*)res->KernelAddress );

But to fetch PhysicalAddress, virt_to_phys() cannot be used.
I also learnt that PhysicalAddress can be calculated by adding BAR address to DMA address.
Can that be done?

Here are BAR details:

BUS/DEV/FUNC : 1 / 0 / 0
[BAR0] : 0x50800000 - 0x50880000
[BAR1] : 0x51000000 - 0x52000000
[BAR2] : 0x58000000 - 0x59000000
[BAR4] : 0x52000000 - 0x53000000

I also executed sudo nvpmodel -m 0 to improve performance, but it didn't help much.
Does performance really relate to address mapping, or are there other factors?

I apologize for troubling you !!
But, please do the needful !!

But to fetch PhysicalAddress, virt_to_phys() cannot be used.
True that.

BTW, what is the need to get the physical address equivalent? After all, in this case, only two entities are accessing that location… i.e.
(1) CPU - accessing locally
(2) PCIe endpoint accessing through PCIe bus
Here, (1) needs CPU virtual address and (2) needs IOVA. Can you please elaborate on the need to get physical address equivalent?

I also learnt that PhysicalAddress can be calculated by adding BAR address to DMA address. Can that be done?
There is no relation between a BAR address and an address allocated through the DMA APIs.

I'm not clear on why we are mixing BARs with addresses allocated by the DMA APIs. BAR addresses are where the endpoint's internal memories are made visible to the host system, whereas memory allocated by the DMA APIs is where the endpoint would dump data to, or read data from (provided the IOVA equivalent is programmed into the endpoint's registers).
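
To make the distinction concrete, here is a rough sketch for a hypothetical endpoint driver. The DMA_ADDR_LO/HI register offsets are made up for illustration; the real offsets would come from the endpoint's (e.g. Tsi721) datasheet.

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/dma-mapping.h>

#define DMA_ADDR_LO   0x100   /* hypothetical endpoint register offsets */
#define DMA_ADDR_HI   0x104

static int setup_dma_target(struct pci_dev *pdev, size_t size)
{
        void __iomem *regs;   /* BAR0: window into the endpoint's registers */
        void *cpu_addr;       /* system memory, as seen by the CPU */
        dma_addr_t iova;      /* the same memory, as seen by the endpoint */

        regs = pci_iomap(pdev, 0, 0);
        if (!regs)
                return -ENOMEM;

        cpu_addr = dma_alloc_coherent(&pdev->dev, size, &iova, GFP_KERNEL);
        if (!cpu_addr) {
                pci_iounmap(pdev, regs);
                return -ENOMEM;
        }

        /* Tell the endpoint where (in IOVA terms) to dump its data */
        writel(lower_32_bits(iova), regs + DMA_ADDR_LO);
        writel(upper_32_bits(iova), regs + DMA_ADDR_HI);

        return 0;
}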

Hi Vidyas,

Thanks for the detailed explanation.
Our application requires three kinds of addresses (PCIe, Kernel and Physical).

Since there is no IOMMU on the PC, the PCIe address is the same as the physical address.
Also, on the TX2, because there was a problem with SMMU and it was disabled, we explicitly assigned PCIe address to Physical.

However, there is a mismatch in the remote DMA write / read operation, and a performance / addressing problem is suspected, so we plan to use an API to calculate the physical address instead of assigning PCIe address to it.

Here is the snippet of our user application for your reference:

        // Allocate a physical memory block for the Ping-Pong communication
        local.memSize = PINGPONG_DATA_SIZE;
        status = sblib_AllocMemory( hDrv, local.memSize, &local.memPhysAdrs, &local.memBusAdrs );
        if( status != SRLIB_NO_ERROR ){
                printf("Physical Memory Allocation Failed!!!\n");
                goto PP_ERROR;
        }
  
        // Map a physical memory block to virtual space

        status = sblib_MapMemory( hDrv, local.memPhysAdrs, local.memSize, (PVOID*)&local.hSharedMemory, SREB_MM_NONCACHED );
        if( status != SRLIB_NO_ERROR ){
                printf("Virtual Memory Mapping Failed!!!\n");
                goto PP_ERROR;
        }
        
        // DMA Write
        status = sblib_SrioMemDmaWriteRaw(   hDrv,
                                             DMA_WAIT_COMPLETION,
                                             partner.devId,
                                             PINGPONG_DMA_CH_0,
                                             local.memBusSrc,
                                             partner.memBusAdrs,
                                             local.memSize,
                                             0);

Also, in order to use the custom driver and application instead of the tsi721_mport driver, we modified the configuration file as follows:

arch/arm64/configs/tegra18_defconfig
+CONFIG_RAPIDIO=y
+CONFIG_RAPIDIO_TSI721=n

Before modifying the config file, we noticed from the lsmod output that the rapidio module was used by the tsi721_mport driver.

However, after these changes, the rapidio module is no longer listed in the lsmod output.
I wonder if this will cause DMA write / read problems.
Do I need to link my custom driver to rapidio before building the kernel?

Thanks in advance !!

I'm not sure how the RAPIDIO driver is structured. To me, rapidio alone looks more like a framework driver than a real driver that works with a device. You may have to provide your own equivalent of CONFIG_RAPIDIO_TSI721 and enable it in the configs.

Hi Vidyas,

Could you please comment on Physical address calculation?

Our application requires three kinds of addresses (PCIe, Kernel and Physical).

Since there is no IOMMU on the PC, the PCIe address is the same as the physical address.
Also, on the TX2, because there was a problem with SMMU and it was disabled, we explicitly assigned PCIe address to Physical.

However, there is a mismatch in the remote DMA write / read operation, and a performance / addressing problem is suspected,
so we plan to use an API to calculate the physical address instead of assigning PCIe address to it.

I’m assuming the following
PCIe = IOVA
Kernel = CPU kernel virtual address and
Physical = Physical address

what exactly do you mean by “we explicitly assigned PCIe address to Physical”?
Also, I'm not sure what you mean by “API to calculate the physical address instead of assigning PCIe address to it”?
Can you please elaborate on the above?

what exactly do you mean by “we explicitly assigned PCIe address to Physical”?

Since we could not use virt_to_phys(), we made the PCIe and Physical addresses identical (Physical = PCIe).

Also, I'm not sure what you mean by “API to calculate the physical address instead of assigning PCIe address to it”?

As suggested by NVIDIA, we enabled SMMU and would like to calculate the physical address using the IOMMU APIs rather than simply assigning Physical = PCIe.

(Maybe something similar to dma_common_mmap, to avoid “Unhandled fault: level 3 address size fault (0x92000043) at 0x0000007f8ef2f000”.)
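
Roughly what we have in mind is something like the sketch below; the my_dev context structure is only illustrative, not our actual driver code.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>

struct my_dev {
        struct device *dev;
        void *cpu_addr;        /* from dma_alloc_coherent() */
        dma_addr_t dma_addr;
        size_t size;
};

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct my_dev *md = file->private_data;

        /* Let the DMA layer pick the correct pfn and attributes, instead of
         * exporting a physical address and calling remap_pfn_range() ourselves */
        return dma_mmap_coherent(md->dev, vma, md->cpu_addr,
                                 md->dma_addr, md->size);
}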

I'm sorry, but I still don't understand the need to get hold of the physical address. Why do you need it (assuming SMMU is enabled)?
If you want your PCIe endpoint to dump data, you can use the IOVA, and if you want the CPU to access the same location, you can use the CPU kernel virtual address.

Hi Vidyas,

We use the vendor's application, which uses a kernel address mapped to a physical address for DMA write / read.

Also, from cat /proc/iomem,

50800000-52ffffff : PCI Bus 0000:01
80080000-810fafff : Kernel code
8123f000-814b3fff : Kernel data
d9300000-efffffff : System RAM
f0200000-275ffffff : System RAM
276600000-2767fffff : System RAM

But the PCIe address returned by dma_alloc_coherent() is 0x80009000. Isn't this out of the range of the PCI bus, which is 50800000-52ffffff?

Also, almost all the NVIDIA discussion threads recommend disabling SMMU for accurate addressing.
I would like to know whether this issue still exists.

The output from ‘cat /proc/iomem’ shows the MMIO regions of the respective modules. So, PCIe uses the 50800000-52ffffff region to map the endpoint’s configuration space and BARs.
The allocation 0x80009000 you are getting from dma_alloc_coherent(), on the other hand, is an IOVA location in system memory; this is typically given to the endpoint (probably by writing this address to one of its registers through a BAR), thereby letting the endpoint’s DMA dump data to this memory.
I think you are getting confused between BARs and system memory allocations.

BAR :- a resource present in the endpoint device, mapped into the host’s memory space through an aperture meant for mapping endpoint BARs. 50800000-52ffffff is one such aperture. Any read/write to this aperture generates a bus transaction to the endpoint. Basically, this is a window through which you access the endpoint’s internal registers/memory.

Output of dma_alloc_coherent() :- a region in system memory. It can be accessed by the CPU, or by the root port controller on behalf of a read/write request coming from a connected endpoint. How does the endpoint know of this location in system memory? Well, the endpoint’s driver typically makes this allocation and informs the endpoint of its whereabouts (probably by updating its respective BAR registers).
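
As a rough illustration of the difference (assuming 'regs' came from pci_iomap() on a BAR and 'cpu_addr' came from dma_alloc_coherent(); the offsets and sizes are made up):

#include <linux/io.h>
#include <linux/types.h>
#include <linux/string.h>

static void access_examples(void __iomem *regs, void *cpu_addr)
{
        u32 status;

        /* BAR aperture: every access becomes a PCIe transaction to the endpoint */
        status = ioread32(regs + 0x0);
        iowrite32(0x1, regs + 0x4);

        /* dma_alloc_coherent() buffer: ordinary system memory for the CPU;
         * the endpoint reaches the same bytes through the IOVA it was given */
        memset(cpu_addr, 0, 64);
        ((u32 *)cpu_addr)[0] = status;
}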

Thank you for the detailed explanation.

Does it mean that 0x80009000 is valid and it's not a dma_alloc_coherent() issue?

80000000-d82fffff : System RAM
80080000-810fafff : Kernel code
8123f000-814b3fff : Kernel data

Is our configuration similar to the following?

RAM ----- MMU ----- PCI bridge ----- DEVICE
      ^         ^                ^
   physical  virtual            bus
   address   address          address
           (this can also be
            considered a bus
            address!)

Also, almost all the NVIDIA discussion threads recommend disabling SMMU for accurate addressing.
I would like to know whether this is still recommended.

80000000-d82fffff : System RAM
This might be a bit misleading. This is actually where the RAM (external memory) fits in, and this is the ‘physical address’ range of the RAM.

The allocation address 0x80009000 is actually an IOVA and doesn’t really represent a ‘physical address’ as above. It will still end up at some address in RAM, something like 0xFE809000 (I’m just saying).

And in your pictorial representation, the address between MMU<->PCI bridge and PCI bridge<->DEVICE is the same, i.e. the PCI bridge doesn’t do anything to the address coming from the DEVICE. This is both the ‘bus address’ and the ‘IOVA (IO virtual address)’.

Whether or not to disable SMMU depends on your use case. But, we recommend keeping SMMU enabled, so that a device can’t randomly access arbitrary addresses in RAM. But, if your driver is written using physical addresses etc… then, you can disable SMMU.
NOTE:- Any upstreamed driver, if you look, is written using the DMA APIs and works both with and without SMMU enabled.
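
If, purely for debugging, you want to see which physical RAM address an IOVA actually landed at, something like the sketch below can be used on reasonably recent kernels (it is not needed for normal driver operation):

#include <linux/iommu.h>
#include <linux/device.h>
#include <linux/dma-mapping.h>

static phys_addr_t debug_iova_to_phys(struct device *dev, dma_addr_t iova)
{
        struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

        if (!domain)        /* no IOMMU/SMMU: bus address == physical address */
                return (phys_addr_t)iova;

        return iommu_iova_to_phys(domain, iova);
}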

But, if your driver is written using physical addresses etc… then, you can disable SMMU.

Could you please elaborate this point?
Why is disabling SMMU recommended when physical addresses are used?

However, I enabled SMMU and added dma_alloc_coherent(), but remote DMA still failed.

drivers/pci/host/pci-tegra.c
+msi->pages = __get_free_pages(GFP_DMA32, 0);

arch/arm64/configs/tegra18_defconfig
+CONFIG_DEFAULT_DMA_COHERENT_POOL_SIZE=33554432

kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
+<&{/pcie-controller@10003000} TEGRA_SID_AFI>,

+#stream-id-cells = <1>;

I am not able to confirm whether the issue is related to performance or addressing.
NOTE: if SMMU is disabled and a sleep of 1 to 2 seconds is added to the application, remote DMA operations succeed.

Well, typically, drivers are not supposed to be written using physical addresses directly. What this also means is that drivers should always be written using the DMA APIs. All upstreamed PCIe device drivers are written this way.
With this, the device driver doesn't really have to care whether SMMU is enabled or disabled on a platform. If SMMU is enabled, the bus address is different from the physical address; with SMMU disabled, the bus address equals the physical address.
But if, for whatever reason, you are using physical addresses directly (using macros like virt_to_phys(), etc.), then having SMMU enabled won't work.
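
A minimal sketch of the contrast (the buffer and length names are illustrative, and error handling with dma_mapping_error() is omitted):

#include <linux/dma-mapping.h>
#include <linux/io.h>

static dma_addr_t get_device_address(struct device *dev, void *buf, size_t len)
{
#ifdef USE_DMA_API   /* hypothetical switch, for illustration only */
        /* Portable: the DMA layer returns an IOVA (SMMU enabled) or the
         * physical address (SMMU disabled), whichever the device must use */
        return dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);
#else
        /* Only valid when bus address == physical address, i.e. SMMU disabled */
        return (dma_addr_t)virt_to_phys(buf);
#endif
}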
What do you mean by “when you enabled SMMU, DMA still failed”? SMMU is enabled by default in the release. What did you do extra to enable it again?