Slow remote DMA write and read

Hi all,

I am working on remote DMA write and read between two TX2 boards, each connected to an IDT PCIe-to-SRIO carrier board.

DMA write and read worked fine when tested between two PCs.
However, with the same IDT PCIe-to-SRIO carrier boards connected to the TX2 boards, DMA write and read are much slower:
each write / read only succeeds after several hundred milliseconds.

Please note that SMMU is disabled.
Could you please let me know whether disabling SMMU degrades TX2 performance?

From one of your discussion threads, I learned that the jetson_clocks.sh script improves system performance.
Is it recommended to apply this script to speed up DMA operations?

Thanks in advance !!

Hi PBang,

Yes, running the jetson_clocks.sh script is recommended to maximize system performance.
Could you share the before and after results?

Thanks

Hi kayccc,

Thank you for your prompt response.

Here are before and after results:

Before applying jetson_clocks.sh:

nvidia@tegra-ubuntu:~$ sudo ~ubuntu/jetson_clocks.sh --show
SOC family:tegra186 Machine:quill
Online CPUs: 0,3-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu2: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu3: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=140250000 MaxFreq=1122000000 CurrentFreq=140250000
EMC MinFreq=40800000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=0
Fan: speed=0

After applying sudo ./jetson_clocks.sh:

sudo ~ubuntu/jetson_clocks.sh --show
SOC family:tegra186 Machine:quill
Online CPUs: 0,3-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu2: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=1267200
cpu3: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=1122000000 MaxFreq=1122000000 CurrentFreq=1122000000
EMC MinFreq=40800000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=1
Fan: speed=255

There was a slight improvement in the DMA operations, but they are still not as fast as on the Ubuntu 16.04 PCs.
Please let me know whether disabling SMMU degrades TX2 performance.
Are there any other factors that would improve system performance?

Thanks in advance!

Disabling SMMU helps improve performance only by a slight margin, and only under certain circumstances:

    • If there are a lot of mappings/un-mappings of small buffers through the streaming DMA APIs (as is typical of network cards), perf takes a small hit because the SMMU framework takes time to update its page tables.

Having said that, if the data accessed (read/write) by the PCIe endpoint device is also accessed by the CPU, then keeping SMMU disabled would decrease overall perf. The reason is that with SMMU enabled, any allocation/mapping to be used by the PCIe endpoint device is presented to the CPU as a cached region, and coherency is maintained at the hardware level. With SMMU disabled, allocations/mappings to be used by the PCIe endpoint device are presented as uncached regions. So, if the method you are using to calculate perf involves the CPU reading/writing the memory regions that are also used by the PCIe endpoint device, then a dip in perf is expected.
Does your perf calculation method fall into the latter category?
Also, what is the difference in perf between SMMU enabled and disabled cases?
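
For reference, a minimal sketch of the streaming-DMA pattern mentioned above (the device pointer and packet buffer are placeholders for a hypothetical endpoint driver, nothing TX2-specific):

#include <linux/dma-mapping.h>

static int send_one_packet(struct device *dev, void *pkt, size_t len)
{
        dma_addr_t iova;

        /* Map the CPU buffer for device access; with SMMU enabled this is
         * where its page tables get updated, which is the per-buffer cost. */
        iova = dma_map_single(dev, pkt, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, iova))
                return -ENOMEM;

        /* ... program 'iova' into the endpoint and start the transfer ... */

        /* Unmap once the transfer completes (page tables updated again) */
        dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
        return 0;
}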

Hi Vidyas,

Thank you for the detailed explanation.

Our test application involves remote DMA read / write to memory.
With SMMU enabled, the application could not be tested because virt_to_phys() returned a wrong address.

Therefore, we disabled SMMU so that the physical address and the bus address are the same.
Please note that we also increased the size of the memory pool for atomic allocations made by dma_alloc_coherent(..., GFP_ATOMIC) by adding CONFIG_DEFAULT_DMA_COHERENT_POOL_SIZE=33554432 to the defconfig file.

Is it advisable to replace virt_to_phys() with dma_map_single() or another API to improve performance or speed up DMA operations?

Thanks in advance!

Yes.
Please keep SMMU enabled and use the dma_alloc_* or dma_map_* DMA APIs.

With SMMU enabled, the application could not be tested because virt_to_phys() returned a wrong address.
If you use the dma_alloc_coherent() API, it returns both the bus address (which can be given to the endpoint so it can dump data into system memory) and the CPU virtual address, which lets the CPU access the same memory.
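
A minimal sketch of that usage (the device pointer and size are placeholders; GFP_ATOMIC also works if the call is made from atomic context):

#include <linux/dma-mapping.h>

static void *alloc_shared_buffer(struct device *dev, size_t size,
                                 dma_addr_t *bus_addr)
{
        /* The CPU accesses the memory through the returned virtual address,
         * the PCIe endpoint through '*bus_addr' (the IOVA). No
         * virt_to_phys() is needed anywhere. */
        return dma_alloc_coherent(dev, size, bus_addr, GFP_KERNEL);
}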

Hi vidyas,

I enabled SMMU and used dma_alloc_coherent() to get the kernel virtual address and the DMA address.

Here is the snippet for your reference:

res->KernelAddress = (u64)dma_alloc_coherent( &DevExt->pdev->dev, size, &res->BusAddress, GFP_ATOMIC );
if (res->KernelAddress == 0) {
        return -ENOMEM;        /* allocation failed */
}
res->PhysicalAddress = (u64)virt_to_phys( (void*)res->KernelAddress );

But to fetch PhysicalAddress, virt_to_phys() cannot be used.
I also learnt that PhysicalAddress can be calculated by adding BAR address to DMA address.
Can that be done?

Here are BAR details:

BUS/DEV/FUNC : 1 / 0 / 0
[BAR0] : 0x50800000 - 0x50880000
[BAR1] : 0x51000000 - 0x52000000
[BAR2] : 0x58000000 - 0x59000000
[BAR4] : 0x52000000 - 0x53000000

I also executed sudo nvpmodel -m 0 to improve performance, but it didn't help much.
Does performance really relate to address mapping, or are there other factors?

I apologize for troubling you !!
But, please do the needful !!

But to fetch PhysicalAddress, virt_to_phys() cannot be used.
True that.

BTW, what is the need to get the physical address equivalent? After all, in this case, only two entities are accessing that location… i.e.
(1) CPU - accessing locally
(2) PCIe endpoint accessing through PCIe bus
Here, (1) needs CPU virtual address and (2) needs IOVA. Can you please elaborate on the need to get physical address equivalent?

I also learnt that PhysicalAddress can be calculated by adding BAR address to DMA address. Can that be done?
There is no relation between a BAR address and an address allocated through the DMA APIs.

I'm not clear on why we are mixing BARs with addresses allocated by the DMA APIs. BAR addresses are where the endpoint's internal memories are made visible to the host system, whereas memory allocated by the DMA APIs is where the endpoint would dump data to, or read data from (provided the IOVA equivalent is programmed into the endpoint's registers).
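
To make the distinction concrete, here is a rough sketch for a hypothetical endpoint driver. The DMA_ADDR_LO/HI register offsets are made up for illustration; the real offsets would come from the endpoint's (e.g. Tsi721) datasheet.

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/dma-mapping.h>

#define DMA_ADDR_LO   0x100   /* hypothetical endpoint register offsets */
#define DMA_ADDR_HI   0x104

static int setup_dma_target(struct pci_dev *pdev, size_t size)
{
        void __iomem *regs;   /* BAR0: window into the endpoint's registers */
        void *cpu_addr;       /* system memory, as seen by the CPU */
        dma_addr_t iova;      /* the same memory, as seen by the endpoint */

        regs = pci_iomap(pdev, 0, 0);
        if (!regs)
                return -ENOMEM;

        cpu_addr = dma_alloc_coherent(&pdev->dev, size, &iova, GFP_KERNEL);
        if (!cpu_addr) {
                pci_iounmap(pdev, regs);
                return -ENOMEM;
        }

        /* Tell the endpoint where (in IOVA terms) to dump its data */
        writel(lower_32_bits(iova), regs + DMA_ADDR_LO);
        writel(upper_32_bits(iova), regs + DMA_ADDR_HI);

        return 0;
}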

Hi Vidyas,

Thanks for the detailed explanation.
Our application requires three kinds of addresses (PCIe, Kernel and Physical).

Since there is no IOMMU on the PC, the PCIe address is the same as the physical address.
Also, on the TX2, because there was a problem with SMMU and it was disabled, we explicitly assigned PCIe address to Physical.

However, there is a mismatch in the remote DMA write / read operation, and a performance / addressing problem is suspected, so we plan to use an API to calculate the physical address instead of assigning PCIe address to it.

Here is the snippet of our user application for your reference:

        // Allocate a physical memory block for the Ping-Pong communication
        local.memSize = PINGPONG_DATA_SIZE;
        status = sblib_AllocMemory( hDrv, local.memSize, &local.memPhysAdrs, &local.memBusAdrs );
        if( status != SRLIB_NO_ERROR ){
                printf("Physical Memory Allocation Failed!!!\n");
                goto PP_ERROR;
        }
  
        // Map a physical memory block to virtual space

        status = sblib_MapMemory( hDrv, local.memPhysAdrs, local.memSize, (PVOID*)&local.hSharedMemory, SREB_MM_NONCACHED );
        if( status != SRLIB_NO_ERROR ){
                printf("Virtual Memory Mapping Failed!!!\n");
                goto PP_ERROR;
        }
        
        // DMA Write
        status = sblib_SrioMemDmaWriteRaw(   hDrv,
                                             DMA_WAIT_COMPLETION,
                                             partner.devId,
                                             PINGPONG_DMA_CH_0,
                                             local.memBusSrc,
                                             partner.memBusAdrs,
                                             local.memSize,
                                             0);

Also, in order to use the custom driver and application instead of the tsi721_mport driver, we modified the configuration file as follows:

arch/arm64/configs/tegra18_defconfig
+CONFIG_RAPIDIO=y
+CONFIG_RAPIDIO_TSI721=n

Before modifying the config file, we noticed from the lsmod output that the rapidio module was used by the tsi721_mport driver.

However, after these changes, the rapidio module is no longer listed in the lsmod output.
I wonder if this will cause DMA write / read problems.
Do I need to link my custom driver to rapidio before building the kernel?

Thanks in advance !!

I'm not sure how the RAPIDIO driver is structured. To me, rapidio alone looks more like a framework driver than a real driver that works with a device. You may have to provide your own equivalent of CONFIG_RAPIDIO_TSI721 and enable it in the configs.

Hi Vidyas,

Could you please comment on Physical address calculation?

Our application requires three kinds of addresses (PCIe, Kernel and Physical).

Since there is no IOMMU on the PC, the PCIe address is the same as the physical address.
Also, on the TX2, because there was a problem with SMMU and it was disabled, we explicitly assigned PCIe address to Physical.

However, there is a mismatch in the remote DMA write / read operation, and a performance / addressing problem is suspected,
so we plan to use an API to calculate the physical address instead of assigning PCIe address to it.

I’m assuming the following
PCIe = IOVA
Kernel = CPU kernel virtual address and
Physical = Physical address

what exactly do you mean by “we explicitly assigned PCIe address to Physical”?
Also, I'm not sure what you mean by “API to calculate the physical address instead of assigning PCIe address to it”?
Can you please elaborate on the above?

what exactly do you mean by “we explicitly assigned PCIe address to Physical”?

Since we could not use virt_to_phys(), we made the PCIe and Physical addresses identical (Physical = PCIe).

Also, I'm not sure what you mean by “API to calculate the physical address instead of assigning PCIe address to it”?

As suggested by NVIDIA, we enabled SMMU and would like to calculate the physical address using the IOMMU APIs rather than simply assigning Physical = PCIe.

(Maybe something similar to dma_common_mmap, to avoid “Unhandled fault: level 3 address size fault (0x92000043) at 0x0000007f8ef2f000”.)
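
Roughly what we have in mind is something like the sketch below; the my_dev context structure is only illustrative, not our actual driver code.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>

struct my_dev {
        struct device *dev;
        void *cpu_addr;        /* from dma_alloc_coherent() */
        dma_addr_t dma_addr;
        size_t size;
};

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct my_dev *md = file->private_data;

        /* Let the DMA layer pick the correct pfn and attributes, instead of
         * exporting a physical address and calling remap_pfn_range() ourselves */
        return dma_mmap_coherent(md->dev, vma, md->cpu_addr,
                                 md->dma_addr, md->size);
}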

I'm sorry, but I still don't understand the need to get hold of the physical address. Why do you need it (assuming SMMU is enabled)?
If you want your PCIe endpoint to dump data, you can use the IOVA, and if you want the CPU to access the same location, you can use the CPU kernel virtual address.

Hi Vidyas,

We use the vendor's application, which uses a kernel address mapped to a physical address for DMA write / read.

Also, from cat /proc/iomem,

50800000-52ffffff : PCI Bus 0000:01
80080000-810fafff : Kernel code
8123f000-814b3fff : Kernel data
d9300000-efffffff : System RAM
f0200000-275ffffff : System RAM
276600000-2767fffff : System RAM

But the PCIe address returned by dma_alloc_coherent() is 0x80009000. Isn't this out of the range of the PCI bus, which is 50800000-52ffffff?

Also, almost all the NVIDIA discussion threads recommend disabling SMMU for accurate addressing.
I would like to know whether this issue still exists.

The output from ‘cat /proc/iomem’ shows the MMIO regions of the respective modules. So, PCIe uses the 50800000-52ffffff region to map the endpoint’s configuration space and BARs.
The allocation 0x80009000 you are getting from dma_alloc_coherent(), on the other hand, is an IOVA location in system memory; this is typically given to the endpoint (probably by writing this address to one of its registers through a BAR), thereby letting the endpoint’s DMA dump data to this memory.
I think you are getting confused between BARs and system memory allocations.

BAR :- a resource present in the endpoint device, mapped into the host’s memory space through an aperture meant for mapping endpoint BARs. 50800000-52ffffff is one such aperture. Any read/write to this aperture generates a bus transaction to the endpoint. Basically, this is a window through which you access the endpoint’s internal registers/memory.

Output of dma_alloc_coherent() :- a region in system memory. It can be accessed by the CPU, or by the root port controller on behalf of a read/write request coming from a connected endpoint. How does the endpoint know of this location in system memory? Well, the endpoint’s driver typically makes this allocation and informs the endpoint of its whereabouts (probably by updating its respective BAR registers).
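
As a rough illustration of the difference (assuming 'regs' came from pci_iomap() on a BAR and 'cpu_addr' came from dma_alloc_coherent(); the offsets and sizes are made up):

#include <linux/io.h>
#include <linux/types.h>
#include <linux/string.h>

static void access_examples(void __iomem *regs, void *cpu_addr)
{
        u32 status;

        /* BAR aperture: every access becomes a PCIe transaction to the endpoint */
        status = ioread32(regs + 0x0);
        iowrite32(0x1, regs + 0x4);

        /* dma_alloc_coherent() buffer: ordinary system memory for the CPU;
         * the endpoint reaches the same bytes through the IOVA it was given */
        memset(cpu_addr, 0, 64);
        ((u32 *)cpu_addr)[0] = status;
}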

Thank you for the detailed explanation.

Does it mean that 0x80009000 is valid and it's not a dma_alloc_coherent() issue?

80000000-d82fffff : System RAM
80080000-810fafff : Kernel code
8123f000-814b3fff : Kernel data

Is our configuration similar to the following?

RAM ----- MMU ----- PCI bridge ----- DEVICE
      ^         ^                ^
   physical  virtual            bus
   address   address          address
           (this can also be
            considered a bus
            address!)

Also, almost all the NVIDIA discussion threads recommend disabling SMMU for accurate addressing.
I would like to know whether this is still recommended.

80000000-d82fffff : System RAM
This might be a bit misleading. This is actually where the RAM (external memory) fits in, and this is the ‘physical address’ range of the RAM.

The allocation address 0x80009000 is actually an IOVA and doesn’t really represent a ‘physical address’ as above. It will still end up at some address in RAM, something like 0xFE809000 (I’m just saying).

And in your pictorial representation, the address between MMU<->PCI bridge and PCI bridge<->DEVICE is the same, i.e. the PCI bridge doesn’t do anything to the address coming from the DEVICE. This is both the ‘bus address’ and the ‘IOVA (IO virtual address)’.

Whether or not to disable SMMU depends on your use case. But, we recommend keeping SMMU enabled, so that a device can’t randomly access arbitrary addresses in RAM. But, if your driver is written using physical addresses etc… then, you can disable SMMU.
NOTE:- Any upstreamed driver, if you look, is written using the DMA APIs and works both with and without SMMU enabled.
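
If, purely for debugging, you want to see which physical RAM address an IOVA actually landed at, something like the sketch below can be used on reasonably recent kernels (it is not needed for normal driver operation):

#include <linux/iommu.h>
#include <linux/device.h>
#include <linux/dma-mapping.h>

static phys_addr_t debug_iova_to_phys(struct device *dev, dma_addr_t iova)
{
        struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

        if (!domain)        /* no IOMMU/SMMU: bus address == physical address */
                return (phys_addr_t)iova;

        return iommu_iova_to_phys(domain, iova);
}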

But, if your driver is written using physical addresses etc… then, you can disable SMMU.

Could you please elaborate this point?
Why is disabling SMMU recommended when physical addresses are used?

However, I enabled SMMU and added dma_alloc_coherent(), but remote DMA still failed.

drivers/pci/host/pci-tegra.c
+msi->pages = __get_free_pages(GFP_DMA32, 0);

arch/arm64/configs/tegra18_defconfig
+CONFIG_DEFAULT_DMA_COHERENT_POOL_SIZE=33554432

kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
+<&{/pcie-controller@10003000} TEGRA_SID_AFI>,

+#stream-id-cells = <1>;

I am not able to confirm whether the issue is related to performance or addressing.
NOTE: if SMMU is disabled and a sleep of 1 to 2 seconds is added to the application, remote DMA operations succeed.

Well, typically, drivers are not supposed to be written using physical addresses directly. What this also means is that drivers should always be written using the DMA APIs. All upstreamed PCIe device drivers are written this way.
With this, the device driver doesn't really have to care whether SMMU is enabled or disabled on a platform. If SMMU is enabled, the bus address is different from the physical address; with SMMU disabled, the bus address equals the physical address.
But if, for whatever reason, you are using physical addresses directly (using macros like virt_to_phys(), etc.), then having SMMU enabled won't work.
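
A minimal sketch of the contrast (the buffer and length names are illustrative, and error handling with dma_mapping_error() is omitted):

#include <linux/dma-mapping.h>
#include <linux/io.h>

static dma_addr_t get_device_address(struct device *dev, void *buf, size_t len)
{
#ifdef USE_DMA_API   /* hypothetical switch, for illustration only */
        /* Portable: the DMA layer returns an IOVA (SMMU enabled) or the
         * physical address (SMMU disabled), whichever the device must use */
        return dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);
#else
        /* Only valid when bus address == physical address, i.e. SMMU disabled */
        return (dma_addr_t)virt_to_phys(buf);
#endif
}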
What do you mean by “when you enabled SMMU, DMA still failed”? SMMU is enabled by default in the release. What did you do extra to enable it again?