PCIE DMA Problem between TX2 & FPGA

Hi, all.
I have recently been working on an “FPGA + GPU” platform, where the FPGA and the GPU (TX2) are connected through a PCIe Gen2 x4 bus.

by executing

              'lspci -vv'

we observed our FPGA (a pcie endpoint device),

               Xilinx Memory Controller, 7024, 10EE
               link cap = Gen2 x4 MaxPayloadSize = 128B
               ... ...

this implies the FPGA has been recognized by the TX2 through the PCIE Gen2 x4 bus.

But during driver development we have encountered a problem with PCIe Master Write,

i.e.,
the FPGA (the DMA master) actively writes to the GPU (TX2).

The workflow is as follows:

I.   in the driver, we allocated a DMA consistent buffer via (Linux DMA API function)

               virAddr = pci_alloc_consistent(pdev, 4096, &busAddr);
     
     where,
               'virAddr' is the kernel virtual address
               'pdev'    is the pointer to the device data structure (representing the PCI device, i.e., the FPGA).
               '4096'    the DMA test assumes 4096-byte Master DMA Write transaction.
               'busAddr' is the container keeping the BUS address.

     Please note that our device uses 64-bit addresses for the Master DMA Write transaction,
     and in the earlier part of our driver probing procedure, we have also called
   
               pci_set_dma_mask(pdev, DMA_BIT_MASK(64))

     which means the TX2 architecture allows 64-bit addressing with our device.

     Please also note a strange problem: the bus address returned by the TX2 is always '0x0000-0000-8000-0000'
     regardless of the method we choose for DMA buffer allocation. In fact, we tried a lot of alternatives including
               
               a:
               virAddr = (void *)__get_free_pages(GFP_KERNEL, 0);
               busAddr = pci_map_single(pdev, virAddr, 4096, PCI_DMA_FROMDEVICE);

               b:
               virAddr = pci_alloc_consistent(pdev, 4096, &busAddr);

     but, in either case, the returned bus address 'busAddr' is always the constant value '0x0000-0000-8000-0000'.
     Intuitively, I think something is wrong here.

II.  Pass the returned 'busAddr' (as u64) and the 'length in byte' (4096 in our test) to the FPGA through PCI-BAR-0 
     memory region. We have successfully observed that the corresponding registers in the FPGA held the requested
     values (i.e., the configured 'busAddr', and 'length in byte').

III. Then start the Master DMA Write (a repeating sequence of incrementing numbers: 0x0, 0x1, 0x2, 0x3, ..., 0x1F, 
     0x0, 0x1, 0x2, 0x3, ..., 0x1F, 0x0, 0x1, 0x2, 0x3, ..., 0x1F, 0x0, 0x1, 0x2, 0x3, ...) by setting the
     DMA-WRITE-ENABLE bit in our FPGA through BAR-0 access.

IV.  We have observed, in the FPGA, that correct TLP packets (MaxPayloadSize = 128B) are generated and submitted to 
     the TX2, i.e., 

     a sequence of 4096/128 = 32 MWr packets of 128B payload as follows:

     MWr packet 00:   0x60000020 0x010000FF 0x00000000 0x80000000 0x00000000 0x00000001 ... 0x0000001F
     MWr packet 01:   0x60000020 0x010000FF 0x00000000 0x80000080 0x00000000 0x00000001 ... 0x0000001F
     MWr packet 02:   0x60000020 0x010000FF 0x00000000 0x80000100 0x00000000 0x00000001 ... 0x0000001F
     ...
     MWr packet 31:   0x60000020 0x010000FF 0x00000000 0x80000F80 0x00000000 0x00000001 ... 0x0000001F

     where, 0x60000020 indicates that each MWr packet is a Memory Write (FPGA writes data to TX2's memory)
     packet of
           
           #1: 128 bytes (20 means, 0x20 32bit words, i.e., 128B, as noted in the PCIE spec v2.1)
           
           #2: 4DW header (i.e., TLP header is composed of 4 32bit word), i.e., 
                     0x60000020 0x010000FF 0x00000000 0x8000XXXX
               where XXXX represents the varying address offset of each MWr packet

           #3: 6, or in binary form '0-11-00000', indicates that the TLP has a 4DW header and carries payload data,
               which is consistent with #1 and #2
   
           #4: 0100 corresponds to the 'bus number & device number & function number', which is verified
               to be correct; otherwise, the 'busAddr' and 'length in byte' could not be configured through
               BAR-0 access at all (we issue reads to these registers, which cause the FPGA to send
               back the register contents in CplD TLP packets, and these carry the combined
               'bus number & device number & function number'. Since we successfully read the configured
               values, 0100 must be valid and correct)

     in each MWr packet, 0x00000000 0x00000001 ... 0x0000001F, are generated as the 128B payload

     It is clear from the above that the FPGA works fine.


V.   BUT it turned out that the TX2 hangs as soon as the FPGA initiates the DMA Master Write, which the
     Linux driver enables through BAR-0 register access (setting the DMA-Master-Write start bit to 1).

So, can anyone help me with this problem? It does not make sense at all…

Oops, I found there was a “Logic-Timing-Bug” in the FPGA design. I have now fixed the bug, and the DMA Master Write transaction seems to work fine while running continuously.

BUT, I have encountered another problem:


 our DMA buffer is allocated this way:

     virAddr = pci_alloc_consistent(pdev, 4096, &busAddr);

 and FPGA writes consecutive numbers: 

     0x0, 0x1, 0x2, 0x3, ..., 0x1F,
     0x0, 0x1, 0x2, 0x3, ..., 0x1F,
     0x0, 0x1, 0x2, 0x3, ..., 0x1F,
     0x0, 0x1, 0x2, 0x3, ...
     ...

     To 'busAddr'

 upon the completion of DMA Master Write, we performed a check over the allocated DMA buffer, i.e.,

     ((u64*)virAddr)[0],
     ((u64*)virAddr)[1],
     ((u64*)virAddr)[2],
     ...
 They turned out to be all 0....

Could this be caused by a cache coherency issue?
If so, what might be the solution?

If not, what are the correct measures to take for allocating DMA buffers and obtaining the ‘PCI-BUS-Address’
that corresponds to the obtained buffer?

The bus address range for PCIe is reserved from 0x8000_0000 to 0xFFF0_0000. So, whether the memory is allocated directly or mapped later on, the IOVA always comes from this region, and your observation is perfectly fine.
Coming to the second part, how did you get ‘pdev’ pointer in the above case?

hi, if SMMU is disabled, what will happen?

can we use what is returned from virt_to_phys(__get_free_pages()) as the bus address?

and how can we make sure virt_to_phys(__get_free_pages()) to be 64bit address ?

hi, if SMMU is disabled, what will happen?
In that case both IOVA and Physical address would be same

can we use what is returned from virt_to_phys(__get_free_pages()) as the bus address?
Yes.

and how can we make sure virt_to_phys(__get_free_pages()) to be 64bit address ?
I don’t think that is possible, because, if free pages are available in < 4GB region, then the address returned by get_free_pages() is going to be a 32-bit address. BTW, any specific requirement that the address has to be a 64-bit address? I mean, if your DMA is able to deal with 64-bit addresses, it can as well deal with 32-bit addresses. Am I missing anything here?

Hi, vidyas manito:

1, could you please point out which DTS file should be modified to disable the SMMU,
and which lines to mask (or remove)?

2, right, we have to use a 64-bit DMA address, since the PCIe TLP (Transaction Layer Packet) requires that
the upper 32 bits be non-zero once 64-bit addressing is used; this is compulsory as per the PCIe spec,
e.g.,
address like: 0x0000-0000-1234-5678
can not use 64bit addressing to form a TLP

address like: 0x0010-0000-XXXX-XXXX or 0x0001-0010-XXXX-XXXX
as long as the upper 32bit is non zero

can be used to construct TLP using 64bit addressing (also known as 4DW (4 32bit word) memory request header)

From this point of view, we have to use a 64-bit bus address (or physical address when the SMMU is disabled) since our hardware
uses 64-bit addressing (also known as DAC, i.e., Dual Address Cycle mode, in PCI/PCI-X terms)

3, Linux api does provide a mechanism to guarantee 64bit addressing:
call
‘pci_set_dma_mask(pdev, DMA_BIT_MASK(64))’
then
‘pci_set_consistent_dma_mask(pdev,DMA_BIT_MASK(64) )’

then it is guaranteed that:
‘pci_alloc_consistent’ will return a 64bit bus address (of ‘dma_addr_t’ type)

when the SMMU is enabled, as you said, 0x8000-0000 ~ 0xFFF0-0000 will be the target range of bus addresses
(despite the usage of any Linux APIs like ‘pci_set_consistent_dma_mask’, ‘pci_set_dma_mask’,
‘dma_alloc_coherent’ etc…)

but once SMMU is disabled, and after successful calls of

'pci_set_dma_mask(pdev, DMA_BIT_MASK(64))'
&

‘pci_set_consistent_dma_mask(pdev,DMA_BIT_MASK(64) )’

will ‘pci_alloc_consistent’ always return 64bit bus address as expected in the linux doc “DMA-mapping.txt” ?

please help ~

For (1), following change can be used to disable SMMU for PCIe

diff --git a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
index 5c6536b968ab..da6eee63670e 100644
--- a/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
+++ b/kernel-dts/tegra186-soc/tegra186-soc-base.dtsi
@@ -186,7 +186,6 @@
                   <&tegra_adsp_audio    TEGRA_SID_APE>,
                   <&{/sound}        TEGRA_SID_APE>,
                   <&{/sound_ref}        TEGRA_SID_APE>,
-                  <&{/pcie-controller@10003000} TEGRA_SID_AFI>,
                   <&{/ahci-sata@3507000}    TEGRA_SID_SATA2>,
                   <&{/aon@c160000}      TEGRA_SID_AON>,
                   <&{/rtcpu@b000000}    TEGRA_SID_RCE>,
@@ -1509,8 +1508,6 @@
         interrupt-map-mask = <0 0 0 0>;
         interrupt-map = <0 0 0 0 &intc 0 72 0x04>;// check this

-        #stream-id-cells = <1>;
-
         bus-range = <0x00 0xff>;
         #address-cells = <3>;
         #size-cells = <2>;

Regarding (2), is your endpoint designed to use only 64-bit addressing for TLPs (even if the target address fits in 32 bits)? Even in that case, I’m not sure the spec forbids the upper 32 bits from being zero. Could you point to where in the spec this is mentioned?

Regarding (3), mask APIs are to inform kernel what is the max number of address bits end point can address. I don’t think setting a 64-bit mask would guarantee that the returned address is always going to be 64-bit. It is just that system can return 64-bit addresses also for dma_* APIs and end point would still be able to work with those addresses

Regarding (3), please visit this URL:
https://forums.xilinx.com/xlnx/board/crawl_message?board.id=PCIe&message.id=4907

Thanks for your detailed answer, that is quite helpful ~

That is a forum discussion thread. It would be really great if you can point to the same in spec.

In section “2.2.4.1. Address Based Routing Rules”

of the specification “PCI Express® 2.0 Base Specification Revision 0.7”,

There is a rule:

Memory Read Requests and Memory Write Requests can use either format.
• For Addresses below 4 GB, Requesters must use the 32-bit format.

So, 4DW TLP header can be used for organizing the MWr64 request only when the “target address” is indeed
beyond the Lower 4G-Byte range.

Otherwise, one must choose 3DW header (32bit addressing) to construct MWr32 request for pushing data to TX2’s local
LPDDR4.

This conclusion has been tested in our board (FPGA <--------PCIE Gen2 x4--------> TX2)

BTW, thanks for pointing out the details for modifying the dts sources

Thanks for pointing to reference in spec

Hello, I’ve been working on a similar project. I connected my TX2 to a Xilinx Virtex-V7 through the PCIe x4 port, but there is no response to the command:
‘lspci -vv’
Should I root the ubuntu system or …
Thank you

Can you please confirm if the PCIe link is up or not? Would be great if you can share the boot log.

Dear PhoenixLee, Dear Vidyas,

we are at an early stage of this TX2 PCIe GEN2 x4 to Xilinx FPGA (Artix7) driver development work that you did about a year ago.

We have designed a carrier board for Jetson TX2 where the TX2 is connected to an Artix7 Xilinx FPGA over PCIe GEN2 x4.

“lspci” command lists the FPGA as a serial controller device:

nvidia@tegra-ubuntu:~$ lspci
00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1)
01:00.0 Serial controller: Xilinx Corporation Device 7024

However when we try to load Xilinx reference driver, it says “Error: The Kernel module installed correctly, but no devices were recognized.”

dmesg output is:

[ 2997.348703] tegradc 15210000.nvdisplay: hdmi: pclk:74250K, set prod-setting:prod_c_75M
[ 2997.408907] tegradc 15210000.nvdisplay: unblank
[ 3460.512839] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2017.1.47
[ 3460.520260] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, sgdma_timeout: 10 sec.
[ 3460.528930] xdma:xdma_device_open: xxx 0xffffffc0683ff700.
[ 3460.534521] xdma:xdma_device_open: xxx 0xffffffc0683ff700.
[ 3460.540096] xdma:xdma_device_open: xdma device 0000:01:00.0, 0xffffffc1deba6800.
[ 3460.547556] xdma:pci_check_extended_tag: 0xffffffc1deba6800 EXT_TAG disabled.
[ 3460.554807] xdma:pci_check_extended_tag: pdev 0xffffffc1deba6800, xdev 0xffffffc1a92ee000, config bar UNKNOWN.
[ 3460.567029] xdma:map_single_bar: BAR0 at 0x51000000 mapped at 0xffffff8011d80000, length=16777216(/16777216)
[ 3460.586990] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0020
[ 3460.597209] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
[ 3460.597289] xdma:map_single_bar: BAR1 at 0x50800000 mapped at 0xffffff80020c0000, length=65536(/65536)
[ 3460.617396] xdma:map_bars: Failed to detect XDMA config BAR
[ 3460.632305] pcieport 0000:00:01.0: device [10de:10e5] error status/mask=00004000/00000000
[ 3460.640694] pcieport 0000:00:01.0: [14] Completion Timeout (First)
[ 3460.647535] pcieport 0000:00:01.0: broadcast error_detected message
[ 3460.650568] xdma: probe of 0000:01:00.0 failed with error -22
[ 3460.659699] pcieport 0000:00:01.0: AER: Device recovery failed
[ 3460.665577] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0020
[ 3460.673653] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
[ 3460.685432] pcieport 0000:00:01.0: device [10de:10e5] error status/mask=00004000/00000000
[ 3460.693807] pcieport 0000:00:01.0: [14] Completion Timeout (First)
[ 3460.700637] pcieport 0000:00:01.0: broadcast error_detected message
[ 3460.706947] pcieport 0000:00:01.0: AER: Device recovery failed
[ 3460.712791] pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0020
[ 3460.721598] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
[ 3460.733338] pcieport 0000:00:01.0: device [10de:10e5] error status/mask=00004000/00000000
[ 3460.741694] pcieport 0000:00:01.0: [14] Completion Timeout (First)
[ 3460.748492] pcieport 0000:00:01.0: broadcast error_detected message
[ 3460.755099] pcieport 0000:00:01.0: AER: Device recovery failed

Seems like something goes wrong after the BAR reads (BAR0 & BAR1 seem to be read correctly, though there is also an error indication after the BAR0 read), and then the driver fails.

So Linux can detect pci device address and create BAR0 and BAR1 memory registers but then it fails.

What might we be doing wrong? Are there any ways to debug our driver?

Looking forward to hearing from you.

Kind regards.

Hakan

For completeness

this is what we receive when we type lspci -vv:

nvidia@tegra-ubuntu:~$ sudo lspci -vv
[sudo] password for nvidia:
00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1) (prog-if 00 [Normal decode])
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 388
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Memory behind bridge: 50800000-51ffffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Subsystem: NVIDIA Corporation Device 0000
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
Mapping Address Base: 00000000fee00000
Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd On, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet+ LinkState+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel driver in use: pcieport

01:00.0 Serial controller: Xilinx Corporation Device 7024 (prog-if 01 [16450])
Subsystem: Xilinx Corporation Device 0007
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 388
Region 0: Memory at 51000000 (32-bit, non-prefetchable)
Region 1: Memory at 50800000 (32-bit, non-prefetchable)
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00

Thank you for your help in advance

Hakan

Hi hsakman,
Have you fixed this problem?

Dear Phoenixlee,
I checked with our hardware engineers and the way to interpret the spec is

Memory Read, Memory Write, and AtomicOp Requests can use either format.
For Addresses below 4 GB, Requesters must use the 32-bit format. The behavior of the
Receiver is not specified if a 64-bit format request addressing below 4 GB (i.e., with the
upper 32 bits of address all 0) is received.

The TLP generation logic in endpoint’s PCIe controller should be able to generate 32-bit/64-bit format TLPs according to the target address in host’s system memory that it wants to access.
In software, AFAIK, there is no way to restrict memory allocations to come only from 64-bit regions (although it is possible to restrict them to the 32-bit region by setting a mask like pci_set_dma_mask(pdev, DMA_BIT_MASK(32))).
Setting pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) would mean that allocations can come either from the 32-bit region or from 64-bit regions, and the hardware should generate 32-bit format TLPs as long as the target address is a 32-bit address and 64-bit format TLPs when the target address is a 64-bit address.
I hope this clarifies.

Thanks for the clarification.
USA is a great country.
