How do I set the GPU max payload size? Please help me.

The max payload size (packet size) is the lower of the max payload size supported by the root complex (i.e. the motherboard) and the max payload size supported by the endpoint (i.e. the GPU). You can inspect these values directly using lspci on Linux. On the particular Dell workstation (T3500) that I happened to look at, the (root complex) max payload size was not a BIOS-adjustable option (although it may be on some motherboards). Using lspci -vvvx, I could see that the max payload size supported by the root complex was 256 bytes, whereas the max supported by the GPU was 128 bytes, and so 128 bytes was the configured value.
The choice of 128 made by the GPU is probably a compromise. If there is a mix of large and small packets, choosing a very large size (like 4096, the max supported by PCIe) would benefit the large transfers but could otherwise “penalize” short-message PCIe traffic.
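If you want to see exactly where lspci gets those numbers, the sketch below (my own illustration, not anything NVIDIA-specific) walks a device's standard capability list and decodes the two MaxPayload fields. The device address 0000:03:00.0 is a placeholder, and reading the full config space through sysfs typically requires root:

```cpp
// Sketch: decode the PCIe Max_Payload_Size fields for one device.
// Assumptions (mine, not from the thread): Linux sysfs, a placeholder
// device address of 0000:03:00.0, a little-endian host, and root
// privileges so the full 256-byte config header is readable.
#include <cstdio>
#include <cstdint>
#include <cstring>

int main() {
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/config"; // placeholder BDF
    std::FILE *f = std::fopen(path, "rb");
    if (!f) { std::perror("fopen"); return 1; }

    uint8_t cfg[256] = {0};
    size_t n = std::fread(cfg, 1, sizeof(cfg), f);
    std::fclose(f);

    // Walk the standard capability list; the first pointer is at offset 0x34.
    uint8_t ptr = (n > 0x34) ? uint8_t(cfg[0x34] & 0xFC) : 0;
    while (ptr && ptr + 0x0A <= n) {
        if (cfg[ptr] == 0x10) {   // capability ID 0x10 = PCI Express
            uint32_t devcap; std::memcpy(&devcap, &cfg[ptr + 0x04], 4);
            uint16_t devctl; std::memcpy(&devctl, &cfg[ptr + 0x08], 2);
            // Encoding: 0 -> 128 B, 1 -> 256 B, ... 5 -> 4096 B
            std::printf("MaxPayload supported (DevCap):  %d bytes\n", 128 << (devcap & 0x7));
            std::printf("MaxPayload configured (DevCtl): %d bytes\n", 128 << ((devctl >> 5) & 0x7));
            return 0;
        }
        ptr = uint8_t(cfg[ptr + 1] & 0xFC);   // next capability pointer
    }
    std::fprintf(stderr, "PCI Express capability not found (read may be truncated; try root)\n");
    return 1;
}
```

The three-bit encoding is defined by the PCIe spec (0 maps to 128 bytes, up through 5 for 4096 bytes), which is why the configured value in DevCtl is capped by what DevCap advertises.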

We are using a Tesla P4 and want to test the best achievable bandwidth of a single P4, but the 128-byte max payload size (we want it to be 256 bytes) lowers efficiency. I tried setting the Device Control register, which controls PCI Express device-specific parameters, but it does not work. Please help, thanks very much.

To the best of my knowledge, the maximum PCIe payload size of the GPU is not user configurable. Why are you attempting to change this value? What real-world problem are you trying to address?

You can achieve just slightly over 12 GB/sec full duplex across PCIe gen3 x16 if you send the data in large enough blocks. With NVIDIA GPUs, this maximum achievable throughput rate is typically reached when transfer size is in the 8 MB to 16 MB region.
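In case it is useful, here is a rough sketch of the kind of measurement I mean (just an illustration; it assumes device 0, pinned host memory, and omits error checking for brevity). Sweeping the transfer size makes it easy to see where the throughput plateaus:

```cpp
// Rough bandwidth sweep sketch (assumptions: device 0, CUDA runtime available,
// error checking omitted for brevity). Pinned host memory is what lets the
// copies approach the PCIe limit.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t maxBytes = size_t(32) << 20;   // sweep up to 32 MB
    const int reps = 20;

    void *hbuf = nullptr, *dbuf = nullptr;
    cudaMallocHost(&hbuf, maxBytes);            // pinned (page-locked) host buffer
    cudaMalloc(&dbuf, maxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t bytes = size_t(256) << 10; bytes <= maxBytes; bytes <<= 1) {
        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpyAsync(dbuf, hbuf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbPerSec = double(bytes) * reps / (ms / 1000.0) / 1e9;
        std::printf("HtoD %8zu bytes: %6.2f GB/s\n", bytes, gbPerSec);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dbuf);
    cudaFreeHost(hbuf);
    return 0;
}
```

A device-to-host sweep is the same loop with the copy direction reversed, and overlapping the two directions on separate streams is how you would approach the full-duplex number.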

Firstly, thanks very much for your reply. We are using CPU + FPGA + GPU to access a remote GPU (on another board), and we want the best bandwidth. The max payload size is just 128 bytes, which wastes a noticeable fraction of the link; if we could set it to 256 bytes, the bandwidth would be larger. Can we change the driver to set it?

As I recall, the PCIe packet header is 16 bytes, so increasing the packet payload from 128 to 256 bytes would theoretically provide 5.9% more bandwidth. I have a hard time imagining that this represents a make-or-break scenario for your use case.
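For reference, the arithmetic behind that figure (assuming 16 bytes of TLP overhead per packet and ignoring other link-level overheads):

128 / (128 + 16) ≈ 0.889 payload efficiency with 128-byte packets
256 / (256 + 16) ≈ 0.941 payload efficiency with 256-byte packets
0.941 / 0.889 ≈ 1.059, i.e. roughly 5.9% more usable bandwidth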

I am reasonably certain (it has been quite a few years since I last dealt with this in detail) that the maximum PCIe packet size of the GPU is simply a function of the hardware and cannot be changed. The value could theoretically be different for different GPU architectures, so you might want to check the latest GPUs available to see whether it has been increased. I don’t think it has, but I can’t be sure.

Are you writing your own GPUDirect driver for the FPGA to facilitate direct DMA transfers between GPU and FPGA?

No, we just pass the GPU DMA traffic through, thanks; we access a remote GPU on another board. So each packet costs 128 bytes (payload) + 16 bytes (header) + 16 bytes (CRC) + 24 bytes (MAC). Another question we have: right now (PCIe gen3 x8) we measure a host-to-device bandwidth of 5.5 GB/s and a device-to-host bandwidth of 6 GB/s. We see that HtoD is smaller; can we optimize it? We also see that about every 20 us the bridge applies back pressure for about 160 ns. Can you help me?

From observation, I know that maximum HtoD and DtoH rates sometimes differ a bit, just as you are observing. I do not know why that is; it seems to vary with GPU and host platform. Maybe it is a function of different amounts of buffering at the two endpoints? Are you using a modern high-end GPU (e.g. GTX 1080 Ti) in your tests?

BTW, your transfer rates look a little bit lower than I would expect for a x8 interface (around 6.3e9 bytes/second). That may be due to transfer sizes not being optimal?

First of all, thank you very much. My company is trying to use an FPGA to access a remote GPU, so we want the best performance. In our tests we found two issues affecting bandwidth: one is the 128-byte payload, the other is the GPU tags. We obtained the GPU tag information by capturing the waveform. The GPU has 256 tags, but in fact it uses only 128. Can we manually set the actual number of tags used?

I don’t know what these tags are that are being referenced. Presumably some low-level PCIe mechanism (e.g. part of a credit scheme?). How do you know that the “GPU has 256 tags”?

As its name indicates, this forum is for CUDA programming questions. As a consequence, 99% of the readers of this forum are probably software folks. It seems you need to get in touch with someone knowledgeable about GPU hardware, and the GPU’s PCIe interface in particular. Have you considered contacting your nearest NVIDIA office and requesting to be put in touch with a field application engineer?

OK, thanks.
By analyzing PCIe TLP packets, we know the tag field uses 8 bits in total, but at most 128 tags are in flight. We are in mainland China and are currently in the verification stage; we have only bought a small number of Tesla cards, and technical support is very slow, so we could not wait and came here to ask. Are there any hardware-related forums where we can go for help?

I’m encountering the same issue on my RTX 4090 where the MPS can only be set to 128. I’ve tried various configurations but haven’t been able to increase it to 256. Is there a way to adjust this setting or any specific steps I should follow to achieve this?

I am not aware that users can set the PCIe maximum payload size (MPS) used by the GPU. My understanding is that this value is fixed for a given GPU. What makes you think that this is a user-configurable parameter for NVIDIA GPUs?

  1. The PCIe capability provided by the RTX 4090 supports MPS=256, so I assumed it could be configured. However, after configuring it and analyzing the packet capture, it did not take effect.
  2. Referring to this article: PCIe Maximum Payload Size only 128bytes - #3 by sumin.lee, it seems that on Jetson, MPS can be configured, so I thought it should be possible to configure it on other NVIDIA GPUs as well.
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
        MaxPayload 256 bytes, MaxReadReq 512 bytes

In my understanding, the MPS in DevCap is a fixed value for a given GPU, while MaxPayload in DevCtl is configurable but cannot exceed the value above. Usually, users can set pci=pcie_bus_perf or pci=pcie_bus_peer2peer in the bootargs to modify the MPS.
The problem we encountered is that when we set MPS to 256 B (pci=pcie_bus_perf), the maximum packet size we captured via pcieprotocolsuite was still 128 B. Is there any way to make the packet size match the MPS? We want to see whether we can get better performance when MPS is set to 256. Can you give me some advice?

I can’t. This subforum deals with CUDA programming, which means it is software-centric. You, on the other hand, are looking for help with GPU hardware configuration. I looked through the list of available subforums on this site and could not find one that looks like a good match for your question.