Prevent PCIe From Reducing Gen?

The nvidia-smi documentation for GPU link information - current (pcie.link.gen.current) says:
“The current link generation and width. These may be reduced when the GPU is not in use.”

Is there a way to not reduce them? It seems to me that there may be some delay in switching generations. I tried setting them explicitly in the BIOS to always be Gen3. When I boot, they are set to Gen3, but a short time later they are reduced; after that they only return to Gen3 during active use. At the beginning of a transfer there seems to be some latency that I would like to avoid. I suspect some software (the NVIDIA driver or something else) is reducing the generation.

I have enabled persistence mode and the behavior is the same.

Any ideas on how to permanently set pcie.link.gen.current=3?
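
For anyone who wants to watch the downgrade happen, here is a minimal polling sketch using NVML (device index 0 and the build line are assumptions, adjust for your setup):

```cpp
// poll_pcie.cu - print the current PCIe link generation/width of GPU 0.
// Build (assumed, paths may differ): nvcc poll_pcie.cu -o poll_pcie -lnvidia-ml
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main()
{
    nvmlDevice_t dev;
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    // Print the link state once per second for a minute.
    for (int i = 0; i < 60; ++i) {
        unsigned int gen = 0, width = 0;
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
        printf("PCIe link: gen %u, x%u\n", gen, width);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}
```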

Downgrading the PCIe link when the GPU is not in use is an integral part of the GPU’s power management, and to my knowledge there are no user-accessible knobs for changing this behavior.

How much latency? How did you measure it? How does it compare to the general non-negligible latency of PCIe transactions?

The entire design philosophy of the GPU is to maximize throughput, not to minimize latency. In fact, in various places in the GPU design latency is sacrificed to maximize throughput.

You will get the best use of GPUs if you adapt your overall design to this reality. For example, in place of many small PCIe transfers try fewer, larger transfers.
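
As a rough illustration of this point (the sizes below are arbitrary examples, not taken from any particular workload): copying a given amount of data in one large cudaMemcpy typically gets much closer to peak link bandwidth than issuing the same data as thousands of small copies, because the fixed per-transfer overhead is paid once rather than thousands of times.

```cpp
// batching.cu - sketch comparing many small H2D copies vs one large copy.
// Sizes are arbitrary examples.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t chunk   = 4 << 10;        // 4 KB per small transfer
    const size_t nChunks = 4096;           // 4096 x 4 KB = 16 MB total
    const size_t total   = chunk * nChunks;

    char *h, *d;
    cudaMallocHost(&h, total);             // pinned host memory
    cudaMalloc(&d, total);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float msSmall, msLarge;

    // Many small transfers: the per-transfer overhead is paid nChunks times.
    cudaEventRecord(t0);
    for (size_t i = 0; i < nChunks; ++i)
        cudaMemcpy(d + i * chunk, h + i * chunk, chunk, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&msSmall, t0, t1);

    // One large transfer of the same total size.
    cudaEventRecord(t0);
    cudaMemcpy(d, h, total, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&msLarge, t0, t1);

    printf("4 KB chunks: %.3f ms, single 16 MB copy: %.3f ms\n", msSmall, msLarge);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```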

I understand that this is not possible for every use case. One case I am aware of is high-frequency trading; for such problems the GPU may not be the right solution. Latency-sensitive tasks are usually a better match for CPUs with high single-thread performance, which implies using the fastest-clocked CPU you can afford.

Truthfully, what I am trying to do is isolate that factor (the PCIe generation switch) as a source of slowdown in some edge cases.

It seems to be the cause because, across several short repetitive transfers, the measured speeds are roughly 3, 6, and 12 GB/s, which suggests the link upgrade is slow enough that it does not take effect until after some transfers have completed, but the downgrade is fast enough to kick in between some back-to-back transfers.

I’m aware that short repetitive transfers are also not the ideal case. This is more about studying the behavior and cost of these bad cases.

It seems like you have performed experiments. What additional overhead have you found from PCIe mode switching?

It should be easy enough to measure with a logic analyzer, using a configurable rate of transfers in your test application, so you can compare transfers in close temporal proximity (avoiding the link downgrade) with transfers spaced further apart (allowing power management to downgrade the link speed in between).
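
A purely software-side version of that experiment could look roughly like the following sketch (the buffer size and gap values are placeholders), which times the same transfer while varying the idle gap between transfers; if the link downgrade is involved, the average and worst-case times should rise once the gap is long enough for power management to step in:

```cpp
// gap_bench.cu - time repeated H2D copies with a configurable idle gap
// between them, to compare back-to-back vs spaced-out transfers.
// Buffer size and gap lengths are placeholders.
#include <cstdio>
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 2 << 20;          // 2 MB per transfer
    const int reps = 100;
    const int gaps_ms[] = {0, 1, 10, 100, 500};

    char *h, *d;
    cudaMallocHost(&h, bytes);             // pinned host memory
    cudaMalloc(&d, bytes);

    for (int gap : gaps_ms) {
        double sum = 0.0, worst = 0.0;
        for (int i = 0; i < reps; ++i) {
            std::this_thread::sleep_for(std::chrono::milliseconds(gap));
            auto start = std::chrono::steady_clock::now();
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // synchronous
            auto stop = std::chrono::steady_clock::now();
            double us = std::chrono::duration<double, std::micro>(stop - start).count();
            sum += us;
            if (us > worst) worst = us;
        }
        printf("gap %4d ms: avg %.1f us, worst %.1f us per 2 MB copy\n",
               gap, sum / reps, worst);
    }

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```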

My (somewhat vague) recollection is that a one-way PCIe transfer, with PCIe gen3 operating at full speed, has about 2 microseconds of basic latency. Or maybe it was more like 1 microsecond, I don’t recall to that level of detail.

What kind of numbers are you observing?

I probably did not describe the issue accurately.
To be clear, I think the generation switch has some “upgrade latency”, and that sometimes lets me get a complete transfer off before the link upgrade finishes. So the “unwanted latency” I’m referring to causes a bandwidth decrease when it overlaps a data transfer.

For a 2 MB transfer, my measurements fall into 4 buckets. The ideal case is 160 us, where I do NUMA pinning and the like; this is very predictable.

The other 3 buckets occur on NUMA nodes non-local to the PCIe connection, where I get approximately 200 us, 250 us, and 300 us. nvprof also reports these as (again approximately) 3, 6, and 12 GB/s. These are 10000 back-to-back transfers, and I repeat the experiment several times.
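
(The measurement loop has roughly the following shape; this is an illustrative sketch rather than my exact code, and the libnuma calls, node argument, and build line are assumptions.)

```cpp
// numa_copy.cu - time 10000 back-to-back 2 MB H2D copies with the host
// buffer placed on a chosen NUMA node. Illustrative sketch only.
// Build (assumed): nvcc numa_copy.cu -o numa_copy -lnuma
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <numa.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const size_t bytes = 2 << 20;              // 2 MB per transfer
    const int reps = 10000;
    const int node = (argc > 1) ? atoi(argv[1]) : 0;

    if (numa_available() < 0) { printf("no NUMA support\n"); return 1; }
    numa_run_on_node(node);                    // pin this thread to the node

    // Place the host buffer on the chosen node, touch it, then pin it for DMA.
    char *h = (char *)numa_alloc_onnode(bytes, node);
    memset(h, 0, bytes);
    cudaHostRegister(h, bytes, cudaHostRegisterDefault);

    char *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // warm-up

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    auto stop = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(stop - start).count() / reps;
    printf("node %d: %.1f us per 2 MB copy (%.2f GB/s)\n",
           node, us, bytes / us / 1e3);

    cudaHostUnregister(h);
    numa_free(h, bytes);
    cudaFree(d);
    return 0;
}
```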

What I would like to confirm is that this variance is tied to the PCIe link upgrade/downgrade (though the underlying fault is probably the DRAM/SMP interconnect and possibly some driver behavior).
This seems most likely based on the measured values. I have already verified that it is not tied to the association between the interrupt handler and the NUMA node.

One thing that seems very strange to me: per NUMA node, the average (3/6/12 GB/s) is consistent over many trials. However, disabling and re-enabling persistence mode causes each node’s average to swap to a different bucket.

The “non-local” part would seem to be the proximate cause of the timing differences you are seeing, not anything to do with PCIe mode switching. Is there any reason to believe that is not the case?

I don’t know the interconnect topology of your system, its hop latencies and the like.

You would want to have a lengthy, detailed discussion with a PCIe expert who has extensive experience in NUMA environments. I am not one.

What do you believe is the practical implication of your observations for the performance of your system in your (unspecified) use case? More than 5% in terms of application-level performance?

The “non-local” part was my original suspicion for the direct cause. I still believe it is related, but not that it is the direct cause, because I get deterministic performance until driver persistence is toggled, and empirically every node is capable of achieving every level of performance. I chose to investigate this particular direction because the performance buckets are very precise and seem to fit a three-tier pattern.

I am also interested in understanding how UVM paging seems to be able to avoid this cost (though I have collected less data on that side). If UVM did pay this cost, then yes, it would have quite a significant impact on application-level performance (2x or more).

Turning on driver persistence is a best practice; I don’t understand why you want to toggle it. These days persistence is implemented by means of a persistence daemon.

It seems you may be on some sort of reverse-engineering quest, although it is not clear what the ultimate goal of that is. It would not be realistic to expect vendor support for efforts to reverse engineer the vendor’s technology.

I toggle persistence mode explicitly as a minimally invasive way to demonstrate that there is some symptomatic performance variability. It also happens automatically in other cases, for example after a system reboot…

Reverse engineering isn’t really the goal, and it would be the hard way to do anything associated with UVM anyway; the UVM driver is open source and I can (and possibly will) dig through it. My end goal is just to see whether I can replicate the stable data-transfer rate I see with UVM through a userspace policy, or else find the mechanism (e.g. PCIe link downgrade) that prevents it.
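
For context, the comparison I have in mind is roughly the following, a simplified sketch rather than my actual benchmark (the sizes and device index are placeholders): explicit copies over PCIe on one hand, and letting UVM migrate the same managed allocation on the other.

```cpp
// uvm_vs_memcpy.cu - rough shape of the comparison: an explicit H2D copy
// vs managed memory moved by prefetch/on-demand paging. Simplified sketch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *p, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;            // forces the pages to be resident on the GPU
}

int main()
{
    const size_t bytes = 2 << 20;    // 2 MB

    // Explicit path: pinned host buffer, cudaMemcpy over PCIe.
    char *h, *d;
    cudaMallocHost(&h, bytes);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // UVM path: managed allocation, migrated by prefetch or on-demand paging.
    char *m;
    cudaMallocManaged(&m, bytes);
    cudaMemPrefetchAsync(m, bytes, 0 /* device 0 */, 0 /* default stream */);
    touch<<<(unsigned)((bytes + 255) / 256), 256>>>(m, bytes);
    cudaDeviceSynchronize();

    cudaFree(m);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```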

To provide an answer for anyone who stumbles upon this:

Your system may have ASPM (Active State Power Management) enabled in the BIOS, or software that is managing it.

However, the link also seems to automatically downgrade from the device side when the device is downclocked. You can use nvidia-smi to set a higher clock rate if you want to take measurements without dealing with the PCIe downgrade.
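
If you would rather do that programmatically than through nvidia-smi, NVML exposes a clock-locking call; the sketch below assumes root privileges, device index 0, a GPU and driver recent enough to support clock locking, and placeholder clock values:

```cpp
// lock_clocks.cu - hold the GPU core clock in a fixed range via NVML so
// power management does not downclock it between transfers.
// Requires root; the 1000-1500 MHz range is a placeholder.
// Build (assumed): nvcc lock_clocks.cu -o lock_clocks -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlDevice_t dev;
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    // Pin the graphics clock into a range (MHz); pick values your GPU supports.
    nvmlReturn_t r = nvmlDeviceSetGpuLockedClocks(dev, 1000, 1500);
    printf("lock clocks: %s\n", nvmlErrorString(r));

    // ... run measurements here ...

    nvmlDeviceResetGpuLockedClocks(dev);       // restore default behavior
    nvmlShutdown();
    return 0;
}
```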

This was not my issue, though. Strangely, fixing the clock/PCIe link generation did not resolve it, but running a kernel in the background during the transfer did. This explains why UVM performs well here (UVM device-side access implies a kernel is running). The mechanism that causes this behavior is beyond me, though.
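
A rough sketch of that workaround, assuming a trivial spin kernel on a side stream is enough to keep the GPU out of its low-power state (the cycle count and buffer size are placeholders):

```cpp
// busy_copy.cu - keep a trivial kernel running on a side stream while the
// H2D copy runs on another stream. Cycle count and size are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }     // busy-wait on the GPU
}

int main()
{
    const size_t bytes = 2 << 20;              // 2 MB
    char *h, *d;
    cudaMallocHost(&h, bytes);                 // pinned, so the copy can overlap
    cudaMalloc(&d, bytes);

    cudaStream_t copyStream, busyStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&busyStream);

    // Keep the GPU busy for a while (duration is clock-rate dependent).
    spin<<<1, 1, 0, busyStream>>>(10000000LL);

    // Issue the transfer while the spin kernel is still running.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, copyStream);
    cudaStreamSynchronize(copyStream);

    cudaDeviceSynchronize();
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(busyStream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```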

That is because both of those things (downclocking and the PCIe link downgrade) are measures to reduce GPU power consumption when no CUDA kernels are running. The GPU’s own power management reduces the core clock, the memory clock, and the core voltage, and downgrades the PCIe link.