I tested the bandwidth of pinned memory and pageable memory on an RTX 4090 (Intel Core i9 13900K, 4× DDR5 RAM) and an RTX 4060 (Intel Core i7 12700K, 2× DDR5 RAM). The pinned-memory results look fine, but why is the RTX 4090's pageable-memory bandwidth so much slower than the RTX 4060's?
H2D copies from pageable memory include a single-threaded copy between two buffers in host memory. The reported H2D speed cannot be faster than the intermediate copy. Since you test on two different computers, the host memory can make the difference.
On the 4060's computer, pinned memory runs at about 13 GB/s and pageable at about 11 GB/s.
On the 4090's computer, pinned memory runs at about 24 GB/s, but pageable at only about 6 GB/s.
With pinned memory, the RTX 4090 is twice as fast as the RTX 4060; with pageable memory, however, it is only half as fast.
What confuses me is that the 4090's computer has a better CPU and RAM than the 4060's computer, so the 4090's speed shouldn't be slower.
Could you share some possible reasons why host memory is relatively slow?
You could create pinned host memory and manually copy the pageable memory to those buffers first.
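As a rough sketch of that manual staging approach (the buffer names, the 4 MiB chunk size, and the single-buffer loop are illustrative choices, not the driver's actual implementation; error checking omitted):

```cuda
// Sketch: stage a pageable host buffer through a pinned buffer by hand.
#include <cuda_runtime.h>
#include <cstring>

void staged_h2d(void* d_dst, const void* h_pageable, size_t bytes, cudaStream_t stream)
{
    const size_t CHUNK = 4 << 20;            // 4 MiB staging chunk (tunable)
    void* h_pinned = nullptr;
    cudaHostAlloc(&h_pinned, CHUNK, cudaHostAllocDefault);

    const char* src = static_cast<const char*>(h_pageable);
    char* dst = static_cast<char*>(d_dst);
    for (size_t off = 0; off < bytes; off += CHUNK) {
        size_t n = (bytes - off < CHUNK) ? (bytes - off) : CHUNK;
        std::memcpy(h_pinned, src + off, n); // the host-side copy you now control
        cudaMemcpyAsync(dst + off, h_pinned, n, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);       // simple version: no double-buffering
    }
    cudaFreeHost(h_pinned);
}
```

The driver does something similar internally for pageable copies; doing it by hand lets you choose which thread runs the `memcpy` and time it separately from the DMA.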
Perhaps this gives an idea of why it is slower on your 4090. I don't know the cause; it could also be related to buffer sizes. Perhaps the manual approach will restore 4060-level speed on the 4090.
The throughput data using pageable memory from the machine with the RTX 4090 looks implausibly low. Question for OP: The output from NVIDIA’s example app shows the GPU as an RTX 4090 D. What is the difference between that and a regular RTX 4090?
On modern PCs, the system memory bandwidth tends to exceed PCIe bandwidth by factors, and thus host/device transfers involving pageable system memory should be not much slower than host/device transfers using pinned system memory, despite the second sysmem->sysmem copy involved when pageable system memory is used.
Both host systems here seem to have CPUs with two DDR5 channels per the TechPowerUp database (I am not sure why the tool shows 4 channels). They use different speed grades of DDR5, but system memory throughput should be in the 75 to 100 GB/sec range either way, significantly greater than the PCIe4 x16 uni-directional bandwidth of about 26 GB/sec.
If these were my machines, I would run a system memory bandwidth test to see what that shows. If that confirms low system memory throughput for the machine with the RTX 4090 and this system was manually tweaked, I would suggest returning to the factory defaults. I do not have hands-on experience with these particular CPUs, but my (somewhat vague) understanding is that DDR5 will re-negotiate the link to the CPU's memory controller when the timing gets off. Conceivably, hand-tweaking CPU settings could lead to marginal link quality requiring frequent re-negotiation.
To the best of my knowledge, CPU-Z lists the memory frequency as 1/2 of the memory's rated transfer rate, since DDR stands for Double Data Rate. So on the 4060-equipped machine, a DRAM frequency of 2793.2 MHz suggests the memory is running as DDR5-5600. Note: the real memory could be faster memory set to run slower than it could, or slower memory set to run faster than its stated spec.
It appears to me that there is something wrong on the 4090-equipped machine. A DRAM frequency of 1995.1 MHz would have the memory running as DDR5-4000. As far as I know, the slowest DDR5 memory ever sold is DDR5-4800, so I think there is something very wrong with its memory setup, and it will be much slower, about 30% slower.
The CAS latency and related figures are basically about how quickly you get back the data you asked for. 30 is better than 40, but it won't make as much difference as the frequency.
Incorrectly setting memory specs is potentially destructive so make sure you or whoever is going to help you fix this knows what the implications are, and how to do it correctly. In theory default settings (however you get them for your 4090 system) should normally be safe and should be better than what you have.
If you can find out what modules were actually purchased it might get you somewhere to start.
From the test, it can be seen that the issue is indeed caused by host-memory copying. Let me check the hardware and its settings first; hopefully I can find the reason.
Just a thought: how many memory slots does the 4090 system's motherboard have, and how many memory modules do you have? If the answer is that there are more slots than modules, then:
Are the modules all the same? If not it will be complicated.
Do you have them in the recommended slots specified in the motherboard documentation?
The throughput numbers for the system memory reported by AIDA64 look as expected for a dual-channel configuration with the two speed grades of DDR5 used (DDR5-4000, DDR5-5600). DDR5-4000 is definitely the slowest defined (entry-level) speed grade of DDR5 memory, so not a good fit for this high-end machine.
Note that my earlier estimate of 75 - 100 GB/sec throughput was off to the high side due to an error in my mental back-of-the-envelope calculations; on review, the measured numbers from AIDA64 look entirely as expected.
However, even if we account for the rather slow system memory in the machine with the RTX 4090, that does not get us down to the ~6 GB/sec for host/device transfers reported in the original post. Instead, it should drop us from the 26 GB/sec seen with pinned system memory down to about 17 GB/sec with pageable system memory.
As other posters have pointed out, the host’s system memory might be misconfigured in some way that creates an outsized performance impact, but I cannot off-hand think of a specific error scenario. My recommended course of action would be to install some DDR5-5600 memory in the machine with the RTX 4090, following the guidance in the motherboard or system manual and use SBIOS default settings for the memory without any manual tweaking. One might also want to check whether a newer SBIOS version is available. At least for the brand-name machines I use, new SBIOS versions are released up to four or five years after date of purchase.
Pageable transfers are expected to vary across systems.
Pageable H2D/D2H involves an extra host-side copy through a temporary pinned buffer, so effective bandwidth is often limited by host DRAM bandwidth and configuration, not PCIe or the GPU.
If the two systems have different memory bandwidth (frequency, channels, BIOS/XMP), large differences in pageable results are normal.
Pinned memory is the appropriate metric for cross-system comparison.
I tested in shmoo mode. The speed ratio of the two computers is close to the ratio of their memory frequencies, but both computers show a significant gap between the measured copy speed and the theoretical value.
The 4090's computer has 4× DDR5-5600, but because the motherboard is dual-channel (two DIMMs per channel), the memory modules are downclocked to 4000 MT/s. I have also tried changing the BIOS settings of this industrial computer; the speed difference is not significant.
The 4060's computer has 2× DDR5-5600, so it can keep the 5600 MT/s rate.
The speeds of the two computers match the frequency ratio: 4090/4060 = 4000/5600 MT/s ≈ 6.0/8.0 GB/s.
That does not sound right. Two DIMMs per DRAM channel is an entirely conventional arrangement. Usually, the only drawback of populating both DIMM slots is that one cannot use a 1T command rate. That is only possible with one DIMM per channel, otherwise a 2T command rate has to be used instead. I know of no reason why using 4 DDR5-5600 DIMMs in a dual-channel system would prompt an SBIOS to drop to the lowest speed. I suspect there are other factors at play.
(1) Are all four DIMMs the exact same model (same SKU from the same vendor, ideally acquired at the same time)? Mixing different DIMM types or DIMMs from different vendors with supposedly identical specifications is not advised. System or motherboard documentation typically provides guidelines on the DIMMs and DIMM configurations that can be used (capacity, speed grades, organization, number of ranks, buffered / registered / ECC).
(2) Have you tried removing the DIMMs currently installed and replacing them with “known good” DIMMs? I wonder whether the DIMMs or the DIMM slots may be damaged or dirty, and a DIMM swap coupled with a visual inspection at the time of the swap should establish whether that is or is not the case.
(3) You could try switching to one DIMM per channel for a temporary experiment to confirm or refute the hypothesis that two DIMMs per channel are problematic on this platform.
It is already set to 2T mode. The memory modules are all from Samsung, with identical model numbers and production dates. I also tried removing two of them and testing with just two installed, but there was no noticeable difference in speed.
I think the problem comes from the CPU cores. If I disable the E-cores in the BIOS, the test speed reaches 11 GB/s; if I don't, the system scheduler runs the program on E-cores, and the speed drops to 6 GB/s.
You could now try setting the CPU affinity of the program to the performance cores only (Linux and Windows), or completely turn the E-cores off at runtime (Linux).