Why do 2*RTX 2080 Ti run slower than 2*Tesla P100?

I wrote a CUDA program that uses unified memory to run on two graphics cards. When I run it on 2*P100, it takes 113 s and the load on each card is 97%. But when I run it on 2*2080Ti, it is very slow, and the load on the cards fluctuates between 35% and 100%. I don’t use NVLink. I don’t know what causes such a large difference in efficiency. Is it because of the data exchange between the graphics cards? Both setups are PCIe x16, and the 2*P100 also does not use a bridge.

Are these observations from an apples-to-apples comparison, in which only the GPUs are exchanged and the rest of the system (hardware and software) remained unchanged?

If you compare the analysis of the CUDA profiler for the application’s execution on these two configurations, are there any noticeable differences?

Without knowing what your code is doing that’s hard to answer. The P100s have much faster memory, much faster double precision units, and much better unified memory support. Any one of these things could make a difference.

Not sure what you mean by "much faster memory". I think we’re talking about a peak theoretical bandwidth of 732GB/s (P100) vs. 616GB/s (2080Ti), and I suspect that if you did a “real world” comparison using e.g. bandwidthTest, the reported numbers would be closer than that. The 2080Ti seems to clock in at about 530GB/s on bandwidthTest, whereas my P100 clocks in right at 500GB/s.
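As a rough illustration of where the peak figures above come from, here is a minimal sketch that computes theoretical bandwidth as 2 x memory clock x bus width. The helper name `peak_gbs` is made up for this example, and the clock and bus-width reference values (715 MHz / 4096-bit for the 16GB P100, 3072-bit for the 12GB model, 7000 MHz / 352-bit for the 2080 Ti) are the commonly published ones, so treat them as assumptions and check them against what your own devices report.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).
// The factor 2 accounts for double-data-rate signalling.
double peak_gbs(int mem_clock_khz, int bus_width_bits) {
    return 2.0 * mem_clock_khz * 1e3 * (bus_width_bits / 8.0) / 1e9;
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) n = 0;

    // Values reported by the installed devices, if any.
    for (int d = 0; d < n; ++d) {
        int clock_khz = 0, bus_bits = 0;
        cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, d);
        cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, d);
        printf("GPU %d: %d kHz, %d-bit bus, peak %.1f GB/s\n",
               d, clock_khz, bus_bits, peak_gbs(clock_khz, bus_bits));
    }

    // Reference values (assumed, see the note above).
    printf("P100 16GB (assumed): %.1f GB/s\n", peak_gbs(715000, 4096));
    printf("P100 12GB (assumed): %.1f GB/s\n", peak_gbs(715000, 3072));
    printf("2080 Ti   (assumed): %.1f GB/s\n", peak_gbs(7000000, 352));
    return 0;
}
```

These are upper bounds only; a measured copy bandwidth will always come in lower.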

Not sure what you mean by that.

I can almost always get P100s to clock in at more than 680 GB/s, although thinking about it, I have access to the 16GB version, which does make a difference since the 12GB model tops out at 3/4 of that speed. From what I’ve seen, HBM seems to have slightly better latency on strided accesses.

As for the unified memory, it may have to do with CUDA versions. The machines that I have access to with 2080s have a version where stuff crashes if you’re not careful with unified memory.

Although like I said, we need code to understand why one is slower than another.

I’ve never witnessed anything close to that. If you can provide a test code that demonstrates higher than ~550GB/s on a P100 I would certainly like to see it.

680GB/s is achievable on a V100 (I measure about 730GB/s with bandwidthTest).

I believe the code I used was this gpu memory benchmark. It’s meant for caches, but I recall running it last year for some other work and getting 600+GB/s out of main memory.


That measures on-chip memory bandwidth. It is not the same as the bandwidth to DRAM.

My understanding was that at large enough sizes all the on-chip caches would miss and everything would be fetched from DRAM. Though if that’s wrong, it’s good to know.

That’s true. And if you look at the sample output from that benchmark, you’ll see that at the “large enough” sizes the reported benchmark score is much lower. So if you’re looking at the top of that table (~1300GB/s) that is a proxy for cache bandwidth. If you look at the bottom of that table (~150GB/s) that is getting closer to what bandwidthTest would report for a GTX480. It is still getting some benefit from the caches. You need to go to many megabytes of transferred data to minimize cache benefit, for that sort of test. At 5MB data size, that test is still getting some benefit from cache, particularly due to the design of the test.

Anyway, using that test to estimate main memory bandwidth will be difficult. I don’t believe it supports any claims about main memory bandwidth, as that is not its stated purpose, or design. This would especially be the case with newer GPUs. A 5MB test would mostly fit into the 4MB L2 cache on P100. So it does not fit your idea of a “large enough size”.

From a design perspective, the way to write a main memory bandwidth test is to read from one location and write to another. Assuming there is no repetitive character, this should not receive much benefit from caches. OTOH, if you want to write a test that attempts to measure cache behavior, you are most likely reading and writing the same locations. The design of the test is fundamentally different. Since the test you linked states quite clearly it is attempting to do cache measurement, from a design perspective it would not be a good choice to use to estimate main memory bandwidth.
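The "read from one location and write to another" design described above could be sketched roughly as follows. This is not the bandwidthTest source, just a minimal illustration under assumptions: the buffer size (256 MiB each, far larger than any L2 cache discussed here), the launch configuration, and the names `copy_kernel` and `effective_gbs` are all choices made for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Streaming copy: every element is read once from src and written once to dst,
// so there is no reuse for the caches to exploit.
__global__ void copy_kernel(const float4* __restrict__ src,
                            float4* __restrict__ dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) dst[i] = src[i];
}

// Effective bandwidth in GB/s: each buffer byte is read once and written once.
double effective_gbs(size_t buffer_bytes, float ms) {
    return 2.0 * buffer_bytes / (ms * 1e-3) / 1e9;
}

int main() {
    const size_t bytes = 256u << 20;            // 256 MiB per buffer (assumed size)
    const size_t n = bytes / sizeof(float4);
    float4 *src = nullptr, *dst = nullptr;
    if (cudaMalloc(&src, bytes) != cudaSuccess ||
        cudaMalloc(&dst, bytes) != cudaSuccess) {
        printf("allocation failed or no CUDA device available\n");
        cudaFree(src);
        return 1;
    }
    cudaMemset(src, 0, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    copy_kernel<<<1024, 256>>>(src, dst, n);    // warm-up launch, not timed
    const int reps = 10;
    cudaEventRecord(t0);
    for (int r = 0; r < reps; ++r) copy_kernel<<<1024, 256>>>(src, dst, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("device copy bandwidth: %.1f GB/s\n", effective_gbs(bytes, ms / reps));

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Because source and destination are distinct and each is much larger than L2, the reported number should track DRAM bandwidth rather than cache bandwidth.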

My English is not very good, so I am sorry that I could not explain some things clearly before. I will try to make things clear. I am doing computational fluid dynamics, and I wrote a program for it. When I use a single graphics card, the 2080Ti is much faster than the P100, and the load on the graphics card is 100%. When I rewrote the program with unified memory to run on two cards, the speed nearly doubled on two P100s, and the load on both cards was stable at 97%. But when I used two 2080Ti, the speed became very slow and the load fluctuated between 37% and 100%. There is data exchange between the two cards in the middle of the program. My environment is VS2013 + CUDA 10. I used Intel Xeon processors for the comparison, and all slots are PCIe x16.

Does the P100 use the TCC driver on Windows whereas the RTX 2080Ti uses a WDDM driver? That could explain the performance differences.

This site explains the difference between the two driver models.

If your code is not too Windows specific, you could attempt to repeat your comparison on Linux.

Did you compile the code to be specific to each architecture? Are the configurations of the machines the same other than the GPUs?
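The driver-model and architecture questions above can be checked from inside a program. The following is a minimal sketch using `cudaDeviceGetAttribute`; the helper `sm_number` is a made-up name for this example, and it simply maps a compute capability to the number used in an `nvcc -arch=sm_XX` flag (60 for the P100, 75 for the 2080 Ti).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compute capability 6.0 -> 60, 7.5 -> 75, i.e. the XX in -arch=sm_XX.
int sm_number(int major, int minor) { return major * 10 + minor; }

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("no CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < n; ++d) {
        int major = 0, minor = 0, tcc = 0, managed = 0, concurrent = 0;
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, d);
        cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, d);
        cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, d);
        cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, d);
        cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, d);
        printf("GPU %d: sm_%d, driver model: %s, managed memory: %d, "
               "concurrent managed access: %d\n",
               d, sm_number(major, minor),
               tcc ? "TCC" : "not TCC (WDDM on Windows)",
               managed, concurrent);
    }
    return 0;
}
```

If the two machines report different driver models or different managed-memory capabilities, that is one more variable in the comparison besides the GPU type.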

There are still quite a few variables in your configuration.

What kind of work does your CFD code do? How much faster is the single 2080 Ti run over the P100?

I can think of a few cases where the 2080 would be faster, but those differences should be amplified with two cards.

Have you verified that the cards are actually running at PCIe 3 16x? On some motherboards one of the slots shares its PCIe lanes. One of my current machines will drop the slot to 8x or 4x if there’s another PCIe card present. You may want to verify that you can transfer data to each card at full speed.
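One way to verify the transfer speed to each card is to time a large copy from pinned host memory, along the lines of the sketch below. The buffer size, repetition count, and the helper name `transfer_gbs` are assumptions for illustration. As a rough guide, PCIe 3.0 x16 has a theoretical limit of about 15.75 GB/s per direction, so a card that measures around half of what the other one reaches is probably on an x8 link.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One-directional transfer rate in GB/s (1 GB = 1e9 bytes).
double transfer_gbs(size_t bytes, float ms) {
    return bytes / (ms * 1e-3) / 1e9;
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) n = 0;
    const size_t bytes = 256u << 20;   // 256 MiB per copy (assumed size)
    const int reps = 10;

    for (int d = 0; d < n; ++d) {
        cudaSetDevice(d);
        void *host = nullptr, *dev = nullptr;
        // Pinned host memory is needed to reach full PCIe speed.
        if (cudaMallocHost(&host, bytes) == cudaSuccess &&
            cudaMalloc(&dev, bytes) == cudaSuccess) {
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);

            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // warm-up
            cudaEventRecord(t0);
            for (int r = 0; r < reps; ++r)
                cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("GPU %d: host-to-device %.1f GB/s\n",
                   d, transfer_gbs(bytes, ms / reps));

            cudaEventDestroy(t0);
            cudaEventDestroy(t1);
        } else {
            printf("GPU %d: allocation failed\n", d);
        }
        cudaFree(dev);
        cudaFreeHost(host);
    }
    return 0;
}
```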

The two P100s are in a Dell PowerEdge R740 server with dual Intel Xeon Gold 5117 CPUs. The two 2080Tis are in a Dell PowerEdge T630 server with dual E5-2660 CPUs. Both are server boards, and both cards are on PCIe x16.

My program simulates multiphase flow with the Lattice Boltzmann Method. The single 2080ti is 2-3 times faster than the single P100.

I have now purchased an NVLink bridge, and I want to switch to it to try SLI.

Since you’re using unified memory with multi-GPU, one possibility to consider is whether or not both cards in each case are capable of being in a P2P relationship with each other.

With unified memory, and multiple GPUs, when 2 cards cannot be placed in a P2P relationship with each other, then unified memory allocations will “convert” to pinned memory allocations. In general this will slow things down quite a bit.

Unless you have NVLink bridges installed, the RTX2080Ti will not support P2P in any scenario.

You can confirm this by running the simpleP2P or deviceQuery CUDA sample codes.

If your dual P100 setup shows P2P capability between the 2 GPUs, this could explain a significant slowdown (going to the RTX2080Ti setup) in a multi-GPU unified memory setup.
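The P2P check that the simpleP2P and deviceQuery samples perform boils down to a call to `cudaDeviceCanAccessPeer` for each pair of devices. A minimal sketch, where the function name `count_p2p_pairs` is made up for this example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the peer-access capability for every ordered pair of GPUs and
// returns how many pairs report that peer access is possible.
int count_p2p_pairs() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return 0;
    int pairs = 0;
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            if (a == b) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, a, b);
            printf("GPU %d -> GPU %d: P2P %s\n",
                   a, b, can_access ? "supported" : "NOT supported");
            if (can_access) ++pairs;
        }
    }
    return pairs;
}

int main() {
    int pairs = count_p2p_pairs();
    printf("%d ordered pair(s) support peer access\n", pairs);
    return 0;
}
```

Running this on both machines shows directly whether the P100 pair has a P2P relationship that the 2080Ti pair lacks.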

Taking a look at the specifications, you could easily be in a configuration where the cards on one server are connected with 16x links to a single CPU and on the other they are connected with 8x links to different CPUs. Just because a system can support PCIe 3 16x doesn’t mean it does so in every configuration. You’ll have to verify that the cards are configured properly.

Traditionally, for dual CPU socket systems, it has also been important to control CPU and memory affinity, e.g. with numactl, in order to achieve consistent performance. The goal is to have each GPU “talk” to the “near” CPU and the “near” system memory.

I also seem to recall that P2P communication between GPUs is negatively impacted when the GPUs are on two different PCIe root complexes, and since each CPU provides its own PCIe root complex, this is something to watch out for in dual socket machines.
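To see which PCIe slot, and therefore which CPU socket, each GPU hangs off, a first step is to print each device's PCI bus ID, as in the sketch below. The function name `list_pci_bus_ids` is made up for this example, and mapping a bus ID to a socket still has to be done with system tools or the server's documentation; this only provides the IDs to look up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the PCI bus ID (domain:bus:device.function) of every visible GPU
// and returns how many were listed.
int list_pci_bus_ids() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return 0;
    int listed = 0;
    for (int d = 0; d < n; ++d) {
        char bus_id[32] = {0};
        if (cudaDeviceGetPCIBusId(bus_id, (int)sizeof(bus_id), d) == cudaSuccess) {
            printf("GPU %d: PCI bus ID %s\n", d, bus_id);
            ++listed;
        }
    }
    return listed;
}

int main() {
    list_pci_bus_ids();
    return 0;
}
```

On a dual-socket board, two GPUs whose bus IDs belong to different root complexes are attached to different CPUs, which is the case flagged above as a problem for P2P.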

The important thing here is that this comparison is not a controlled experiment where only one variable is changed (the type of GPU), and that therefore no conclusions can be drawn as to the relative performance of the two GPUs involved.