Why do 2*RTX 2080Ti run slower than 2*Tesla P100?

I wrote a CUDA program that uses unified memory addressing to run on two graphics cards. When I run it on the 2*P100 it takes 113 s and the load on each card is 97%, but when I run it on the 2*2080Ti it is very slow, and the load on the cards fluctuates between 35% and 100%. I don't use NVLink. I don't know what caused such a large difference in efficiency. Is it because of the data exchange between the graphics cards? Both setups are PCIe x16, and the 2*P100 also does not use a bridge.

Are these observations from an apples-to-apples comparison, in which only the GPUs are exchanged and the rest of the system (hardware and software) remained unchanged?

If you compare the analysis of the CUDA profiler for the application’s execution on these two configurations, are there any noticeable differences?

Without knowing what your code is doing that’s hard to answer. The P100s have much faster memory, much faster double precision units, and much better unified memory support. Any one of these things could make a difference.

Not sure what you mean by that. I think we're talking about a peak theoretical bandwidth of 732 GB/s (P100) vs. 616 GB/s (2080Ti), and I suspect that if you did a "real world" comparison using e.g. bandwidthTest, the reported numbers would be closer together. A 2080Ti seems to clock in at about 530 GB/s on bandwidthTest, whereas my P100 clocks in right at 500 GB/s.

Regarding "Not sure what you mean by that": I can almost always get P100s to clock in at more than 680 GB/s, although, thinking about it, I have access to the 16 GB version, which does make a difference since the 12 GB model tops out at about three quarters of that speed. From what I've seen, HBM seems to have slightly better latency on strided accesses.

As for unified memory, it may have to do with CUDA versions. The machines I have access to with 2080s have a version where things crash if you're not careful with unified memory.

Although, like I said, we need code to understand why one is slower than the other.

I’ve never witnessed anything close to that. If you can provide a test code that demonstrates higher than ~550GB/s on a P100 I would certainly like to see it.

680GB/s is achievable on a V100 (I measure about 730GB/s with bandwidthTest).

I believe the code I used was this GPU memory benchmark. It's meant for caches, but I recall running it last year for some other work and getting 600+ GB/s out of main memory.

https://github.com/ekondis/gpumembench

That measures on-chip memory bandwidth. It is not the same as the bandwidth to DRAM.

My understanding was that at large enough sizes all the on-chip caches would miss and everything would be fetched from DRAM. Though if that's wrong, it's good to know.

That’s true. And if you look at the sample output from that benchmark, you’ll see that at the “large enough” sizes the reported benchmark score is much lower. So if you’re looking at the top of that table (~1300GB/s) that is a proxy for cache bandwidth. If you look at the bottom of that table (~150GB/s) that is getting closer to what bandwidthTest would report for a GTX480. It is still getting some benefit from the caches. You need to go to many megabytes of transferred data to minimize cache benefit, for that sort of test. At 5MB data size, that test is still getting some benefit from cache, particularly due to the design of the test.

Anyway, using that test to estimate main memory bandwidth will be difficult. I don’t believe it supports any claims about main memory bandwidth, as that is not its stated purpose, or design. This would especially be the case with newer GPUs. A 5MB test would mostly fit into the 4MB L2 cache on P100. So it does not fit your idea of a “large enough size”.

From a design perspective, the way to write a main memory bandwidth test is to read from one location and write to another. Assuming there is no repetitive character, this should not receive much benefit from caches. OTOH, if you want to write a test that attempts to measure cache behavior, you are most likely reading and writing the same locations. The design of the test is fundamentally different. Since the test you linked states quite clearly it is attempting to do cache measurement, from a design perspective it would not be a good choice to use to estimate main memory bandwidth.
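As an illustration of that design difference, here is a minimal sketch of the read-from-one-buffer, write-to-another style of test. The buffer size, launch configuration, and names are just illustrative; with buffers much larger than the L2 cache, the measured number should land near what bandwidthTest reports for device memory.

```cpp
// Sketch of a main-memory bandwidth test: each thread reads from one buffer
// and writes to a different one, so with buffers much larger than L2 there
// is essentially no cache reuse.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float4 *__restrict__ in,
                           float4 *__restrict__ out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        out[i] = in[i];                    // 16 bytes read + 16 bytes written per element
}

int main()
{
    const size_t n = 64ULL * 1024 * 1024;  // 64M float4 = 1 GiB per buffer
    const size_t bytes = n * sizeof(float4);
    float4 *in, *out;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyKernel<<<1024, 256>>>(in, out, n); // warm-up
    cudaEventRecord(start);
    copyKernel<<<1024, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Total traffic is bytes read plus bytes written.
    printf("Effective bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e6));
    return 0;
}
```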

My English is not very good, so I am sorry if I did not express things clearly before. I will try to make things clear. I am doing computational fluid dynamics and wrote a program for it. When I use a single graphics card, the 2080Ti is much faster than the P100, and the load on the graphics card is 100%. When I rewrote it with unified memory addressing to run on two cards, I got nearly twice the speed on two P100s, and the load on both cards was stable at 97%. But when I used two 2080Tis, it became very slow and the load fluctuated between 37% and 100%. There is data exchange between the two cards in the middle of the program. My environment is VS2013 + CUDA 10. Both systems use Intel Xeon processors, and all cards are PCIe x16.

Does the P100 use the TCC driver on Windows whereas the RTX 2080Ti uses a WDDM driver? That could explain the performance differences.

This page explains the difference between the two driver models:
https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/tesla_compute_cluster.htm

If your code is not too Windows specific, you could attempt to repeat your comparison on Linux.

Did you compile the code to be specific to each architecture? Are the configurations of the machines the same other than the GPUs?
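If not, it may be worth building a fat binary with native code for both architectures. Assuming a plain nvcc invocation (in a Visual Studio project the equivalent setting lives under CUDA C/C++ -> Device -> Code Generation), something along these lines would cover a P100 (sm_60) and a 2080 Ti (sm_75); "myapp.cu" is just a placeholder name:

```
nvcc -gencode arch=compute_60,code=sm_60 -gencode arch=compute_75,code=sm_75 -o myapp myapp.cu
```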

There are still quite a few variables in your configuration.

What kind of work does your CFD code do? How much faster is the single 2080 Ti run over the P100?

I can think of a few cases where the 2080 would be faster, but those differences should be amplified with two cards.

Have you verified that the cards are actually running at PCIe 3 16x? On some motherboards one of the slots shares its PCIe lanes. One of my current machines will drop the slot to 8x or 4x if there’s another PCIe card present. You may want to verify that you can transfer data to each card at full speed.
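For reference, the bandwidthTest sample essentially does this already, but a quick sketch of the same idea is to time a pinned host-to-device copy on each GPU. On a healthy PCIe 3.0 x16 link you would expect roughly 11-12 GB/s; an x8 or x4 link shows up as roughly half or a quarter of that. The transfer size here is arbitrary:

```cpp
// Rough per-GPU check of pinned host-to-device transfer bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256ULL * 1024 * 1024;   // 256 MiB test transfer
    void *h;
    cudaMallocHost(&h, bytes);                   // pinned host buffer

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        void *d;
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // warm-up
        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("GPU %d: H2D %.1f GB/s\n", dev, bytes / (ms * 1e6));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
    }
    cudaFreeHost(h);
    return 0;
}
```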

The two P100s are in a Dell PowerEdge R740 server with dual Intel Xeon Gold 5117 CPUs; the two 2080Tis are in a Dell PowerEdge T630 server with dual E5-2660 CPUs. Both are server boards, and both cards are PCIe x16.

My program simulates multiphase flow with the lattice Boltzmann method. The single 2080Ti is 2-3 times faster than the single P100.

I have now purchased an NVLink bridge, and I will switch to it if I want to try SLI.

Since you’re using unified memory with multi-GPU, one possibility to consider is whether or not both cards in each case are capable of being in a P2P relationship with each other.

With unified memory, and multiple GPUs, when 2 cards cannot be placed in a P2P relationship with each other, then unified memory allocations will “convert” to pinned memory allocations. In general this will slow things down quite a bit.

Unless you have NVLink bridges installed, the RTX2080Ti will not support P2P in any scenario:
https://devtalk.nvidia.com/default/topic/1046951/cuda-programming-and-performance/does-titan-rtx-support-p2p-access-w-o-nvlink-/

You can confirm this by running the simpleP2P or deviceQuery CUDA sample codes, or with a few lines of code like the sketch below.
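A minimal check for a 2-GPU system (just a sketch; simpleP2P additionally performs an actual peer-to-peer transfer test):

```cpp
// Ask the runtime whether device 0 and device 1 can access each other's memory.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 access device 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // can device 1 access device 0?
    printf("P2P 0->1: %s, 1->0: %s\n", can01 ? "yes" : "no", can10 ? "yes" : "no");
    return 0;
}
```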

If your dual P100 setup shows P2P capability between the 2 GPUs, this could explain a significant slowdown (going to the RTX2080Ti setup) in a multi-GPU unified memory setup.

Taking a look at the specifications, you could easily be in a configuration where the cards in one server are connected with x16 links to a single CPU, while in the other they could be connected with x8 links to different CPUs. Just because a system can support PCIe 3.0 x16 doesn't mean it does so in every configuration. You'll have to verify that the cards are configured properly.

Traditionally, for dual CPU socket systems, it has also been important to control CPU and memory affinity, e.g. with numactl, in order to achieve consistent performance. The goal is to have each GPU “talk” to the “near” CPU and the “near” system memory.

I also seem to recall that P2P communication between GPUs is negatively impacted when the GPUs are on two different PCIe root complexes, and since each CPU provides its own PCIe root complex, this is something to watch out for in dual socket machines.
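On Linux, nvidia-smi topo -m prints the GPU/CPU affinity matrix directly. Failing that, a few lines of CUDA code will at least print each device's PCIe bus ID, which you can then match against the server's slot-to-CPU mapping. This is only a sketch:

```cpp
// Print each GPU's PCIe bus ID so it can be correlated with the board's
// slot-to-CPU assignment (e.g. via the system manual or lspci on Linux).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        char busId[32];
        cudaDeviceGetPCIBusId(busId, 32, dev);
        printf("GPU %d: PCI bus id %s\n", dev, busId);
    }
    return 0;
}
```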

The important thing here is that this comparison is not a controlled experiment where only one variable is changed (the type of GPU), and that therefore no conclusions can be drawn as to the relative performance of the two GPUs involved.