GTX980ti faster than RTX 2080ti?

I just started with CUDA development and wrote a simple app to test performance.
I tried to compare both of my GPUs to see the difference and noticed that my GTX 980ti seems to outperform the RTX 2080ti, which is weird… Am I doing something wrong here, or what could cause this behavior?

Here’s the code I use to test this:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// N and fill_array() are defined elsewhere in my project (not shown here)

__global__ void device_add(long* a, long* b, long* c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

void run_on_device(int idx, long* a, long* b, long* c) {
    cudaSetDevice(idx);
    auto start = std::chrono::high_resolution_clock::now();

    long size = N * sizeof(long);
    long* da, * db, * dc;

    cudaMalloc(&da, size);
    cudaMalloc(&db, size);
    cudaMalloc(&dc, size);

    cudaMemcpy(da, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, size, cudaMemcpyHostToDevice);

    // launch one block of N threads, 10 times
    for (int i = 0; i < 10; i++) {
        device_add<<<1, N>>>(da, db, dc);
    }

    // this device-to-host copy blocks until the kernels above have finished
    cudaMemcpy(c, dc, size, cudaMemcpyDeviceToHost);

    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    long long microseconds = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("Device %d time: %lld\n", idx, microseconds);

    cudaFree(da); cudaFree(db); cudaFree(dc);
}

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s\n", prop.name);
    cudaGetDeviceProperties(&prop, 1);
    printf("Device 1: %s\n", prop.name);

    long* a, * b, * c;

    a = new long[N]; fill_array(a);
    b = new long[N]; fill_array(b);
    c = new long[N];

    run_on_device(0, a, b, c);
    run_on_device(1, a, b, c);

    delete[] a; delete[] b; delete[] c;

    return 0;
}

When I run this, here’s the output I get:

Device 0: GeForce RTX 2080 Ti
Device 1: GeForce GTX 980 Ti
Device 0 time: 104431
Device 1 time: 50073

Did some more digging and it seems that the memory allocation is causing the difference. When I only time these lines:

cudaMalloc(&da, size);
cudaMalloc(&db, size);
cudaMalloc(&dc, size);

The 980ti is much much faster than the 2080ti…
That doesn’t make sense, right?
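For reference, here’s roughly how I isolated the allocation timing (just a sketch, same chrono approach and N as in the full program above):

cudaSetDevice(idx);

long size = N * sizeof(long);
long* da, * db, * dc;

auto start = std::chrono::high_resolution_clock::now();

// time only the three device allocations
cudaMalloc(&da, size);
cudaMalloc(&db, size);
cudaMalloc(&dc, size);

auto elapsed = std::chrono::high_resolution_clock::now() - start;
printf("Device %d alloc time: %lld us\n", idx,
    (long long)std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());

cudaFree(da); cudaFree(db); cudaFree(dc);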

Are you testing each card in the same slot on the motherboard?

If not, it’s possible that the 2080 is in a slot that does not have 16 PCIe lanes.

The nvidia-smi utility will confirm PCIe connectivity.
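If you’d rather read the same link info from code, NVML exposes it as well. A minimal sketch (assumes the NVML header and library that ship with the driver, link with -lnvidia-ml; error checking omitted):

#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int gen = 0, maxGen = 0, width = 0, maxWidth = 0;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        // current vs. maximum PCIe generation and link width
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);
        printf("%s: PCIe Gen %u (max %u), width x%u (max x%u)\n",
               name, gen, maxGen, width, maxWidth);
    }
    nvmlShutdown();
    return 0;
}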

They’re in different slots, and here’s the PCI info from the SMI utility:

I think these look normal?

RTX 2080ti:

PCI
    Bus : 0x01
    Device : 0x00
    Domain : 0x0000
    Device Id : 0x1E0710DE
    Bus Id : 00000000:01:00.0
    Sub System Id : 0x12FB196E
    GPU Link Info
        PCIe Generation
            Max : 3
            Current : 1
        Link Width
            Max : 16x
            Current : 8x
    Bridge Chip
        Type : N/A
        Firmware : N/A
    Replays Since Reset : 0
    Replay Number Rollovers : 0
    Tx Throughput : 7000 KB/s
    Rx Throughput : 36000 KB/s

GTX 980ti:
PCI
    Bus : 0x02
    Device : 0x00
    Domain : 0x0000
    Device Id : 0x17C810DE
    Bus Id : 00000000:02:00.0
    Sub System Id : 0x139419DA
    GPU Link Info
        PCIe Generation
            Max : 3
            Current : 1
        Link Width
            Max : 16x
            Current : 4x
    Bridge Chip
        Type : N/A
        Firmware : N/A
    Replays Since Reset : 0
    Replay Number Rollovers : 0
    Tx Throughput : 0 KB/s
    Rx Throughput : 0 KB/s

There’s your problem with the 2080:
Link Width
    Max : 16x
    Current : 8x

So that slot, although socketed to take 16-lane cards, is only wired or configured for 8 lanes.

Check your motherboard manual; perhaps that slot can be reconfigured to 16x.

EDIT: Looking harder, there seem to be several issues with both cards. The 2080 is a PCIe Gen 3 card but it says it’s currently only connected at Gen 1 speed, and the 980 (also Gen 3) is likewise connected at Gen 1 and only using 4 of the 16 lanes.

Very odd. What is the motherboard?

I’m using an ASUS ROG Strix Z490-H.

I’ll check the BIOS to see if I can find something there…

Checking the manual, it seems that the first PCIe slot has the 16 lanes, but my 2080ti is connected to the second PCIe slot and the GTX to the third.
The first slot contains my sound card… I guess if I swap the sound card with the 2080ti it should be OK…

Does that mean I also see lower performance in things like games, or does it not matter as much there?

edit: although the manual says that with a single GPU the first slot runs at 16x, and with dual GPUs it runs at 8x while the other slots run at 4x, and that’s what we see…
We see 8x on the 2080 and 4x on the 980ti, but that doesn’t explain why the 980 is faster, right?

Not necessarily, but if you want to get the best out of what are nice cards, you really want both in Gen 3 16x slots.

EDIT: And it’s not clear why the speed is only Gen 1 on each, unless the manual does not mention that they reduce the speed along with the number of lanes as you add more cards.

Edit2: “The first slot contains my soundcard… Guess if I switch the soundcard with the 2080ti it should be ok…”

No, looking at the manual, the only way you can get x16 is to have only one of the three slots occupied. Add any card to either of the others and it drops to 8x.

I understand that my motherboard does this…
But according to the doc, the first slot drops from 16 to 8 and the second to 4. That’s what the SMI utility shows…

So the 2080ti runs on 8 lanes while the 980ti runs on 4 lanes, but the 980ti is still much faster at allocating memory.
That also doesn’t make sense, right?

The only reason I can think of is that the 2080ti is my primary GPU and I run my triple-monitor setup on it… But even so, I can hardly believe that would cause such a drop in performance for such a card…

I think I’ve been looking at this from the wrong angle…
I don’t think it’s a card or PCIe issue, but something I don’t (yet) understand about the CUDA architecture…

Here are some tests I’ve been doing (device 0 is the RTX, device 1 is the GTX):

run_on_device(0, a, b, c); —> 470ms
run_on_device(1, a, b, c); —> 325ms

run_on_device(1, a, b, c); —> 500ms
run_on_device(0, a, b, c); —> 320ms

run_on_device(0, a, b, c); —> 500ms
run_on_device(0, a, b, c); —> 319ms

run_on_device(1, a, b, c); —> 500ms
run_on_device(1, a, b, c); —> 320ms

So it seems that only the first invocation is slow and the rest are faster, no matter which GPU I use for the first invocation.

Your timing methodology is flawed. Many factors related to both hardware and software typically cause the first invocation of anything to be slower than subsequent invocations. In analogy to car engines, this is also known as a “cold start”. Best practice is to “warm up” hardware and software, until processing reaches “steady state” (indicated by execution times fluctuating slightly around a nearly constant value).

A common approach to achieve this is to execute the code in question N times in a row and record the fastest time. This is what the STREAM benchmark does, where N defaults to 10. An alternative methodology could be to execute the code N times in a row, discard the fastest and the slowest times (to eliminate outliers), then average the rest. So in this case one might want to choose N=5.
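Applied to your run_on_device() call (timing the whole call, allocations included, exactly as your code does now), that could look roughly like this; just a sketch:

// warm-up: the first call absorbs context creation, clock ramp-up, etc.
run_on_device(0, a, b, c);

// then time several repetitions and keep the fastest one
long long best = LLONG_MAX;   // from <climits>
for (int rep = 0; rep < 10; rep++) {
    auto t0 = std::chrono::high_resolution_clock::now();
    run_on_device(0, a, b, c);
    auto t1 = std::chrono::high_resolution_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    if (us < best) best = us;
}
printf("Device 0 best of 10: %lld us\n", best);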

Note that these days both CPUs and GPUs are dynamically clocked based on a variety of factors, and performance fluctuates accordingly. For example, on a typical GPU the clock frequency first increases quickly (but not instantaneously) from power-saving mode to full performance mode. If the GPU hasn’t been in use for several minutes prior, it will be cool, and shortly after the start of CUDA activity it will reach its highest operating frequency, say 1800 MHz. As the GPU keeps running CUDA code, it heats up. Typically, at a GPU temperature of around 60 deg C, the clock speed starts to drop. Once the GPU reaches the upper limit of its supported temperature range after a couple of minutes of intense use, the frequency might be, say, 1500 MHz and stay there. This means you want to keep the general environment for your benchmark runs roughly the same.
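If you want to observe this on your own cards, you can sample the SM clock and GPU temperature while your benchmark is looping; here is another small NVML sketch (run it alongside the benchmark, or call the queries from inside your repetition loop):

#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    // index 0 = whichever GPU you are benchmarking; note that NVML enumerates
    // by PCI bus order, which can differ from the CUDA device ordering
    nvmlDeviceGetHandleByIndex(0, &dev);
    unsigned int smClockMHz = 0, tempC = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClockMHz);       // current SM clock
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);   // current GPU temperature
    printf("SM clock: %u MHz, temperature: %u C\n", smClockMHz, tempC);
    nvmlShutdown();
    return 0;
}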

Thanks that is very helpful! I’ll do some tests with that methodology and see how it goes.

I’m glad nothing is wrong with the hardware :)
These little tests help you better understand the system…

Just a follow-up, having done some more digging: it is important to check the nvidia-smi PCIe link parameters while the card is under load, not sitting idle. The driver reduces the link speed to save power when the GPU is not active.
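For example, something like this keeps host-to-device traffic going so you can watch the link width and generation in another terminal while it runs (just a sketch; the buffer size and iteration count are arbitrary):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u * 1024 * 1024;   // 256 MB, arbitrary
    char* h = (char*)malloc(bytes);
    char* d = nullptr;
    cudaSetDevice(0);                          // the GPU whose link you want to check
    cudaMalloc(&d, bytes);
    printf("Copying... check the PCIe link info in nvidia-smi now\n");
    for (int i = 0; i < 500; i++) {
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // keeps the PCIe link busy
    }
    cudaFree(d);
    free(h);
    return 0;
}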