Scaling on different architectures

I realise this falls well into the “crystal ball gazing” realm of analysis, but I'm interested in any comments.

I have a task utilising a couple of what I believe are quite well optimised kernels, running 100% uint32_t integer and logic code (LOP3 instructions overwhelmingly dominate). Looking at Nsight Compute, the first kernel shows “SM Busy 77%”, “Memory Busy 19%” (>90% Global Stores), “Issue Slots Busy 53%”.

The second kernel shows “SM Busy 90%”, “Memory Busy 74%” (>98% Shared), “Issue Slots Busy 59%”.

The task has zero host involvement, bar launching kernels and very minimal result checking. It is running on a Pascal GTX 1060 (10 SMs with 128 integer cores/SM), and I’m reasonably confident that if I ran it on a GTX 1080 with 20 SMs, I’d see a doubling in performance.

I’m interested in an Ampere RTX 3080, which has 68 SMs with only 64 integer cores/SM (a constraint every architecture since Pascal has shared).

Setting aside performance gains due to caching size/behaviour and instruction latency, is it a reasonable generalisation to assess the performance gain from this card as if it had 34 SMs, given that each SM has only half the number of integer units?
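As a quick sanity check of that generalisation, one can count total INT32 lanes per card (SM counts and cores/SM are as stated above; this deliberately ignores clocks, caching and latency):

```python
# Back-of-envelope scaling by total INT32 lanes.
# Core counts are from the discussion above; clocks, caching and
# instruction latency are deliberately ignored here.
gtx1060_lanes = 10 * 128   # 10 SMs x 128 INT32 cores/SM
rtx3080_lanes = 68 * 64    # 68 SMs x 64 INT32 cores/SM

ratio = rtx3080_lanes / gtx1060_lanes
print(f"RTX 3080 / GTX 1060 INT32 lane ratio: {ratio:.1f}x")  # 3.4x
```

Counting the RTX 3080 as 34 Pascal-style SMs against the 1060’s 10 gives the same 3.4x figure, so the “halve the SM count” shorthand is at least internally consistent.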

Or are there other Nsight Compute metrics I can check with the current setup that might indicate some unused capacity, meaning the Ampere restriction is not as severe as it looks?

Thanks

Speculating on performance across GPU architectures is an iffy proposition; obviously it would be best to simply try it. Given the code characterization, for a back-of-the-envelope comparison I would simply compare the throughput of LOP3 instructions, i.e. theoretical LOP3s per second. A couple of thoughts:

(1) The throughput of LOP3 instructions may not be the same as the throughput of simple integer instructions on Pascal; I have not looked into the details. From briefly looking at code generated for Ampere, it seems all logical operations get converted into LOP3 (using the zero register as the third source if need be), suggesting that LOP3 throughput there is the same as the throughput of elementary integer operations.

(2) The sustained operating frequency of a Pascal GPU could be a bit higher than that of an Ampere GPU. I find that databases of GPU specifications can be a poor guide in this regard. For example, TechPowerUp lists the Quadro P2000 with a boost clock of 1480 MHz, but I often see mine running at 1650 to 1700 MHz, sustained for hours at full load (the cannot-exceed boost clock appears to be 1721 MHz).
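Regarding (1): for reference, LOP3 computes an arbitrary 3-input boolean function selected by an 8-bit lookup-table immediate; per the PTX ISA description of `lop3.b32`, the immediate is obtained by applying the desired expression to the constants 0xF0, 0xCC and 0xAA. A small Python emulation of that semantics (the function name `lop3` is mine, purely for illustration):

```python
def lop3(a: int, b: int, c: int, lut: int) -> int:
    """Emulate the PTX/SASS LOP3 instruction on 32-bit operands.

    Each result bit is looked up in the 8-bit immediate `lut`,
    indexed by the corresponding bits of a, b and c.
    """
    r = 0
    for i in range(32):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        r |= ((lut >> idx) & 1) << i
    return r

# The immediate for an expression f(a, b, c) is f(0xF0, 0xCC, 0xAA):
AND3 = 0xF0 & 0xCC & 0xAA          # 0x80 -> a & b & c
XOR3 = 0xF0 ^ 0xCC ^ 0xAA          # 0x96 -> a ^ b ^ c

a, b, c = 0xDEADBEEF, 0x12345678, 0xFFFF0000
assert lop3(a, b, c, AND3) == (a & b & c)
assert lop3(a, b, c, XOR3) == (a ^ b ^ c)
```

This is why any two-input logic op maps onto LOP3 trivially: the compiler just picks an immediate that ignores the third (zero-register) input.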

Thanks Norbert. I’m picking I’d see around a 4x speedup.

Comparing similarly tiered products, GTX 1080 vs RTX 3080, that's a 2x speedup for a card three generations newer and currently priced 50-80% higher than recommended retail (here, at least).

I think I’ll look at picking up a few second hand 1060-80s.

A friend has a 3070. I may try and test using his system.

Perhaps a quirk of their database only having one sample for that card. The P2000 uses the same chip as my 1060, albeit with two SMs disabled, and I see the 1650 to 1700 MHz range as well.

It seems that, compared to Pascal-based models, more recent GPUs are often more limited in their clock boosting by thermals, both from what I read on websites and from what I see with my own Quadro RTX 4000. If I recall correctly, the cannot-exceed boost limit on that is something like 1810 MHz, but I rarely see it sustain more than 1500 MHz or thereabouts.

Certainly scaling based on measured RTX 3070 results should allow for a more accurate performance prediction.

Again, apples vs oranges, but somewhat colouring the pessimism is this site, which compares selected cards running various crypto mining algorithms - code I imagine is not too dissimilar to my own, although I have no cryptocurrency experience:

Unless I screwed up my math, the performance ratio based on theoretical LOP3 throughput on GTX 1080 vs RTX 3080 comes out to 3.3x. All the information I can find suggests that LOP3 has the same throughput as simple ALU operations on Pascal, so I simply looked at SM counts and operating frequencies. Assuming that thermals will adversely affect the clock boost of the RTX 3080 under sustained full load, and derating to 85% to take that into account, still leaves a factor of 2.85x. The Ethereum ETHASH results at the website linked above suggest a performance ratio of 2.7x (36.158 vs 97.878). Seems close enough.
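For what it's worth, that arithmetic is easy to reproduce. The boost-clock figures below are assumed reference values (roughly 1733 MHz for the GTX 1080 and 1710 MHz for the RTX 3080), and the method follows the post above - SM count times operating frequency, assuming equal per-SM LOP3 throughput - so treat the output as approximate:

```python
# Reproduce the back-of-envelope ratio: SM count x operating frequency.
# Boost clocks are assumed reference values, not measured figures.
sm_1080, mhz_1080 = 20, 1733
sm_3080, mhz_3080 = 68, 1710

ratio = (sm_3080 * mhz_3080) / (sm_1080 * mhz_1080)
derated = ratio * 0.85            # allow for thermal throttling under sustained load
ethash = 97.878 / 36.158          # measured ETHASH ratio quoted above

print(f"theoretical: {ratio:.2f}x, derated: {derated:.2f}x, ETHASH: {ethash:.2f}x")
```

With these clock assumptions the theoretical ratio lands between 3.3x and 3.4x, the 85% derating gives about 2.85x, and the ETHASH numbers give about 2.7x, consistent with the figures in the post.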

Are the LOP3s in your code all generated by the compiler, or is this hand-crafted code? While the CUDA 11.1 compiler generally seems to do a great job of massaging a passel of logic operations into a near-optimal sequence of LOP3s, it can at times be beaten by a determined human. With decidedly non-trivial effort I have succeeded twice so far.

I am not up to speed on algorithms for determining the most efficient way to map random logic equations to N-input logic functions for N > 2. I seem to recall from working with early Xilinx FPGAs, which used table-based 5-input logic functions as their basic building block (the CLB), that optimal solutions could not be found reliably; but that was 30 years ago and I would hope progress has been made since then.

It’s still a tough call - at current second hand prices, I can get 2 x GTX 1080 for half the price of an RTX 3080. The inconvenience of multiple cards and higher power consumption are not considerations, and the task is easily divided across multiple cards.

I know the market (AI and gaming) is driving FP and Tensor performance at the expense of integer, and a glance across the integer/logic throughput figures shown here since compute capability 6.x shows the damage:

Were the boot on the other foot and I were using floats, I’d be singing from the rafters with Ampere :)

Definitely compiler generated, and my hat’s off to the nvcc developers. I’m regularly amazed at the SASS that appears vs. what I attempt to get it to digest. If your efforts have been non-trivial, then rest assured I’m not even going to come close. Even then, and with due respect, we’re probably only talking single-digit percentage gains.

Given the R&D has already been amortised, perhaps limited runs of the Titan Xp could be produced (with extra SMs to take advantage of process shrinkage) for integer holdouts…

Happy New Year.

Agreed. I also agree that if you don’t have to take the cost of power into consideration, multiple used GTX 1080s seems the way to go at this point in time.

Happy New Year to you as well.