I realise this falls well into the “crystal ball gazing” realm of analysis, but I am interested in any comment.
I have a task utilising a couple of what I believe are quite well optimised kernels, running 100% uint32_t integer and logic code (LOP3 instructions overwhelmingly dominate). Looking at Nsight Compute, the first kernel shows “SM Busy 77%”, “Memory Busy 19%” (>90% Global Stores), “Issue Slots Busy 53%”.
The second kernel shows “SM Busy 90%”, “Memory Busy 74%” (>98% Shared), “Issue Slots Busy 59%”.
The task has zero host involvement, bar launching kernels and very minimal result checking, and is running on a Pascal GTX 1060 (10 SMs with 128 integer cores/SM). I’m reasonably confident that if I were to run it on a GTX 1080 with 20 SMs, I’d see a doubling in performance.
I’m interested in an Ampere RTX 3080, which has 68 SMs with 64 integer cores/SM (a limitation shared by all architectures since Pascal).
Setting aside performance gains due to cache size/behaviour and instruction latency, is it a reasonable generalisation to assess the performance gain from this card as though it had 34 SMs, given each SM has only half the number of integer units?
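To make the “34 SMs” reasoning concrete, here is the back-of-envelope arithmetic I’m relying on: treat sustained INT32 throughput as proportional to (SMs × integer cores per SM), ignoring clocks, caches, and latency. The function and card names below are just mine for illustration.

```python
# Naive scaling model: performance proportional to total INT32 pipes.
# Ignores clock speed, cache behaviour, instruction latency, occupancy.
def int32_pipes(sms: int, int_cores_per_sm: int) -> int:
    return sms * int_cores_per_sm

gtx1060 = int32_pipes(10, 128)   # Pascal: 128 integer cores per SM
gtx1080 = int32_pipes(20, 128)
rtx3080 = int32_pipes(68, 64)    # Ampere: only 64 INT32-capable pipes per SM

# Ampere SMs expressed in "Pascal-equivalent" SMs (128 int cores each)
pascal_equiv_sms = rtx3080 / 128

print(gtx1080 / gtx1060)   # expected 2.0x (the GTX 1080 doubling)
print(pascal_equiv_sms)    # 34.0 -- the "34 SMs" figure
print(rtx3080 / gtx1060)   # ~3.4x naive upper estimate vs the GTX 1060
```

Under this model the RTX 3080 looks like 34 Pascal-style SMs, i.e. roughly 3.4× the GTX 1060 before clock and memory effects, which is exactly the generalisation I’m asking about.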
Or are there other Nsight Compute metrics I can check with the current setup that might indicate some unused issue capacity, in which case the Ampere restriction may not be as severe as it appears?