I created a micro-benchmark to test the register banks of the RTX A4000.
I used an assembler to build the micro-benchmark.
It simply executes 4 repeated FFMA instructions, like this:
CodeA:
[B------:R-:W-:-:S1] FFMA R4, R9, R10, R4;
[B------:R-:W-:-:S1] FFMA R5, R9, R10, R5;
[B------:R-:W-:-:S1] FFMA R6, R9, R10, R6;
[B------:R-:W-:-:S1] FFMA R7, R9, R10, R7;
The 4 instructions above are repeated 32 times in the inner loop, and the outer loop runs 256 iterations, so I think my test result is credible.
I found that its performance is only about 55% of peak, which is far below what I expected.
So I added “.reuse” to the source operands, like this:
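As a sanity check on how much work each thread does, the loop structure described above can be counted with a small Python sketch. The constant and variable names are only illustrative, and the factor of 2 FLOPs per FFMA (one multiply plus one add) is the usual counting convention, not something measured here.

```python
# Count the FFMA instructions issued per thread by the benchmark loop.
FFMA_PER_GROUP = 4       # the 4 FFMA instructions shown in CodeA
INNER_REPEATS = 32       # the group is repeated 32 times per outer iteration
OUTER_ITERATIONS = 256   # number of outer-loop iterations
FLOPS_PER_FFMA = 2       # assumption: one multiply + one add per FFMA

ffma_count = FFMA_PER_GROUP * INNER_REPEATS * OUTER_ITERATIONS
flop_count = ffma_count * FLOPS_PER_FFMA

print(ffma_count)  # 32768 FFMA instructions per thread
print(flop_count)  # 65536 floating-point operations per thread
```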
CodeB:
[B------:R-:W-:-:S1] FFMA R4, R9.reuse, R10.reuse, R4;
[B------:R-:W-:-:S1] FFMA R5, R9.reuse, R10.reuse, R5;
[B------:R-:W-:-:S1] FFMA R6, R9.reuse, R10.reuse, R6;
[B------:R-:W-:-:S1] FFMA R7, R9.reuse, R10.reuse, R7;
With that change, it reaches 95% of peak performance.
So I’m very confused.
From these two papers, I understand that the CUDA register file had 4 banks before Volta and has 2 banks from Volta onward: <1804.06826_Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.pdf> and
I think CodeA satisfies the constraints of a 2-bank register file.
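To make that reasoning explicit, here is a minimal Python sketch of the 2-bank model as I understand it from the Volta paper. It rests on two assumptions: a register's bank is its index modulo 2, and an instruction conflicts only when all three source operands fall in the same bank. The helper names are hypothetical, and this model may well not describe the RTX A4000 completely.

```python
# Assumed model (from the Volta microbenchmarking paper, not verified on RTX A4000):
# bank = register index % 2, conflict only if all three sources share one bank.
NUM_BANKS = 2

def bank_of(reg_index):
    """Return the assumed bank of a register given its index."""
    return reg_index % NUM_BANKS

def all_sources_same_bank(sources):
    """True if every source operand maps to the same bank under this model."""
    return len({bank_of(r) for r in sources}) == 1

# CodeA: (destination, (source A, source B, source C)) for each FFMA.
code_a = [
    (4, (9, 10, 4)),
    (5, (9, 10, 5)),
    (6, (9, 10, 6)),
    (7, (9, 10, 7)),
]

conflicts = []
for dst, srcs in code_a:
    banks = [bank_of(r) for r in srcs]
    conflict = all_sources_same_bank(srcs)
    conflicts.append(conflict)
    print(f"FFMA R{dst}: source banks {banks}, conflict={conflict}")
```

Since R9 and R10 always land in different banks under this model, none of the four instructions is flagged as a conflict, which is why the result without “.reuse” surprises me.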
I expected it to reach 95%+ of peak performance even without “.reuse”.
So why does it reach only about half of peak performance?
There must be something here that I don’t know.
Could anyone explain what I am missing?