How to understand register banks on the RTX A4000

I wrote a micro-benchmark to test the register banks of the RTX A4000.
I use an assembler to build and test the micro-benchmark.
It simply executes a group of 4 FFMA instructions, like this:

[B------:R-:W-:-:S1]  FFMA R4, R9, R10, R4; 
[B------:R-:W-:-:S1]  FFMA R5, R9, R10, R5; 
[B------:R-:W-:-:S1]  FFMA R6, R9, R10, R6; 
[B------:R-:W-:-:S1]  FFMA R7, R9, R10, R7; 

These 4 instructions are repeated 32 times per iteration (inner loop), and the outer loop runs 256 iterations, so I think my test results are credible.
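For reference, the amount of work implied by those loop counts can be tallied with a small sketch. The counts come from the description above; the helper `percent_of_peak` and its parameters (`total_threads`, `elapsed_s`, `peak_flops`) are hypothetical names for illustration, and the peak figure must be supplied by whoever runs the test.

```python
# Sketch: work done per thread by the benchmark loop described above.
FFMA_PER_GROUP = 4       # the four FFMA instructions shown
INNER_REPEATS = 32       # copies of the group per outer iteration
OUTER_ITERATIONS = 256   # outer loop trip count

ffma_per_thread = FFMA_PER_GROUP * INNER_REPEATS * OUTER_ITERATIONS
flop_per_thread = 2 * ffma_per_thread  # one FFMA counts as a multiply plus an add


def percent_of_peak(total_threads, elapsed_s, peak_flops):
    """Achieved FP32 throughput as a percentage of a given peak (in FLOP/s).

    All three arguments are measured or looked up by the user; nothing here
    is specific to a particular GPU.
    """
    achieved = total_threads * flop_per_thread / elapsed_s
    return 100.0 * achieved / peak_flops


print(ffma_per_thread, flop_per_thread)
```

This only counts the FFMA work; loop overhead and any instructions outside the four shown are ignored.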

I found that it reaches only about 55% of peak performance, which is well below my expectation.
So I added “.reuse” to the operands, like this:

[B------:R-:W-:-:S1]  FFMA R4, R9.reuse, R10.reuse, R4; 
[B------:R-:W-:-:S1]  FFMA R5, R9.reuse, R10.reuse, R5; 
[B------:R-:W-:-:S1]  FFMA R6, R9.reuse, R10.reuse, R6; 
[B------:R-:W-:-:S1]  FFMA R7, R9.reuse, R10.reuse, R7; 

With that change, it reaches 95% of peak performance.

So I’m very confused.
My understanding is that CUDA GPUs have 4 register banks before Volta and 2 from Volta onward, based on these two papers: <1804.06826_Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.pdf> and

I think code A (the first version, without “.reuse”) meets the requirements of the 2-bank layout.
So I think it should reach 95%+ of peak performance without “.reuse”.
But why does it reach only about half of peak performance?
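The bank reasoning above can be checked with a short sketch. It assumes two simplified models that are not confirmed for the RTX A4000: a 2-bank model where the bank is the register index mod 2 and a conflict needs all three distinct source registers in the same bank (my reading of the Volta paper), and a 4-bank model where the bank is the register index mod 4 and any two distinct sources sharing a bank conflict (the Maxwell-style description). The function names are hypothetical.

```python
# Sketch of two simplified register-bank models (assumptions, see the text above).

def bank(reg, num_banks):
    """Bank index of register Rn, assuming bank = n mod num_banks."""
    return reg % num_banks


def conflicts_two_bank(srcs):
    """2-bank model: conflict only if three distinct sources share one bank."""
    unique = set(srcs)
    return len(unique) == 3 and len({bank(r, 2) for r in unique}) == 1


def conflicts_four_bank(srcs):
    """4-bank model: conflict if any two distinct sources share a bank."""
    unique = set(srcs)
    return len({bank(r, 4) for r in unique}) < len(unique)


# Source operands (a, b, c) of each "FFMA d, a, b, c" in the first snippet.
ffmas = [(9, 10, 4), (9, 10, 5), (9, 10, 6), (9, 10, 7)]

two_bank = [conflicts_two_bank(s) for s in ffmas]
four_bank = [conflicts_four_bank(s) for s in ffmas]
print("2-bank model conflicts:", two_bank)
print("4-bank model conflicts:", four_bank)
```

Under the assumed 2-bank model none of the four instructions conflicts, which matches the expectation stated above. The 4-bank result is included only for contrast; neither model is claimed to explain the measured throughput.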

I think there must be something here that I don’t know.
Would anyone be willing to explain what I’m missing?

My understanding (which could well be flawed) is that while “reuse” can help in register bank conflict situations, its main advantage is that it reduces instruction latency. I’ve come across a figure of one cycle saved, but I can’t find the reference at the moment.

I haven’t come across anything official on this, but Scott Gray’s maxas work and the series of papers you reference have helped.