How to understand register bank on RTX A4000

Shaquille · September 18, 2023, 12:07pm

I created a micro-benchmark to test the register banks of the RTX A4000.
I used an assembler to build the micro-benchmark.
It simply executes 4 repeated FFMA instructions, like this:
CodeA:

[B------:R-:W-:-:S1]  FFMA R4, R9, R10, R4; 
[B------:R-:W-:-:S1]  FFMA R5, R9, R10, R5; 
[B------:R-:W-:-:S1]  FFMA R6, R9, R10, R6; 
[B------:R-:W-:-:S1]  FFMA R7, R9, R10, R7;

The 4 instructions above are repeated 32 times in one iteration (inner loop), and the outer loop runs 256 iterations, so I think my test result is credible.
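For reference, the loop structure described above fixes the amount of work per thread. A minimal Python sketch of that arithmetic follows. The function names and the way the fraction of peak is computed (ideal cycles divided by measured cycles) are illustrative assumptions, not taken from the benchmark itself, and the peak issue rate is a parameter rather than a claimed hardware figure.

```python
# Loop structure as described in the post.
FFMA_PER_GROUP = 4      # the four FFMA instructions in CodeA
INNER_REPEATS = 32      # the group is repeated 32 times per iteration
OUTER_ITERATIONS = 256  # outer loop count


def ffma_per_thread():
    """Total FFMA instructions executed by one thread."""
    return FFMA_PER_GROUP * INNER_REPEATS * OUTER_ITERATIONS


def flops_per_thread():
    """Each FFMA counts as 2 floating-point operations (multiply + add)."""
    return 2 * ffma_per_thread()


def fraction_of_peak(measured_cycles, ffma_issued, peak_ffma_per_cycle=1.0):
    """Hypothetical helper: ratio of ideal cycles to measured cycles.

    peak_ffma_per_cycle is an assumption the caller must supply for
    the scheduler being measured; it is not a documented constant.
    """
    ideal_cycles = ffma_issued / peak_ffma_per_cycle
    return ideal_cycles / measured_cycles
```

With these definitions, a run that takes twice the ideal number of cycles gives `fraction_of_peak(...) == 0.5`, which is roughly the situation reported for CodeA.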

I found that it reaches only about 55% of peak performance, which is below my expectation.
So I added “.reuse” to the operands, like this:
CodeB:

[B------:R-:W-:-:S1]  FFMA R4, R9.reuse, R10.reuse, R4; 
[B------:R-:W-:-:S1]  FFMA R5, R9.reuse, R10.reuse, R5; 
[B------:R-:W-:-:S1]  FFMA R6, R9.reuse, R10.reuse, R6; 
[B------:R-:W-:-:S1]  FFMA R7, R9.reuse, R10.reuse, R7;

and then it reaches 95% of peak performance.

So I’m very confused.
My understanding is that the CUDA register file has 4 banks before Volta and 2 banks from Volta onward, based on these two papers: <1804.06826_Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.pdf> and

I think CodeA meets the requirement of 2 banks.
I think it should reach 95%+ of peak performance without “.reuse”.
But why is it only 50% of peak performance?
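The 2-bank reasoning can be checked with a small Python sketch. It assumes the model described in the Volta paper cited above: the bank index is the register number modulo 2, and a conflict occurs only when all three source operands of an instruction fall in the same bank. Whether that model holds on the RTX A4000 is exactly what is in question here, so this only confirms the expectation, not the hardware behavior. The helper names are hypothetical.

```python
NUM_BANKS = 2  # assumption: 2 register banks from Volta onward


def bank(reg):
    """Assumed mapping: bank index = register number mod NUM_BANKS."""
    return reg % NUM_BANKS


def has_conflict(sources):
    """Assumed rule: conflict only if all source registers share one bank."""
    return len({bank(r) for r in sources}) == 1


# Source operands of the four FFMA instructions in CodeA:
# FFMA Rd, R9, R10, Rd  ->  sources (9, 10, d)
CODE_A = [(9, 10, 4), (9, 10, 5), (9, 10, 6), (9, 10, 7)]

for srcs in CODE_A:
    banks = [bank(r) for r in srcs]
    print(f"sources {srcs} -> banks {banks}, conflict: {has_conflict(srcs)}")
```

Under this model none of the four instructions conflicts, because R9 (odd) and R10 (even) always land in different banks. A case that would conflict is `has_conflict((8, 10, 4))`, where all three sources are even-numbered.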

I think there must be something I don’t know.
So, would anyone be willing to tell me the secret?

rs277 · September 18, 2023, 7:00pm

My understanding (which could well be flawed) is that while “.reuse” can help in register bank conflict situations, its main advantage is that it reduces instruction latency. I’ve come across the figure of one cycle saved, but I can’t find the reference at the moment.

I haven’t come across anything official on this, but Scott Gray’s maxas work and the series of papers you reference have helped.
