Using the register operand collector to dual-issue 3-operand instructions without bank conflicts

Some GPUs, apparently including the Kepler family, cache recently used register operands so that they don’t have to be read from the register file again.
This is described in the paper by Xiu et al., available as sgemm.pdf in the PAA-NCIC/PPoPP2017_artifact repository on GitHub. The repository also includes the tools mentioned in the paper for assembling SASS code into ISA binary.
The operand cache is useful because it allows the same register to be used twice in the same instruction, or in different instructions in the same thread, without reading the register file twice. As a result, the thread can execute two three-operand instructions as a double issue, by reusing two of the source operands.
In the cited Xiu paper, there’s an example of three instructions:

FFMA V, R150, W, V    (single issue)
FFMA X, R150, R146, X (double issue)
FFMA Y, Z, R146, Y    (double issue)

R146 and R150 are in the same bank, and different from X, Y, and Z.
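
To make the bank argument concrete, here is a toy Python model of the second instruction. It is only a sketch under assumptions of mine, not a description of real hardware: I assume 4 banks with a simple `reg % 4` mapping, that each bank can deliver one distinct register per cycle, and that X is R3. The names `bank_of` and `extra_read_cycles` are made up for this illustration.

```python
from collections import Counter

NUM_BANKS = 4  # assumption for this toy model, not a documented hardware value


def bank_of(reg):
    # Assumed mapping: simple modulo. Real GPUs may map registers differently.
    return reg % NUM_BANKS


def extra_read_cycles(sources, cached=frozenset()):
    """Extra register-file cycles one instruction needs in this toy model.

    Operands listed in `cached` are taken from the operand cache and
    do not touch the register file at all.
    """
    needed = {r for r in sources if r not in cached}
    per_bank = Counter(bank_of(r) for r in needed)
    # A bank serving n distinct registers needs n - 1 extra cycles.
    return max((n - 1 for n in per_bank.values()), default=0)


X = 3  # hypothetical register number, chosen to sit in a different bank
second = [150, 146, X]  # sources of: FFMA X, R150, R146, X

print(extra_read_cycles(second))                # R150 and R146 collide
print(extra_read_cycles(second, cached={150}))  # R150 reused from the cache
```

In this model the conflict between R150 and R146 disappears once R150 is still held from the previous FFMA, which is the effect the paper's example relies on.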

The paper cites an NVIDIA patent, found at https://patentimages.storage.googleapis.com/a8/70/94/16936cf6b77e43/US8639882.pdf. You can also look it up at the USPTO web site. The patent may be interesting to those of you looking for more GPU architectural details in general, as the patent describes a complete CPU / GPU system which includes this operand collector. For example, it discusses banks of registers.
The paper also cites US Patent 8,200,949.
Be aware that these patents do not necessarily tell you what any particular GPU does. I assume that if NVIDIA has a patent issued, the intention is to use it in their GPUs.

I’d be interested in any information about the behavior of this collector on specific GPU models. How many recent operands can it store at a time? Is this number separate for each register bank? Does an operand stay in the collector indefinitely (until evicted), and what is the replacement policy?