Also, this patent is the closest to NVIDIA’s current implementation:
http://www.google.com/patents/about?id=_jKoAAAAEBAJ
And this one from 1982, cited by Fung et al., is the earliest implementation I am aware of:
http://www.google.com/patents/about?id=nsc0AAAAEBAJ
(it’s already fairly close to NVIDIA’s implementation…)
This last reference is not so good ;)
It explains how we handle forward branches in Barra 0.1, but not so much how NVIDIA does.
We are working on an updated version of the report…
It is true that there are not many academic papers on the subject, which is unfortunate: I think it has many interesting aspects, and there is still room for many optimizations.
For instance, my personal favorite is Intel’s implementation, as described in section 11.3.4.8 of:
http://intellinuxgraphics.org/Vol_4_G45_subsystem.pdf
It implements the whole mask stack as a row of counters and runs all control flow operations in constant space and constant time. I like it. :)
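To make the counter idea concrete, here is a minimal Python sketch of the general technique: one small counter per lane replaces a stack of activity masks. It is an illustration under my own assumptions, not a transcription of Intel's hardware. The class name `CounterMask`, its method names, and the exact update rules are hypothetical. A lane is active when its counter is 0; a counter of k > 0 means the lane was switched off k nesting levels ago.

```python
class CounterMask:
    """Per-lane nesting counters standing in for a stack of activity masks.

    Hypothetical sketch: names and update rules are assumptions made for
    illustration, not taken from any vendor documentation.
    """

    def __init__(self, width):
        # counter == 0: lane is active.
        # counter == k > 0: lane was disabled k nesting levels ago.
        self.count = [0] * width

    def active(self):
        """Current activity mask, one boolean per lane."""
        return [c == 0 for c in self.count]

    def if_(self, cond):
        # Lanes that are already off sink one level deeper.
        # Active lanes stay active if their condition holds, else turn off.
        self.count = [
            c + 1 if c > 0 else (0 if taken else 1)
            for c, taken in zip(self.count, cond)
        ]

    def else_(self):
        # Only lanes at depth 0 or 1 belong to the innermost if/else:
        # swap them. Lanes disabled by an enclosing level are untouched.
        self.count = [1 - c if c <= 1 else c for c in self.count]

    def endif(self):
        # Leaving one nesting level: every disabled lane comes up by one.
        self.count = [c - 1 if c > 0 else 0 for c in self.count]


# Two nested if/else blocks on a 4-lane warp.
m = CounterMask(4)
m.if_([True, True, False, False])   # outer if
m.if_([True, False, True, True])    # inner if: only lane 0 runs
inner_then = m.active()
m.else_()                           # inner else: only lane 1 runs
inner_else = m.active()
m.endif()
m.else_()                           # outer else: lanes 2 and 3 run
outer_else = m.active()
m.endif()                           # all lanes active again
```

Each operation touches every lane's counter exactly once and independently of the others, so in hardware it can be done for all lanes in parallel, and the storage is a fixed number of bits per lane (which bounds the supported nesting depth) instead of a stack that grows with each nested branch.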