Why scalar processors?

Also, this patent is closest to NVIDIA’s current implementation:

http://www.google.com/patents/about?id=_jKoAAAAEBAJ

And this one from 1982, cited by Fung et al., is the earliest implementation I am aware of:

http://www.google.com/patents/about?id=nsc0AAAAEBAJ

(it’s already fairly close to NVIDIA’s implementation…)

This last reference is not so good ;)

It explains how we handle forward branches in Barra 0.1, but not so much how NVIDIA does.

We are working on an updated version of the report…

It is true that there are not many academic papers on the subject. That is unfortunate, as I think it has many interesting aspects and there is still room for many optimizations.

For instance, my personal favorite is Intel’s implementation, as described in section 11.3.4.8 of:

http://intellinuxgraphics.org/Vol_4_G45_subsystem.pdf

It implements the whole mask stack as a row of counters and runs all control flow operations in constant space and constant time. I like it. :)
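To make the idea concrete, here is a minimal Python sketch of a counter-based scheme for structured if/else/endif. It is only an illustration of the general technique, not a transcription of Intel's hardware: the class name `CounterMask`, its methods, the `demo` function, and the exact update rules are all assumptions of mine. Each lane keeps one small counter instead of a stack of masks. A lane executes when its counter is zero, and a non-zero counter records how many nesting levels must close before the lane runs again.

```python
class CounterMask:
    """Hypothetical per-lane counters replacing a stack of activity masks."""

    def __init__(self, width):
        # One counter per lane; 0 means the lane is currently active.
        self.depth = [0] * width

    def active(self):
        return [d == 0 for d in self.depth]

    def if_(self, cond):
        # Active lanes that fail the condition are parked at depth 1.
        # Lanes that are already parked just go one level deeper.
        self.depth = [
            d + 1 if d > 0 else (0 if c else 1)
            for d, c in zip(self.depth, cond)
        ]

    def else_(self):
        # Swap the two sides of the innermost branch (0 <-> 1).
        # Lanes parked by an outer branch (depth > 1) are untouched.
        self.depth = [1 - d if d <= 1 else d for d in self.depth]

    def endif(self):
        # Close one nesting level: parked lanes move one step closer to 0.
        self.depth = [d - 1 if d > 0 else 0 for d in self.depth]


def demo():
    """Run a nested branch over four lanes holding x = 0, 1, 2, 3."""
    x = [0, 1, 2, 3]
    out = [None] * len(x)
    mask = CounterMask(len(x))

    def store(value):
        # Only active lanes are allowed to write their result.
        for lane, on in enumerate(mask.active()):
            if on:
                out[lane] = value

    mask.if_([v >= 2 for v in x])
    mask.if_([v == 3 for v in x])
    store("A")
    mask.else_()
    store("B")
    mask.endif()
    mask.else_()
    store("C")
    mask.endif()
    return out, mask.depth


if __name__ == "__main__":
    print(demo())
```

In this sketch every operation is a single lane-wise update with no push or pop, and the storage is one counter per lane whatever the nesting depth (up to the counter width, which real hardware would have to bound).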

Cho probably means that to get the best out of them, you need to program them in a vector fashion, i.e. as if you were working on vectors the size of a half-warp.