About register bank conflicts

We know about bank conflicts for shared memory.
I want to know: is there any similar behavior, or any documented detail, for registers in CUDA?

There is no official documentation, but several people have reverse-engineered a reasonable model. On Maxwell at least, there are four register banks, and the source operands of any instruction, such as an FMA, can read at most one value from each bank per clock. Operands that share a bank cost extra cycles to read.
This happens at the SASS level, so the ptxas compiler handles register assignment and juggles the mapping to minimize conflicts for you. This is easier on Maxwell and Pascal, which have an argument reuse cache that often reduces the number of register reads needed for operand collection.
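
To make the model above concrete, here is a rough Python sketch of how one could count conflicts for a single instruction. It assumes the bank of a register is its index modulo 4 (the mapping usually quoted for the reverse-engineered Maxwell model, not anything NVIDIA documents), that reading the same register twice counts as one read, and that an operand served from the reuse cache needs no bank read. The names `bank_of` and `extra_read_cycles` are made up for illustration.

```python
from collections import Counter

NUM_BANKS = 4  # assumption: four banks, per the reverse-engineered Maxwell model


def bank_of(reg):
    # Assumption: bank = register index modulo the number of banks.
    return reg % NUM_BANKS


def extra_read_cycles(source_regs, reused=()):
    """Estimate the extra clocks needed to read an instruction's source operands.

    source_regs: register indices of the source operands (e.g. [1, 5, 9]).
    reused: register indices assumed to come from the reuse cache.
    """
    # Duplicates collapse to one read; reuse-cache hits need no bank read.
    distinct = set(source_regs) - set(reused)
    per_bank = Counter(bank_of(r) for r in distinct)
    # One read per bank per clock, so the busiest bank sets the cost.
    return max(per_bank.values(), default=1) - 1


# FFMA-style operand sets (destination register ignored):
print(extra_read_cycles([1, 2, 3]))                 # three different banks
print(extra_read_cycles([1, 5, 9]))                 # all three in bank 1
print(extra_read_cycles([1, 5, 9], reused=[5, 9]))  # two operands from the reuse cache
```

Under these assumptions the first call reports no extra cycles, the second reports two, and the third reports none again, which is the kind of juggling ptxas does when it picks register numbers.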

More info here and especially here. Scott’s writing is where I learned all this.

Worrying about register banking is code analysis at the very very lowest level… even top CUDA programmers will not worry about it since ptxas will optimize things for you. SASS hackers will indeed worry about it, but those with such skill and nerve are exceedingly rare, universally honored and respected, and completely insane.

I myself have noticed one effect of register banking. In my PRNG research, I have thousands of small composite routines, mostly tight integer math, that I’ve written by hand just to brute-force test for entropy mixing and speed. Whenever a new CUDA toolkit is released, I rerun the benchmarks of these to look for compiler performance regressions, and report them as bugs if they’re significant. It turns out that perhaps 5% of the routines will jump in speed, up or down, by anywhere from 5% to 25%. Analyzing the SASS for these cases shows the cause tends to be a different register assignment with more or fewer bank conflicts, which produces the speedup or slowdown.
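
The toolkit-to-toolkit comparison described above could be sketched roughly like this. It is only an illustration, not my actual harness: the function name `flag_changes`, the routine names, and the timings are all hypothetical, and the 5% threshold just mirrors the low end of the range mentioned above.

```python
def flag_changes(old_ms, new_ms, threshold=0.05):
    """Return {routine: relative change} for routines whose runtime moved
    by at least `threshold` between two toolkit versions.

    old_ms / new_ms: dicts mapping routine name -> runtime in milliseconds.
    A positive change means the routine got slower with the new toolkit.
    """
    flagged = {}
    for name, old in old_ms.items():
        # Skip routines missing from the new run or with unusable timings.
        if name not in new_ms or old <= 0:
            continue
        change = (new_ms[name] - old) / old
        if abs(change) >= threshold:
            flagged[name] = change
    return flagged


# Hypothetical timings for three routines under two toolkit versions.
old = {"mix_a": 10.0, "mix_b": 10.0, "mix_c": 10.0}
new = {"mix_a": 12.0, "mix_b": 10.2, "mix_c": 8.0}
for name, change in sorted(flag_changes(old, new).items()):
    print(f"{name}: {change:+.0%}")
```

The flagged routines are the ones worth disassembling to compare register assignments between the two toolkit versions.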

Thank you very much.
I’ve read your paper,
but I can’t access http://code.google.com/p/asfermi/, so I can’t get asfermi. Do you have another URL for it, or do you know of any similar tools?