Such a breakeven limit is likely a function of the GPU’s instruction cache size, as well as your code’s locality (which could easily be data-dependent).
In practice, you can likely just try several unroll values, pick the fastest, and stop worrying. The GPU’s instruction cache size and behavior (and even its existence!) are undocumented, and as a guess, simple empirical testing of a few values of N will likely do as well as any deeper analysis or heuristic.
I wonder whether CPU compilers attempt this kind of optimization, perhaps using profile-guided optimization to identify hot code that is worth unrolling further even at the cost of consuming more instruction cache.