It would have been beneficial if you had mentioned that from the start. The name of the attribute is self-explanatory, I would claim: you are asking the compiler to impose a bound (= hard limit). Which it duly did.
The approach of squeezing the compiler on register usage in order to increase occupancy was sometimes necessary on older GPUs, say, pre-Kepler. Can you actually demonstrate performance gains from using it? I have not found it useful in almost a decade, and as a consequence have not used __launch_bounds__ either.
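
If you want to check whether the bound buys you anything, something along these lines could serve as a starting point. This is only a rough sketch under my own assumptions: the kernel body, the names kernel_plain, kernel_bounded, query_regs and report, and the choice of 256 threads per block with 4 resident blocks per SM are all made up for illustration, and the trivial body here is unlikely to show any difference. Substitute your real kernel and launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a real kernel body; substitute the kernel under test.
__device__ void body(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Variant 1: the compiler chooses the register count on its own.
__global__ void kernel_plain(int n, float a, const float* x, float* y) {
    body(n, a, x, y);
}

// Variant 2: at most 256 threads per block, and the compiler is asked to keep
// register use low enough for 4 resident blocks per SM (spilling if it must).
__global__ void __launch_bounds__(256, 4)
kernel_bounded(int n, float a, const float* x, float* y) {
    body(n, a, x, y);
}

using KernelPtr = void (*)(int, float, const float*, float*);

// Registers per thread that the compiler actually assigned to the kernel.
int query_regs(KernelPtr k) {
    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, k);
    return attr.numRegs;
}

// Print registers, local memory (spills typically show up here),
// theoretical resident blocks per SM, and the time of one launch.
void report(const char* name, KernelPtr k, int n, const float* x, float* y) {
    const int block = 256;
    const int grid = (n + block - 1) / block;

    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, k);
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, k, block, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<grid, block>>>(n, 2.0f, x, y);   // warm-up launch, not timed
    cudaEventRecord(start);
    k<<<grid, block>>>(n, 2.0f, x, y);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("%-15s regs=%d local=%zu B blocks/SM=%d time=%.3f ms\n",
           name, attr.numRegs, attr.localSizeBytes, blocksPerSM, ms);
}

int main() {
    const int n = 1 << 24;
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    report("kernel_plain", kernel_plain, n, x, y);
    report("kernel_bounded", kernel_bounded, n, x, y);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

If the bounded variant shows fewer registers but non-zero local memory and no better time, the limit is only forcing spills, and you are probably better off without it.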