The level of benefit may depend to some degree on the architecture. I haven't really studied the PTX guide for these details, but the programming guide and, if memory serves, the A100 whitepaper give some description of the benefits, at least with respect to the CUDA C++ version/intrinsics.
I think even a basic perusal of the generated SASS will show what the register usage is. AFAIK, for cc 8.0 and beyond, the copy should not require registers to hold the data while it is in flight from global to shared. Registers are still used, of course, to hold addresses and so forth.
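If you want something concrete to run through the disassembler, here is a minimal sketch using the pipeline primitives from `<cuda_pipeline.h>`. The kernel name `staged_double`, the block size `kThreads`, and the host helper `run_staged_double_demo` are all made up for illustration; this is not a tuned or canonical usage, just enough to produce an async global-to-shared copy you can look at.

```cuda
#include <cuda_runtime.h>
#include <cuda_pipeline.h>
#include <vector>

constexpr int kThreads = 128;  // hypothetical block size for this sketch

// Each thread stages one 4-byte element from global into shared memory
// with the pipeline primitives, then writes it back out doubled.
__global__ void staged_double(const int* __restrict__ in,
                              int* __restrict__ out, int n)
{
    __shared__ int tile[kThreads];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) {
        // Asynchronous global->shared copy of sizeof(int) == 4 bytes
        // (the primitive accepts copy sizes of 4, 8, or 16 bytes).
        __pipeline_memcpy_async(&tile[threadIdx.x], &in[idx], sizeof(int));
    }
    __pipeline_commit();        // close the current batch of async copies
    __pipeline_wait_prior(0);   // wait until this thread's batches are done
    __syncthreads();            // make the staged tile visible block-wide

    if (idx < n) {
        out[idx] = 2 * tile[threadIdx.x];
    }
}

// Hypothetical host-side check: returns 0 on success, nonzero otherwise.
int run_staged_double_demo()
{
    const int n = 1000;
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + kThreads - 1) / kThreads;
    staged_double<<<blocks, kThreads>>>(d_in, d_out, n);

    // The blocking copy back also surfaces any kernel execution error.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return 2;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2 * h_in[i]) return 3;
    }
    return 0;
}
```

To look at the SASS, something like `nvcc -arch=sm_80 -cubin -o demo.cubin demo.cu` followed by `cuobjdump -sass demo.cubin` should do (the file names are placeholders). On cc 8.0 I would expect the copy to show up as an `LDGSTS` instruction rather than a separate `LDG` into a register followed by an `STS`; if you build the same source for an older architecture, the primitives should fall back to an ordinary copy that goes through registers, which makes for a useful side-by-side comparison.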