Why can't nvcc optimize away repeated loads from the same address?

godbolt

I would like to know why this case is not optimized. Can anyone explain?
I am implementing a CUDA function in which, through C++ templates, the compiler can determine that several loads read the same address, yet it does not merge them into a single load.
The Godbolt link above is a simple example.
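
For readers without the link open, here is a hypothetical sketch of the kind of pattern in question. It is not the code from the Godbolt link, and the names `HD`, `store_reload`, and `store_hoisted` are made up for illustration. It assumes one possible cause, pointer aliasing: if the destination pointer may overlap the source, every store could change the value at the loaded address, so the compiler generally has to reload it. Whether this is what happens in the linked example is an assumption.

```cpp
// Hypothetical macro so the sketch builds with nvcc and with a plain host compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical function: reads src[0] on every iteration.
// dst may alias src, so each store to dst[i] could modify src[0];
// the load of src[0] therefore cannot be assumed to yield the same value.
template <int N>
HD void store_reload(const float* src, float* dst) {
    for (int i = 0; i < N; ++i) {
        dst[i] = src[0] + 1.0f;
    }
}

// Same computation with the load hoisted by hand into a local variable.
// This is only equivalent to store_reload when src and dst do not overlap.
template <int N>
HD void store_hoisted(const float* src, float* dst) {
    const float v = src[0];  // single load
    for (int i = 0; i < N; ++i) {
        dst[i] = v + 1.0f;
    }
}
```

When the two pointers refer to the same buffer, the two versions produce different results, which is why merging the loads would not be a safe transformation in general. If the pointers are known never to overlap, qualifying them with `__restrict__` is one way to tell the compiler so; whether that changes the generated code for the linked example is not something this sketch shows.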