Whether to support a load-execute instruction type is a design choice made by processor architects. From work on x86 processors I know that load-execute instructions can improve efficiency via a reduction in dynamic instruction count and in execution latency, at the expense of making the control path more complicated. The alternative approach, typically chosen by “RISC” processors, is to employ a pure load-store architecture.
It is possible that NVIDIA's processor architects changed their approach relative to older GPUs and that load-execute instructions are no longer supported in the latest GPUs, presumably to simplify op steering / scheduling. I have not checked into that.
Before we reach such a conclusion, you might want to check that the code idiom you are observing is not simply a consequence of multiple uses of the same constant-memory data. In the case of multiple uses of the same constant-memory data object, moving the data to a register first may have advantages in the compiler’s view. This could be driven by any kind of heuristic considering instruction scheduling, resource contention, energy efficiency, etc.
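One way to check this is a small test with two kernels that differ only in how often they use the constant. This is a minimal sketch, not a definitive experiment: the names `scale`, `single_use`, and `multi_use` are made up for illustration, and the compiler is free to generate different code than one might expect. Build it for each architecture of interest, for example with `nvcc -arch=sm_89 -cubin -o test.cubin test.cu`, then inspect the machine code with `cuobjdump -sass test.cubin` and look for constant-bank operands of the form `c[bank][offset]` used directly in arithmetic instructions versus separate loads into registers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical constant-memory object, used only for this experiment.
__constant__ float scale;

// The constant is used exactly once per thread.
__global__ void single_use(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale;
}

// The same constant is used three times per thread.
__global__ void multi_use(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = (x * scale + scale) * scale;
    }
}

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    const int n = 256;
    const float h_scale = 2.0f;
    float h_in[n], h_out[n];
    float *d_in = 0, *d_out = 0;

    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    check(cudaMalloc(&d_in, n * sizeof(float)), "cudaMalloc d_in");
    check(cudaMalloc(&d_out, n * sizeof(float)), "cudaMalloc d_out");
    check(cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice),
          "cudaMemcpy H2D");
    check(cudaMemcpyToSymbol(scale, &h_scale, sizeof(h_scale)),
          "cudaMemcpyToSymbol");

    single_use<<<(n + 127) / 128, 128>>>(d_out, d_in, n);
    check(cudaGetLastError(), "launch single_use");
    check(cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost),
          "cudaMemcpy D2H (single_use)");
    printf("single_use: out[3] = %f\n", h_out[3]);

    multi_use<<<(n + 127) / 128, 128>>>(d_out, d_in, n);
    check(cudaGetLastError(), "launch multi_use");
    check(cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost),
          "cudaMemcpy D2H (multi_use)");
    printf("multi_use:  out[3] = %f\n", h_out[3]);

    check(cudaFree(d_in), "cudaFree d_in");
    check(cudaFree(d_out), "cudaFree d_out");
    return EXIT_SUCCESS;
}
```

If `single_use` also loads the constant into a register first on a given target, multiple uses of the same data would not explain the idiom on that target. The host code is only there to make the program complete; for the comparison, only the SASS of the two kernels matters.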
While that is true, an access to constant memory may still be more expensive than an access to the register file. In older GPUs, the assumption was that the costs are “close enough” and that a constant-memory access occurs with “near-register speed”, making constant memory access an attractive alternative especially on register-starved early GPU architectures. The balance may have shifted more strongly in favor of register access for best performance in the recent past, but that is just (reasoned) speculation. I don’t think NVIDIA has published a detailed discussion of these tradeoffs for their recent GPU architectures.
[Later:]
After poking around a bit, I see load-execute instructions with constant-bank references being generated up to and including sm_89, but not for sm_90 and later architectures. So it does seem like load-execute instructions no longer exist in the latest GPU architectures.