NVCC potentially missing a memory optimization

I see, I misunderstood your question. In this case, the fused kernel without this optimization is also around 15% slower than with it.