Improving a Memory-Bound Operation

I am looking for a way to improve the performance of panel factorization in LU decomposition. I found an old document about vectorized memory access (Increase Performance with Vectorized Memory Access). Does this idea still work with recent versions of CUDA and with Volta and Ampere devices?

I know that panel factorization is a memory-bound operation, so which technology in the newer generations of CUDA and GPUs can improve its performance?

Vectorized access is possible on any CUDA device. Whether or not it increases performance in your use case, I cannot say.
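
For reference, here is a minimal sketch of what vectorized access looks like, using a plain copy kernel rather than a panel factorization. The kernel names and the host helper `count_copy_mismatches` are made up for illustration. The sketch assumes the buffers come from `cudaMalloc`, so they are aligned well enough for `float4` accesses; a pointer into the middle of a matrix column may not be, and would need separate handling.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Scalar version: each thread issues one 4-byte load and one 4-byte store per element.
__global__ void copy_scalar(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride) {
        out[i] = in[i];
    }
}

// Vectorized version: each thread moves 16 bytes per access through float4.
// Assumption: `in` and `out` are 16-byte aligned (true for cudaMalloc pointers).
__global__ void copy_vec4(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int n4 = n / 4;  // number of complete float4 groups

    const float4* in4 = reinterpret_cast<const float4*>(in);
    float4* out4 = reinterpret_cast<float4*>(out);
    for (int i = idx; i < n4; i += stride) {
        out4[i] = in4[i];
    }

    // Remaining 0 to 3 elements are copied one at a time.
    for (int i = 4 * n4 + idx; i < n; i += stride) {
        out[i] = in[i];
    }
}

// Hypothetical host helper: runs copy_vec4 on n floats and returns the
// number of mismatching elements, or -1 if a CUDA call fails.
int count_copy_mismatches(int n) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        copy_vec4<<<256, 256>>>(d_in, d_out, n);
        // This blocking copy also surfaces errors from the kernel launch.
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) ++mismatches;
    }
    return mismatches;
}
```

Calling `count_copy_mismatches(1 << 20)` from `main` and checking for a zero result is enough to confirm the vectorized kernel copies correctly. To see whether it actually helps, time `copy_scalar` against `copy_vec4` with a profiler on your own device and your own access pattern; this sketch makes no claim about the outcome.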