SHFL for Ampere and Volta

Is SHFL still any faster than shared memory on the newer architectures?

Does SHFL use shared memory internally (on the V100 and A100 architectures)?

Is it possible that GEMM has benefited from SHFL?
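
To make the first question concrete, here is a minimal sketch of the two warp-level sums being compared. It assumes a single warp of 32 threads, and the names `warpSumShfl`, `warpSumSmem`, and `compare` are hypothetical, chosen only for illustration. It checks that both variants give the same result; it does not measure speed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant 1: warp sum via shuffle; data moves between lanes' registers.
__device__ float warpSumShfl(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // only lane 0 holds the complete sum
}

// Variant 2: the same tree reduction staged through shared memory.
__device__ float warpSumSmem(float v, float *buf) {
    int lane = threadIdx.x & 31;
    buf[lane] = v;
    __syncwarp();  // make every lane's store visible to the warp
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset) buf[lane] += buf[lane + offset];
        __syncwarp();  // called by all lanes, outside the branch
    }
    return buf[0];
}

// Hypothetical test kernel: launch with exactly one warp (32 threads).
__global__ void compare(float *out) {
    __shared__ float buf[32];
    float v = (float)threadIdx.x;  // lanes hold 0..31
    float a = warpSumShfl(v);
    float b = warpSumSmem(v, buf);
    if (threadIdx.x == 0) { out[0] = a; out[1] = b; }
}

int main() {
    float *d = nullptr, h[2] = {0.0f, 0.0f};
    cudaMalloc(&d, sizeof(h));
    compare<<<1, 32>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("shfl=%.1f smem=%.1f\n", h[0], h[1]);  // both should be 496.0
    cudaFree(d);
    return 0;
}
```

A meaningful speed comparison would need each variant in its own kernel, run for many iterations and timed (for example with CUDA events) on the V100 and A100 separately.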