Randomly shuffling an array in parallel

I am working on shuffling the data in a 2D array: I need to randomly swap all the elements of one row with all the elements of another row, so that whole rows are shuffled. I know Fisher-Yates solves this well in serial. However, I am working with CUDA on a GPU and need to shuffle the array in parallel. The serial implementation would still run on the GPU, but that would waste its parallel computational resources. Any suggestions would be much appreciated. Thanks.
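
For reference, here is a minimal sketch of the serial baseline mentioned above, written as host-side C++. It assumes a row-major `rows x cols` array of `float` stored in a `std::vector`. The name `shuffle_rows` and the choice of `std::mt19937` are illustrative assumptions, not part of any existing code.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Serial Fisher-Yates over the rows of a row-major (rows x cols) array.
// shuffle_rows is a hypothetical helper name used only for illustration.
void shuffle_rows(std::vector<float>& data, std::size_t rows, std::size_t cols,
                  std::mt19937& rng) {
    if (rows < 2) return;  // nothing to shuffle
    for (std::size_t i = rows - 1; i > 0; --i) {
        // Pick j uniformly from [0, i], then swap row i with row j.
        std::uniform_int_distribution<std::size_t> pick(0, i);
        const std::size_t j = pick(rng);
        if (j == i) continue;
        // Element-wise swap of two whole rows.
        for (std::size_t c = 0; c < cols; ++c) {
            std::swap(data[i * cols + c], data[j * cols + c]);
        }
    }
}
```

The outer loop is the sequential part, since each step depends on the swaps made before it. The inner loop over columns touches independent elements, so that part would map onto threads directly.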