To get stride-1 access, I have to transpose some 2D arrays in my code. Let's say I have an array A(6,2’000’000) and I want to transpose it. Is it better to do this on the host, or is there a way to make it fast on the device using the accelerator model?
If I do it naively, like this:
!$acc region
do i=1,6
  do j=1,knend
    A_transposed(j,i)=A(i,j)
  end do
end do
!$acc end region
I get very poor performance.
Is there an efficient way of transposing arrays on the device using the PGI Accelerator model?