Transposing 2dim Array

Hello

In order to get stride-1 access I have to transpose some 2-dim arrays in my code. Let's say I have an array A(6,2’000’000) and I want to transpose it. Is it better to do this on the host, or is there a way to make it fast on the device using the accelerator model?
If I simply do it like this:

!$acc region
     do i=1,6
        do j=1,knend
            A_transposed(j,i)=A(i,j)
        end do
     end do
!$acc end region

I get very poor performance.
Is there an efficient way of transposing arrays on the device using the PGI Accelerator model?

Thank you!

Hi elephant,

My guess is that the poor performance you are seeing is due to extra host-to-device copies. Are you using data regions? Try something like the following, where A is copied to and from the device only once and A_transposed is local to the device, so it is never copied.

% cat transpose.f90 

program trans
implicit none
real, allocatable, dimension(:,:) :: A, A_transposed
integer :: i, j, knend
knend=200000
allocate(A(6,knend), A_transposed(knend,6))
A=1.0

!$acc data region copy(A), local(A_transposed)
! Copy A to the device and create A_transposed locally on the device

!$acc region
     do i=1,6
        do j=1,knend
            A_transposed(j,i)=A(i,j)
        end do
     end do
!$acc end region 

! Work on the transposed copy; the inner loop over j now walks memory with stride 1
!$acc region
     do i=1,6
        do j=1,knend
            A_transposed(j,i)=A_transposed(j,i)*6.0 
        end do
     end do
!$acc end region 

!$acc region
     do i=1,6
        do j=1,knend
            A(i,j)=A_transposed(j,i)
        end do
     end do
!$acc end region 
!$acc end data region  
! A is copied back to the host here

print *, A(3,100)
deallocate(A, A_transposed)

end program trans
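
To sanity-check the index order of the transpose loop independently of the accelerator, one option is to run the same loop nest on the host and compare it with the standard Fortran intrinsic transpose. This is only a minimal sketch: the program name check_trans, the small knend, and the fill pattern are made up for illustration and are not part of the example above.

```fortran
program check_trans
implicit none
real, allocatable, dimension(:,:) :: A, A_transposed
integer :: i, j, knend

knend = 1000   ! hypothetical small size, chosen only for this check
allocate(A(6,knend), A_transposed(knend,6))

! Fill A with distinct values so a wrong index order would show up
do j = 1, knend
   do i = 1, 6
      A(i,j) = real(i) + 10.0*real(j)
   end do
end do

! Same loop nest as in the accelerator region, run on the host here
do i = 1, 6
   do j = 1, knend
      A_transposed(j,i) = A(i,j)
   end do
end do

! Compare element by element against the intrinsic transpose
if (all(A_transposed == transpose(A))) then
   print *, 'transpose OK'
else
   print *, 'transpose MISMATCH'
end if

deallocate(A, A_transposed)
end program check_trans
```

Filling A with 1.0 everywhere, as in the program above, would not catch swapped indices, which is why the sketch uses values that differ per element.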

Hope this helps,
Mat