CUDA Fortran: Issues with vector accessing patterns

Hello,

I have recently encountered a problem in my code when trying to access vectors in a certain way. I tried to break it down as compactly as possible in this example code:

      module cuda_subs
      implicit none
      integer, parameter :: pr=8
  
      contains
!-------------------------------------------------------------------------------

      attributes(global) subroutine test(N,M,perm1_d,perm2_d,x_in_d, x_out_d, x_inout_d, check1_d,check2_d)
      integer,value, intent(in)    :: N,M

      integer, intent(in)          :: perm1_d(N*M), perm2_d(N*M)
      real(kind=pr), intent(out)   :: check1_d(5,N*M),  check2_d(5,N*M)

      real(kind=pr), intent(in)    :: x_in_d(N*M) 
      real(kind=pr), intent(out)   :: x_out_d(N*M)
      real(kind=pr), intent(inout) :: x_inout_d(N*M)

      real(kind=pr) :: x_local1_d(N*M), x_local2_d(N*M)

      integer :: t, i
      
      
      t=threadIdx%x
      i=blockDim%x*(blockIdx%x-1) + threadIdx%x 

      x_out_d    = 44.0_pr
      x_inout_d  = 55.0_pr
      x_local1_d = 66.0_pr
      x_local2_d = 77.0_pr

      if(t <= N) then
        x_local1_d(i) = x_in_d(i)
        x_local2_d(i) = x_in_d(i)

        x_out_d(i)   = x_local1_d(i)
        x_inout_d(i) = x_local2_d(i) 


         check1_d(1,i) = x_in_d(perm1_d(i))
         check1_d(2,i) = x_out_d(perm1_d(i))
         check1_d(3,i) = x_inout_d(perm1_d(i))
         check1_d(4,i) = x_local1_d(perm1_d(i))
         check1_d(5,i) = x_local2_d(perm1_d(i))

         check2_d(1,i) = x_in_d(perm2_d(i))
         check2_d(2,i) = x_out_d(perm2_d(i))
         check2_d(3,i) = x_inout_d(perm2_d(i))
         check2_d(4,i) = x_local1_d(perm2_d(i))
         check2_d(5,i) = x_local2_d(perm2_d(i))
      endif

      end subroutine test

!-------------------------------------------------------------------------------
       end module cuda_subs
!##########################################################################
!##########################################################################


       program main
       use cuda_subs
       use cudafor
       implicit none
       integer, parameter    :: N=3,M=3

       real(kind=pr), device :: x_in_d(N*M), x_out_d(N*M)
       real(kind=pr), device :: x_inout_d(N*M), x_local1_d(N*M)
       real(kind=pr), device :: check1_d(5,N*M), check2_d(5,N*M)
       integer,       device :: perm1_d(N*M), perm2_d(N*M)

       real(kind=pr) :: x_in(N*M), x_out(N*M), x_inout(N*M)
       real(kind=pr) :: check1(5,N*M), check2(5,N*M) 
       
       integer :: i,j, NM

       type(dim3) :: block_grid, thread_block 
  

       NM=N*M
       do i=1,NM
          x_in_d(i) = i
       enddo
       perm1_d(1:9) = (/ 1,2,3,4,5,6,7,8,9 /)   
       perm2_d(1:9) = (/ 9,8,7,6,5,4,3,2,1 /)

       block_grid   = dim3(M,1,1) 
       thread_block = dim3(N,1,1)

       call test<<<block_grid,thread_block>>>(N,M,perm1_d,perm2_d,x_in_d, x_out_d, x_inout_d, check1_d,check2_d)

       check1=check1_d
       check2=check2_d

       write(*,'(//,A8,/)') 'check1 ='
       do j=1,5
          write(*,'(F5.2,x,F5.2,x,F5.2,x, &
                    F5.2,x,F5.2,x,F5.2,x, &
                    F5.2,x,F5.2,x,F5.2,x)'),  &
                               check1(j,1), &
                               check1(j,2), &
                               check1(j,3), &
                               check1(j,4), &
                               check1(j,5), &
                               check1(j,6), &
                               check1(j,7), &
                               check1(j,8), &
                               check1(j,9)
       enddo

       write(*,'(//,A8,/)') 'check2 ='
       do j=1,5
          write(*,'(F5.2,x,F5.2,x,F5.2,x, &
                    F5.2,x,F5.2,x,F5.2,x, &
                    F5.2,x,F5.2,x,F5.2,x)'),  &
                               check2(j,1), &
                               check2(j,2), &
                               check2(j,3), &
                               check2(j,4), &
                               check2(j,5), &
                               check2(j,6), &
                               check2(j,7), &
                               check2(j,8), &
                               check2(j,9)
       enddo
 

       end program main

In this example code I only use three blocks of three threads each to keep it simple. The basic idea is that I have several (five) vectors x_*_d on the device, three of them am I passing via the subroutine call, and two are local (though one of these two is already declared in the main function). I initialize x_in_d to be [1,2,3,4,5,6,7,8,9], and I set up two “permutation vectors” to be [1,2,3,4,5,6,7,8,9] and [9,8,7,6,5,4,3,2,1], respectively.
In the subroutine I basically copy x_in_d to all the other x-vectors on the device and then copy all five x-vectors to my check-matrices (one using the first permutation vector, another using the second one). At last I copy those matrices to the RAM memory and print them to the screen.
The output is as follows:

check1 =

 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00     ! x_in_d
 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00     ! x_out_d
55.00 55.00 55.00  4.00  5.00  6.00  7.00  8.00  9.00     ! x_inout_d
 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00     ! x_local1_d
 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00     ! x_local2_d


check2 =

 9.00  8.00  7.00  6.00  5.00  4.00  3.00  2.00  1.00     ! x_in_d
44.00 44.00 44.00  6.00  5.00  4.00 44.00 44.00 44.00     ! x_out_d
55.00 55.00 55.00  6.00  5.00  4.00 55.00 55.00 55.00     ! x_inout_d
66.00 66.00 66.00 66.00  5.00 66.00 66.00 66.00 66.00     ! x_local1_d
77.00 77.00 77.00 77.00  5.00 77.00 77.00 77.00 77.00     ! x_local2_d

The first check-matrix reveals that all five x-vectors have the desired content and therefore x_in_d seems to be copied to the other four x-vectors correctly. However, when I reverse the order in which the elements are accessed only x_in_d has the expected behaviour, x_out_d and x_inout_d only seem to work as expected from element four on, and the local x-vectors only work correctly on the fifth element (where both permutation vector have the same value). In the positions in which the last four x-vectors do not behave as expected their values are the ones they were initialised to before x_in_d was copied to them.

Since all of the x-vectors should reside in the global memory of the GPU (shouldn’t they?) I do not understand why the order in which what thread accesses what element should make a difference. Though it appears that there is some scoping issue here.

Does anybody know why this is happening and how I could avoid this problem? I am working with the Compressed Sparse Row matrix storage, and this will have to deal with accesses of the form vec1(vec2(i)).

Nils

Hi Nils,

The problem here is that automatic arrays need to be “shared” and the size of the shared memory be passed in as the third argument of the launch configuration. For example:

% cat test.cuf

       module cuda_subs
       implicit none
       integer, parameter :: pr=8

       contains
 !-------------------------------------------------------------------------------

       attributes(global) subroutine test(N,M,perm1_d,perm2_d,x_in_d, x_out_d, x_inout_d, check1_d,check2_d)
       integer,value, intent(in)    :: N,M

       integer, intent(in)          :: perm1_d(N*M), perm2_d(N*M)
       real(kind=pr), intent(out)   :: check1_d(5,N*M),  check2_d(5,N*M)

       real(kind=pr), intent(in)    :: x_in_d(N*M)
       real(kind=pr), intent(out)   :: x_out_d(N*M)
       real(kind=pr), intent(inout) :: x_inout_d(N*M)

       real(kind=pr),shared :: x_local1_d(N*M), x_local2_d(N*M)

       integer :: t, i


       t=threadIdx%x
       i=blockDim%x*(blockIdx%x-1) + threadIdx%x

       x_out_d    = 44.0_pr
       x_inout_d  = 55.0_pr
       x_local1_d = 66.0_pr
       x_local2_d = 77.0_pr
       if(t <= N) then
         x_local1_d(i) = x_in_d(i)
         x_local2_d(i) = x_in_d(i)

         x_out_d(i)   = x_local1_d(i)
         x_inout_d(i) = x_local2_d(i)


          check1_d(1,i) = x_in_d(perm1_d(i))
          check1_d(2,i) = x_out_d(perm1_d(i))
          check1_d(3,i) = x_inout_d(perm1_d(i))
          check1_d(4,i) = x_local1_d(perm1_d(i))
          check1_d(5,i) = x_local2_d(perm1_d(i))

          check2_d(1,i) = x_in_d(perm2_d(i))
          check2_d(2,i) = x_out_d(perm2_d(i))
          check2_d(3,i) = x_inout_d(perm2_d(i))
          check2_d(4,i) = x_local1_d(perm2_d(i))
          check2_d(5,i) = x_local2_d(perm2_d(i))
       endif

       end subroutine test

 !-------------------------------------------------------------------------------
        end module cuda_subs
 !##########################################################################
 !##########################################################################


        program main
        use cuda_subs
        use cudafor
        implicit none
        integer, parameter    :: N=3,M=3

        real(kind=pr), device :: x_in_d(N*M), x_out_d(N*M)
        real(kind=pr), device :: x_inout_d(N*M), x_local1_d(N*M)
        real(kind=pr), device :: check1_d(5,N*M), check2_d(5,N*M)
        integer,       device :: perm1_d(N*M), perm2_d(N*M)

        real(kind=pr) :: x_in(N*M), x_out(N*M), x_inout(N*M)
        real(kind=pr) :: check1(5,N*M), check2(5,N*M)

        integer :: i,j, NM

        type(dim3) :: block_grid, thread_block


        NM=N*M
        do i=1,NM
           x_in_d(i) = i
        enddo
        perm1_d(1:9) = (/ 1,2,3,4,5,6,7,8,9 /)
        perm2_d(1:9) = (/ 9,8,7,6,5,4,3,2,1 /)

        block_grid   = dim3(M,1,1)
        thread_block = dim3(N,1,1)

        call test<<<block_grid,thread_block,N*M*2*pr>>>(N,M,perm1_d,perm2_d,x_in_d, x_out_d, x_inout_d, check1_d,check2_d)

        check1=check1_d
        check2=check2_d

        write(*,'(//,A8,/)') 'check1 ='
        do j=1,5
           write(*,'(F5.2,x,F5.2,x,F5.2,x, &
                     F5.2,x,F5.2,x,F5.2,x, &
                     F5.2,x,F5.2,x,F5.2,x)'),  &
                                check1(j,1), &
                                check1(j,2), &
                                check1(j,3), &
                                check1(j,4), &
                                check1(j,5), &
                                check1(j,6), &
                                check1(j,7), &
                                check1(j,8), &
                                check1(j,9)
        enddo

        write(*,'(//,A8,/)') 'check2 ='
        do j=1,5
           write(*,'(F5.2,x,F5.2,x,F5.2,x, &
                     F5.2,x,F5.2,x,F5.2,x, &
                     F5.2,x,F5.2,x,F5.2,x)'),  &
                                check2(j,1), &
                                check2(j,2), &
                                check2(j,3), &
                                check2(j,4), &
                                check2(j,5), &
                                check2(j,6), &
                                check2(j,7), &
                                check2(j,8), &
                                check2(j,9)
        enddo


        end program main
% pgf90 -Mcuda test.cuf -V16.1; a.out


check1 =

 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00
 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00
 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00
 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00
 1.00  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00


check2 =

 9.00  8.00  7.00  6.00  5.00  4.00  3.00  2.00  1.00
 9.00  8.00  7.00  6.00  5.00  4.00  3.00  2.00  1.00
 9.00  8.00  7.00  6.00  5.00  4.00  3.00  2.00  1.00
66.00 66.00 66.00  6.00  5.00  4.00 66.00 66.00 66.00
77.00 77.00 77.00  6.00  5.00  4.00 77.00 77.00 77.00

The local output is correct in the second case. Block1 executes I=1,2,3, Block2 I=4,5,6 and Block3 I=7,8,9. Since the local arrays are shared within the block, only the corresponding “I” indices are set. So for Block1, only indices 1,2,and 3 are set with the rest having the initial value. Hence when you reverse the order, 9=>1, 8=>2, 7=>1, and these get set to the initial value. Only when I=4,5,6 does it map with the updated values, 4=>6, 5=>5, 6=>4.

Note that if local arrays were truly local, only I=5 would get the updated value.

Hope this helps,
Mat

Hi Mat,

thank you so much for your quick reply. I am still interested to know what happens behind the stage and also played a bit around with the code again and found some other weird behaviour (at least weird to me).

I assume that given that x_in_d is read-only it should be stored in the constant memory where it is accessible by all the threads. Next, I understand that x_local2_d is truly local and an automatic array variable stored in the global memory, however its scope is limited to single threads only which means that indeed all threads get a private copy of it. These private copies are than al intitialised to be 77.0 and later on the ith element of it will be assigned to x_in_d(i). On the other hand, x_local1_d should be global, as it has the device attribute when I declared it in the main program. Maybe me calling it “local” was not the best choice. But I wanted to distinguish it from the global array variables x_(in)out_d which are passed directly to the subroutine via its call from the main program.

I understand that when I declare the arrays x_local*_d to be shared in the way you did that this behaviour is correct.

I am most intrigued with the behaviour of the global variables though. As global variables (indicated by the device attribute) they should be stored in the global memory and accessible to all threads of all kernels. It is therefore still not clear why x_local1_d does not behave “corectly”. But looking at the variables x_(in)out_d might be even more interesting.
When changing the order in which I copy x_in_d the the other four variables to

         x_out_d(i)   = x_in_d(i)
         x_inout_d(i) = x_in_d(i)

         x_local1_d(i) = x_out_d(i)
         x_local2_d(i) = x_inout_d(i)

the output for the check matrix is only as the one you showed when the variables x_local*_d are shared, otherwise the output is as in my initial post.
Moreover, when commenting out the assignments x_local*d(i) = x(in)out_d(i) and only passing the non “local” variables to the check2-matrix so that the matrix is of size (3,NM) while still initialising the local variables to 66.0 and 77.0, respectively, the output is only correct/as expected when x_local_d are shared. However completely commenting out x_local*_d yields the correct result, obviously.

Could you be so kind and tell me why x_local1_d does not seem to be a global variable, why I get so strange outputs for the check2 matrix even when it does not depend on the x_local*_d variables anymore, and tell me if I made any assumptions here that are not correct?

Nils

Hi Nils,

I assume that given that x_in_d is read-only it should be stored in the constant memory where it is accessible by all the threads.

Arrays with intent in will be placed into texture memory. Constant memory if limited in size and silently fails if you exceed the available amount. Hence, the compiler wont use it by default. If you want the array to be placed in constant memory, use the “constant” attribute instead of “device”.

Could you be so kind and tell me why x_local1_d does not seem to be a global variable,

The “x_local1_d” in main is a different variable than the one declared in the kernel. Just because two variables have the same name, does not make them the same variable. They will be different if declared in different scopes. If you want “x_local1_d” to be global, declare it as a module variable.

why I get so strange outputs for the check2 matrix even when it does not depend on the x_local*_d variables anymore, and tell me if I made any assumptions here that are not correct?

Automatic arrays must be shared. If you remove the “shared” attribute and then access these arrays, you’ll most likely be stomping on other memory. If you make them fixed size, then the error will most likely go away.

Hope this helps,
Mat