Prevent blocks to overwrite same array value

Hi everyone,

I’m really new to cuda-fortran, and I’m trying to parallelize the following subfunction:

  subroutine function_tst(Y,X,A,Ilist,n1,n2,n3)
    implicit none
    integer,intent(in) :: n1,n2,n3
    integer,intent(in) :: Ilist(n1,n2)
    integer :: i
    real,intent(in) :: X(n3), A(n1,n1,n2)
    real,intent(out) :: Y(n3)
    DO i=1,n2
      Y(Ilist(:,i)) = Y(Ilist(:,i)) + MATMUL( A(:,:,i) , X(Ilist(:,i)) )
    END DO
  end subroutine function_tst

where Ilist is like a random list of numbers between 1 and n3.
The issue by trying to parallelize this is that each loop can modify the same elements from Y, which therefore blocks could overwrite each other results.

So I’m asking for some help, please.
You can find the full program in attachment, (28.1 KB)



Hi Remy,

This might be a bit tricky, but you probably can use atomic operations on the update of Y. Though since atomics only work on scalars or single elements, you’d need to modify the code to use explicit loops rather than array syntax. You might need to write out the matmul operations as well since here I believe the compiler needs to create a implicit temp array to hold the results from MATMUL.


Hi Mat,

Thank you for your answer.
So I’ve rewritten the function based on what you told me and I got this:

  attributes(global) subroutine function_tst2(Y,X,A,Ilist,n1,n2,n3)
    implicit none
    integer,device,intent(in) :: n1,n2,n3
    integer,device,intent(in) :: Ilist(n1,n2)
    integer,device :: i,j,k,l,istat
    real,device,intent(in) :: X(n3), A(n1,n1,n2)
    real,device,intent(out) :: Y(n3)
    real,device :: Buff

    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    IF (i <= n2) THEN 
      DO j=1,n1
        DO k=1,n1
            Buff = Buff + A(j,k,i) * X(Ilist(k,i))
        END DO
        istat = atomicadd( Y(l) , Buff)
      END DO
    END IF
  end subroutine function_tst2

This function gives me the right answer (same as the first function), but the calculation time is much faster.
For n1=30, n2=72150 and blocksize=1024, I got a time of 0.3sec for the first function and 4.3E-05 for the parallelized one, which is a big gain.

Thank you again.