Dense array - Dense matrix multiplication

How to implement a dense array - dense matrix (axm) multiplication in OpenACC. I tried this:

#pragma acc data copyin(m[0:(N*N)]) copy(a[0:N])
    {
        #pragma acc region
        for(int i = 0; i < N; i++){
            #pragma acc for seq
            for(int j = 0; j < N; j++)
                a[i] = sum(a[i], a[j] * m[i + j*N]);
    }

But it seems that it is not being well parallelized. I want each thread compute a position of the resultant array, i.e, a a value.

Hi dsilva92883,

The code as written isn’t parallelizable and I don’t think correct even for the sequential case. The problem being is that you’re updating A while also reading from it so getting conflicts.

What I would do, is create a copy of A to store it’s old values before updating it. Something like:

#pragma acc data copyin(m[0:(N*N)]) copy(a[0:N]) create(aold[0:N])
   {
    #pragma acc kernels
     {
        for(int i = 0; i < N; i++){
           aold[i] = a[i];
        }
        #pragma acc loop gang
        for(int i = 0; i < N; i++){
            sum = 0.0;
            #pragma acc loop vector reduction(+:sum)
            for(int j = 0; j < N; j++)
                sum +=  aold[j] * m[i + j*N]);
            a[i] = sum;
        }
      }
    }

Hope this helps,
Mat

Thanks, Mat. Another question: Imagine I’m doing this multiplication inside a for loop:

for(k=1; k < N-1; k++){
/*Procedure of dense array times dense matrix.*/
}

I wonder if the part:

for(int i = 0; i < N; i++){ 
   aold[i] = a[i]; 
}

can somehow replaced by a swap between array pointers (for a k > 1), which would save some time. I don’t know how to do that (didn’t find a reference for that in OpenACC Docs) and even whether OpenACC allows me to do such a thing or not.

Best Regards,
Daniel

Hi Daniel,

can somehow replaced by a swap between array pointers

Sure. The host to device association of variables is by the host address not the variable name, so pointer swapping on the host is fine.

Something like:

#pragma acc data copyin(m[0:(N*N)]) copy(a[0:N]) create(aold[0:N]) 
   { 

#pragma acc kernels loop present(a,aold)
      for(int i = 0; i < N; i++){ 
           aold[i] = a[i]; 
        } 

for(k=1; k < N-1; k++){ 

     if (k>1) {
       // swap pointers
       tmp = a;
       a = aold;
       aold = tmp;
     }

    #pragma acc kernels present(a,aold)
     { 
          #pragma acc loop gang 
        for(int i = 0; i < N; i++){ 
            sum = 0.0; 
            #pragma acc loop vector reduction(+:sum) 
            for(int j = 0; j < N; j++) 
                sum +=  aold[j] * m[i + j*N]); 
            a[i] = sum; 
        } 
      } 
    }
  • Mat

Mat,

The swap stuff didn’t work. Two questions:

  • (Concerning to the version with no swapping) -> The code is only parallelized by Open ACC when I declare a as restrict otherwise the compiler outputs:


Complex loop carried dependence of a-> prevents parallelization
Loop carried dependence of aold-> prevents parallelization
Loop carried backward dependence of aold-> prevents vectorization

Is that supposed to happen?

  • (Concerning to the version with swapping) -> If I use restrict with a, this means the object pointed by a is only accessed by the a pointer, doesn’t it? Thus I can’t swap pointers like I’m doing, right?
    Ps 1.: I tested with and without restrict, in both cases I get wrong values.

Daniel