Shared memory approach with OpenACC

Hi,

I’m trying to use shared memory to cache things with OpenACC.

Basically what I’m working on is a matrix multiplication, and what I have is this:

typedef float ff;

// Multiplies two square column-major matrices a and b, puts the result in c.
void mmul(const restrict ff* a,
          const restrict ff* b,
          restrict ff* c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{

#pragma acc region
{

#pragma acc loop independent vector(16)
  for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
    for (int j = 0; j < n; ++j) {
      ff sum = 0;
      for (int k = 0; k < n; ++k) {
        sum += a[i + n * k] * b[k + n * j];
      }
      c[i + n * j] = sum;
    }
  }

}
}
}

What I would like to do is use shared memory to cache tiles of the matrices ‘a’ and ‘b’ to use in the computation of ‘c’, in a similar fashion to what the CUDA mmul algorithm does.

I understand I can use the

#pragma acc cache

directive, but I’m having some trouble understanding how that’s going to be mapped to the CUDA architecture.
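
From what I can tell, the directive goes at the top of a loop body and names a contiguous subarray. My best guess at the syntax for my loop is below, but I have no idea whether the compiler will actually stage the slice in shared memory:

#pragma acc loop independent vector(16)
  for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
    for (int j = 0; j < n; ++j) {
      // Guess: stage column j of 'b' (contiguous in this layout) in fast
      // on-chip memory for the duration of the k loop.
#pragma acc cache(b[n*j:n])
      ff sum = 0;
      for (int k = 0; k < n; ++k) {
        sum += a[i + n * k] * b[k + n * j];
      }
      c[i + n * j] = sum;
    }
  }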

Basically on CUDA I would know the exact size of my blocks, and would be able to:

  • declare a shared memory array the size of the block
  • copy the ‘relevant’ part of the data into it
  • use that data in the computation (see the sketch below)
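
For reference, this is the kind of CUDA kernel I have in mind: the standard tiled multiply, written with the same indexing as my code above. TILE is just illustrative, and the sketch assumes n is a multiple of TILE, with no bounds checks.

#define TILE 16

// Each block stages a TILE x TILE piece of 'a' and 'b' in shared memory,
// synchronizes, and accumulates from the staged tiles.
__global__ void mmul_tiled(const float* a, const float* b, float* c, int n) {
  __shared__ float as[TILE][TILE];
  __shared__ float bs[TILE][TILE];

  int i = blockIdx.x * TILE + threadIdx.x;  // plays the role of 'i' above
  int j = blockIdx.y * TILE + threadIdx.y;  // plays the role of 'j' above
  float sum = 0.0f;

  for (int t = 0; t < n / TILE; ++t) {
    // Each thread copies one element of each tile (coalesced loads).
    as[threadIdx.y][threadIdx.x] = a[i + n * (t * TILE + threadIdx.y)];
    bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.x) + n * j];
    __syncthreads();
    for (int k = 0; k < TILE; ++k)
      sum += as[k][threadIdx.x] * bs[threadIdx.y][k];
    __syncthreads();
  }
  c[i + n * j] = sum;
}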

Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cache directive, or on how to map some of the power of CUDA shared memory to OpenACC?

Thanks

Hi lechat,

This will be tough with this code. The cache directive only works on contiguous data segments, so you could only use it on ‘b’, but caching ‘b’ won’t help here. What you want is for the reused portions of ‘a’ to be gathered into shared memory, but unfortunately the compiler doesn’t do this. You might be able to block the code, but I’m not sure it would help much.
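
If you want to experiment anyway, blocking the k loop would look something like the sketch below. This is untested, TILE is arbitrary, and it assumes n is a multiple of TILE; you could then try a cache directive on the TILE-long slice of ‘b’, though as I said I doubt it buys you much.

#define TILE 16

#pragma acc loop independent vector(16)
  for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
    for (int j = 0; j < n; ++j) {
      ff sum = 0;
      // Strip-mine the reduction so each pass touches a TILE-long,
      // contiguous slice of column j of 'b'.
      for (int kt = 0; kt < n; kt += TILE) {
        for (int k = kt; k < kt + TILE; ++k) {
          sum += a[i + n * k] * b[k + n * j];
        }
      }
      c[i + n * j] = sum;
    }
  }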

Note that unless you’re using a C1060, shared memory usage is not as critical: newer NVIDIA cards have hardware caching.

  • Mat

Hi Mat, thanks for your reply!

I have a GTX 280, and although I get good performance compared with the CPU, I’m still about 10x slower than CUBLAS (I’ve included my timing information below).

size    serial (s)    acc (s)      cublas (s)
 256     0.014000     0.002000     0.000000
 512     0.112000     0.014000     0.006000
 768     0.380000     0.028000     0.006000
1024     0.900000     0.068000     0.012000
1280     1.732000     0.118000     0.022000
1536     2.996000     0.206000     0.034000
1792     4.748000     0.294000     0.048000
2048     7.092000     0.446000     0.068000
2304    10.080000     0.634000     0.090000
2560    13.820000     0.868000     0.122000
2816    18.382000     1.186000     0.158000
3072    23.878000     1.462000     0.196000
3328    30.337999     2.002000     0.244000
3584    37.897999     2.414000     0.296000
3840    46.722000     3.160000     0.358000
4096    56.584000     3.220000     0.428000

What I was looking for was to make my ACC implementation more like the CUDA implementation (which uses shared memory for caching).

Any tips on how to improve my performance in this case would be greatly appreciated.

Thanks

Hi lechat,

When I duplicate your code, my OpenACC performance is roughly double what you show (0.34 seconds for the 2304 size with your schedule, 0.29 with the schedule shown below). Granted, I’m using a C2070, but I don’t think that accounts for all of the difference. What’s the output when you compile with “-Minfo=accel”?

Note that you put the “restrict” keyword in the wrong place: it qualifies the pointer, so it belongs after the “*”, as in “const ff* restrict a”. Because of that, the compiler may not be optimizing your code as well as it could. Also, you can get better performance by making the “i” loop use a longer vector length.

Here’s my version of this routine:

void mmul(int size, float * restrict A, float * restrict B,
          float * restrict C)
{
  // Compute matrix multiplication.
#pragma acc kernels copyin(A[0:size*size],B[0:size*size]) \
  copyout(C[0:size*size])
  {
#pragma acc loop gang vector(256) independent
    for (int i = 0; i < size; ++i) {
#pragma acc loop gang vector(2) independent
      for (int j = 0; j < size; ++j) {
        float tmp = 0.0;
        for (int k = 0; k < size; ++k) {
          tmp += A[i + k*size] * B[k + j*size];
        }
        C[i + j*size] = tmp;
      }
    }
  }
}



> I’m still about 10x slower than CUBLAS (I’ve included my timing information below).

I would expect CUBLAS to be faster since it’s been hand-optimized, but 10x seems a bit high. Let’s see if you can get better performance using the version above. Also, are you including data transfer time in your CUBLAS timings? At least for me, the OpenACC time does include data transfers.
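
If you aren’t, a harness along these lines makes the comparison apples-to-apples. This is an untested sketch using the cuBLAS v2 API and CUDA events, with error checking omitted:

#include <cuda_runtime.h>
#include <cublas_v2.h>

// Times one SGEMM including the host<->device copies; returns seconds.
float time_cublas(const float* a, const float* b, float* c, int n) {
  size_t bytes = (size_t)n * n * sizeof(float);
  float *da, *db, *dc;
  cudaMalloc(&da, bytes);
  cudaMalloc(&db, bytes);
  cudaMalloc(&dc, bytes);

  cublasHandle_t handle;
  cublasCreate(&handle);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);  // clock starts before the copies
  cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, da, n, db, n, &beta, dc, n);
  cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cublasDestroy(handle);
  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);
  return ms / 1000.0f;
}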

  • Mat