Hi,
I’m trying to use shared memory to cache things with OpenACC.
I’m working on a matrix multiplication, and here’s what I have:
typedef float ff;

// Multiplies two square column-major matrices a and b, puts the result in c.
void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
  {
#pragma acc kernels
    {
#pragma acc loop independent vector(16)
      for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
        for (int j = 0; j < n; ++j) {
          ff sum = 0;
          for (int k = 0; k < n; ++k) {
            sum += a[i + n * k] * b[k + n * j];
          }
          c[i + n * j] = sum;
        }
      }
    }
  }
}
What I would like to do is use shared memory to cache tiles of the matrices ‘a’ and ‘b’ to use in the computation of ‘c’, in a similar fashion to what the CUDA mmul algorithm does.
I understand I can use the
#pragma acc cache
directive, but I’m having some trouble understanding how that’s going to be mapped to the CUDA architecture.
Basically, on CUDA I would know the exact size of my blocks, and would be able to:
- declare shared memory with the size of the block
- copy the ‘relevant’ part of the data to the block
- use this data
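In plain C terms, those three steps are the tiling below — a sketch of what I mean, with the `TILE` size and the `mmul_tiled` name just made up for illustration (the staged `ta`/`tb` arrays are the part CUDA would put in shared memory):

```c
#include <assert.h>

#define TILE 16

typedef float ff;

/* Column-major n x n multiply, c = a * b, computed tile by tile:
   each TILE x TILE block of a and b is staged into small local
   arrays before being used for the partial products. */
void mmul_tiled(const ff* restrict a, const ff* restrict b,
                ff* restrict c, const int n) {
  assert(n % TILE == 0); /* keep the sketch simple */
  for (int i0 = 0; i0 < n; i0 += TILE)
    for (int j0 = 0; j0 < n; j0 += TILE) {
      ff sum[TILE][TILE] = {{0}};
      for (int k0 = 0; k0 < n; k0 += TILE) {
        /* step 1+2: stage one tile of a and one tile of b */
        ff ta[TILE][TILE], tb[TILE][TILE];
        for (int i = 0; i < TILE; ++i)
          for (int k = 0; k < TILE; ++k) {
            ta[i][k] = a[(i0 + i) + n * (k0 + k)];
            tb[k][i] = b[(k0 + k) + n * (j0 + i)];
          }
        /* step 3: use only the staged tiles for this partial product */
        for (int i = 0; i < TILE; ++i)
          for (int j = 0; j < TILE; ++j)
            for (int k = 0; k < TILE; ++k)
              sum[i][j] += ta[i][k] * tb[k][j];
      }
      for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j)
          c[(i0 + i) + n * (j0 + j)] = sum[i][j];
    }
}
```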
Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cache directive, or on how to map some of the power of CUDA shared memory onto OpenACC?
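For concreteness, here’s the kind of thing I’ve been experimenting with — the cache directive placed at the top of a loop body, hinting that the column of ‘b’ reused across the inner loop should be staged in shared memory. The placement is my guess, not something I’ve verified the compiler honours, and the `mmul_cached` name is just mine:

```c
typedef float ff;

/* Column-major c = a * b, same structure as before, plus a cache
   hint. Column j of b (b[n*j : n*j+n]) is read n times per (i, j)
   pair, so it seems like the natural candidate to stage. */
void mmul_cached(const ff* restrict a, const ff* restrict b,
                 ff* restrict c, const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
  {
#pragma acc kernels
    {
#pragma acc loop independent vector(16)
      for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
        for (int j = 0; j < n; ++j) {
#pragma acc cache(b[n*j:n]) /* my guess: stage column j of b */
          ff sum = 0;
          for (int k = 0; k < n; ++k) {
            sum += a[i + n * k] * b[k + n * j];
          }
          c[i + n * j] = sum;
        }
      }
    }
  }
}
```

(Without an OpenACC compiler the pragmas are ignored and this runs as ordinary sequential C, which at least makes it easy to check for correctness.)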
Thanks