Hi all,

I have two questions about the cache() directive. I want to implement a blocked matrix multiplication A x B = C, where each GPU thread block handles one sub-matrix of C and loads the corresponding blocks of A and B into shared memory. In the code below, tidx and tidy index at the thread-block level, so within each thread block I want to load one sub-matrix each of A and B into shared memory.

First question: I tried to use cache() to load a sub-matrix, but the compiler (PGI 15.7) seems to ignore the directive. How can I fix this?

Second question: does cache() work with device pointers? Can I allocate a buffer directly on the device and then use cache() to load that buffer into shared memory?

Here is my code:
for (k = 0; k < block_num; k++)
{
    #pragma acc kernels
    #pragma acc loop gang independent
    for (tidx = 0; tidx < block_num; tidx++)
    {
        #pragma acc loop gang independent
        for (tidy = 0; tidy < block_num; tidy++)
        {
            /* attempt to cache the A and B sub-blocks in shared memory */
            #pragma acc cache(a_a[tidx*block_size:block_size][k*block_size:block_size], b_a[k*block_size:block_size][tidy*block_size:block_size])
            {
                #pragma acc loop independent vector(16)
                for (i = 0; i < block_size; i++)
                {
                    #pragma acc loop independent vector(16)
                    for (j = 0; j < block_size; j++)
                    {
                        // #pragma acc cache(a_a[tidx*block_size+i][k*block_size+j], b_a[k*block_size+i][tidy*block_size+j])
                        for (kk = 0; kk < block_size; kk++)
                        {
                            c_a[tidx*block_size+i][tidy*block_size+j] += p_alpha
                                * a_a[tidx*block_size+i][k*block_size+kk]
                                * b_a[k*block_size+kk][tidy*block_size+j];
                        }
                    }
                }
            }
        }
    }
}