Question about the cache() directive

Hi, all

I have two questions about the cache() directive. I want to implement a blocked matrix multiplication A x B = C in which each GPU thread block handles one sub-matrix of C and loads a block of A and a block of B into shared memory. My code is below. The tidx and tidy loops are at the thread-block level, so each thread block should load its own sub-matrices of A and B into shared memory.

First, I tried to load the sub-matrices with cache(), but the compiler (I'm using PGI 15.7) seems to ignore the directive. How can I fix this? Second, I am not sure whether cache() works with a device pointer. Can I allocate a buffer directly on the device and use cache() to load that buffer into shared memory?

for (k = 0; k < block_num; k++)
{
  #pragma acc kernels
  #pragma acc loop gang independent
  for (tidx = 0; tidx < block_num; tidx++)
  {
    #pragma acc loop gang independent
    for (tidy = 0; tidy < block_num; tidy++)
    {
      #pragma acc cache(a_a[tidx*block_size:block_size][k*block_size:block_size],b_a[k*block_size:block_size][tidy*block_size:block_size])
      {
        #pragma acc loop independent vector(16)
        for (i = 0; i < block_size; i++)
        {
          #pragma acc loop independent vector(16)
          for (j = 0; j < block_size; j++)
          {
            // #pragma acc cache(a_a[tidx*block_size+i][k*block_size+j],b_a[k*block_size+i][tidy*block_size+j])
            for (kk = 0; kk < block_size; kk++)
            {
              c_a[tidx*block_size+i][tidy*block_size+j] += p_alpha * a_a[tidx*block_size+i][k*block_size+kk] * b_a[k*block_size+kk][tidy*block_size+j];
            }
          }
        }
      }
    }
  }
}

Hi,

We are working on some issues with our support of the cache directive, and you may be running into some of the outstanding problems.

How are your C arrays declared?

You may have better luck with the approach you asked about: declare an array that is the size of your sub-matrix, list it in the private clause of the gang-level loop, and explicitly load the contents of the sub-matrix into it with a for-loop. Make that for-loop a vector loop. Also include the entire sub-matrix array in your cache clause, similar to what you've done in your code with the original matrix.
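A minimal sketch of that approach might look like the following. It is only an illustration under several assumptions: the function name tiled_matmul, the TILE constant, the flat row-major arrays, and the requirement that n be a multiple of TILE are all hypothetical and not taken from your code. It has not been checked against PGI 15.7, and whether the gang-private tiles actually end up in shared memory is up to the compiler. The cache clause is left out here to keep the sketch small.

```c
#include <assert.h>

#define TILE 16  /* hypothetical tile edge, matching vector(16) in the question */

/* Computes C += alpha * A * B for n x n row-major matrices.
 * Assumes n is a multiple of TILE. Without an OpenACC compiler the
 * pragmas are ignored and the function runs sequentially on the host. */
void tiled_matmul(int n, float alpha, const float *a, const float *b, float *c)
{
    assert(n % TILE == 0);
    int nb = n / TILE;            /* number of tiles per dimension */
    float a_tile[TILE][TILE];     /* gang-private copies of one tile of A and B */
    float b_tile[TILE][TILE];

    #pragma acc parallel loop gang collapse(2) private(a_tile, b_tile) \
        copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
    for (int bx = 0; bx < nb; bx++) {
        for (int by = 0; by < nb; by++) {
            /* Each gang walks the tiles along the shared dimension in order. */
            #pragma acc loop seq
            for (int k = 0; k < nb; k++) {
                /* Explicitly load the current tiles in a vector loop. */
                #pragma acc loop vector collapse(2)
                for (int i = 0; i < TILE; i++) {
                    for (int j = 0; j < TILE; j++) {
                        a_tile[i][j] = a[(bx * TILE + i) * n + k * TILE + j];
                        b_tile[i][j] = b[(k * TILE + i) * n + by * TILE + j];
                    }
                }
                /* Multiply the two tiles and accumulate into C. */
                #pragma acc loop vector collapse(2)
                for (int i = 0; i < TILE; i++) {
                    for (int j = 0; j < TILE; j++) {
                        float sum = 0.0f;
                        for (int kk = 0; kk < TILE; kk++)
                            sum += a_tile[i][kk] * b_tile[kk][j];
                        c[(bx * TILE + i) * n + by * TILE + j] += alpha * sum;
                    }
                }
            }
        }
    }
}
```

Moving the k loop inside the gang loops means a single compute region covers the whole multiplication, instead of one kernels region launched per value of k as in your version.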

We are actively working on fixing these cache clause issues in our compiler.
If you can provide a compilable example, we may be able to offer more help.

OK. Thank you.

I can provide a compilable example. Could you give me your email address so that I can send you the code along with the Makefile?

You can send the example to PGI Customer Service at trs@pgroup.com.

I just mailed you an example using temporary, private arrays that the compiler will put in shared memory.