OpenACC cache directive

Hello, I ran into a problem when trying to use the #pragma acc cache directive. It is supposed to cache the specified data in the GPU's shared memory, but the compiler log says my kernel doesn't use shared memory:

main:
     12, Generating copyout(B[0:N-1][0:M-1])
         Generating copyin(A[0:N-1][0:M-1])
     15, Generating present_or_copyin(A[0:N-1][0:M-1])
         Generating present_or_copyout(B[0:N-1][0:M-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     17, Loop is parallelizable
     20, Loop is parallelizable
         Accelerator kernel generated
         17, #pragma acc loop gang, vector(32) /* blockIdx.y threadIdx.y */
             Cached references to size [(y+2)x(x+2)] block of 'A'
         20, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             CC 1.0 : 26 registers; 48 shared, 20 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 68 constant, 0 local memory bytes

Actually it should use at least 324 shared-memory elements to cache a block of size (16+2)x(16+2).

Here is the code:

#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
	int A[1000][1000];
	int B[1000][1000];
	int N=1000;//count of elements
	int M=1000;//count of elements
	
	#pragma acc data copyin (A[0:N-1][0:M-1]), copyout(B[0:N-1][0:M-1])
	{
	
	#pragma acc kernels loop independent vector(32)
	{
	for (int i=1;i<N-1;i++)
		{
		#pragma acc loop independent vector (32)
		for (int j=1; j<M-1;j++)
		{
			//#pragma acc cache (A[i-1:i+1][j-1:j+1])
			B[i][j]=0;			
			B[i][j]+=A[i-1][j];
			B[i][j]+=A[i-1][j-1];
			B[i][j]+=A[i-1][j+1];
			B[i][j]+=A[i][j];
			B[i][j]+=A[i][j-1];
			B[i][j]+=A[i][j+1];
			B[i][j]+=A[i+1][j];
			B[i][j]+=A[i+1][j-1];
			B[i][j]+=A[i+1][j+1];
			B[i][j]=B[i][j]/9;
		}
		}	
	}	
	}
}

Could you please tell me how to place data into GPU shared memory?

Yours sincerely,
Alex Ivakhnenko

Hi Alex,

You have it correct. The ptxas information shown is only for static shared memory. The shared memory you’re using is dynamically allocated at launch and is adjusted to match the thread block size.

> cache block size of [(16+2)x(16+2)].

In this case it's actually (32+2)x(32+2), since the compiler scheduled a 32x32 thread block (vector(32) in both dimensions, as shown in the compiler feedback).

Hope this helps,
Mat