Hello, I ran into a problem when trying to use the #pragma acc cache directive. It is supposed to cache the specified data in the GPU's shared memory, but the compiler log says my kernel doesn’t use shared memory:
main:
     12, Generating copyout(B[0:N-1][0:M-1])
         Generating copyin(A[0:N-1][0:M-1])
     15, Generating present_or_copyin(A[0:N-1][0:M-1])
         Generating present_or_copyout(B[0:N-1][0:M-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     17, Loop is parallelizable
     20, Loop is parallelizable
         Accelerator kernel generated
         17, #pragma acc loop gang, vector(32) /* blockIdx.y threadIdx.y */
             Cached references to size [(y+2)x(x+2)] block of 'A'
         20, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             CC 1.0 : 26 registers; 48 shared, 20 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 68 constant, 0 local memory bytes
I expected it to use at least 324 shared memory elements, enough to cache a block of size [(16+2)x(16+2)].
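To make the arithmetic behind that expectation explicit, here is a small sketch that computes the size of a square tile plus a one-cell halo on each side. The helper names halo_tile_elems and halo_tile_bytes are made up for illustration, and the byte count assumes a 4-byte int.

```c
#include <stddef.h>

/* Number of elements in a square tile of edge `tile`
   surrounded by a halo of `halo` cells on every side. */
size_t halo_tile_elems(size_t tile, size_t halo)
{
    size_t edge = tile + 2 * halo;   /* e.g. 16 + 2 = 18 */
    return edge * edge;              /* e.g. 18 * 18 = 324 */
}

/* Bytes needed to hold that tile for elements of `elem_size` bytes. */
size_t halo_tile_bytes(size_t tile, size_t halo, size_t elem_size)
{
    return halo_tile_elems(tile, halo) * elem_size;
}
```

For a 16x16 tile with a halo of 1 this gives 324 elements, or 1296 bytes with 4-byte ints, which is well above the 48 and 0 shared bytes in the log. If the tile follows the vector(32) clauses instead, the same formula gives 34 x 34 = 1156 elements.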
Here is the code:
#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>

/* Static storage: two 1000x1000 int arrays (about 4 MB each with 4-byte int)
   can overflow the stack when declared as locals. */
static int A[1000][1000];
static int B[1000][1000];

int main(void)
{
    int N = 1000; // number of rows
    int M = 1000; // number of columns
    #pragma acc data copyin(A[0:N-1][0:M-1]), copyout(B[0:N-1][0:M-1])
    {
        #pragma acc kernels loop independent vector(32)
        for (int i = 1; i < N-1; i++)
        {
            #pragma acc loop independent vector(32)
            for (int j = 1; j < M-1; j++)
            {
                //#pragma acc cache (A[i-1:i+1][j-1:j+1])
                // 3x3 average of A around (i, j)
                B[i][j] = 0;
                B[i][j] += A[i-1][j];
                B[i][j] += A[i-1][j-1];
                B[i][j] += A[i-1][j+1];
                B[i][j] += A[i][j];
                B[i][j] += A[i][j-1];
                B[i][j] += A[i][j+1];
                B[i][j] += A[i+1][j];
                B[i][j] += A[i+1][j-1];
                B[i][j] += A[i+1][j+1];
                B[i][j] = B[i][j]/9;
            }
        }
    }
    return 0;
}
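For comparison, here is a rough sketch of the same stencil with the subarray bounds written in OpenACC's [start:length] form. In that form [0:N] covers all N rows, whereas the [0:N-1] in the code above covers only N-1 of them, and the 3x3 neighbourhood becomes A[i-1:3][j-1:3]. The function name box_average and the N/M macros are illustrative assumptions. I can't say whether a particular compiler will turn the cache request into shared memory, so treat this as something to try, not as a confirmed fix.

```c
#include <stdio.h>

#define N 1000  /* rows (illustrative macro) */
#define M 1000  /* columns (illustrative macro) */

/* Static storage keeps the two large arrays off the stack. */
static int A[N][M];
static int B[N][M];

/* 3x3 box average over the interior of A, written to B.
   Boundary rows and columns of B are not written. */
void box_average(void)
{
    /* Subarrays are [start:length]: [0:N] transfers all N rows. */
    #pragma acc data copyin(A[0:N][0:M]) copyout(B[0:N][0:M])
    {
        #pragma acc kernels loop independent
        for (int i = 1; i < N - 1; i++) {
            #pragma acc loop independent
            for (int j = 1; j < M - 1; j++) {
                /* Request caching of the 3x3 neighbourhood:
                   start at i-1 / j-1, length 3 in each dimension. */
                #pragma acc cache(A[i-1:3][j-1:3])
                int sum = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        sum += A[i + di][j + dj];
                B[i][j] = sum / 9;
            }
        }
    }
}
```

Without an OpenACC-enabled compiler the pragmas are ignored and the function runs serially on the host, which is a convenient way to check the results before looking at the accelerator log again.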
Could you please tell me how to place data into GPU shared memory?
Yours sincerely,
Alex Ivakhnenko