Questions on cache directive capability

Jing_Li · June 15, 2014, 9:55am

Hi all,
I’ve been trying to use cache directive of OpenACC, and I read a few earlier posts on this. After a few try and error, I find it seems that cache directive can handle situations with contiguous arrays (like this code:). Does anyone has any deeper knowledge and experience on cache directive capability? For example, is it possible to use shared memory in a matrix transpose?

Very grateful for any insight on this.

  #pragma acc kernels loop independent vector(32) 
   for (int i=1;i<N-1;i++) 
      {   
      #pragma acc loop independent vector (32) 
      for (int j=1; j<M-1;j++) 
      {   
         #pragma acc cache (A[i-1:i+1][j-1:j+1]) 
         B[i][j]=0;    
         B[i][j]+=A[i-1][j]; 
         B[i][j]+=A[i-1][j-1]; 
         B[i][j]+=A[i-1][j+1]; 
         B[i][j]+=A[i][j]; 
         B[i][j]+=A[i][j-1]; 
         B[i][j]+=A[i][j+1]; 
         B[i][j]+=A[i+1][j]; 
         B[i][j]+=A[i+1][j-1]; 
         B[i][j]+=A[i+1][j+1]; 
         B[i][j]=B[i][j]/9; 
      }   
      }

Update:
I used this following code to try to use shared memory with cache directive, but it turned out that shared mermory is not utilized at all:

void transpose(const int n, float *A,  float *B) 
{
    int i, j, k;
    float  tmp[n][n];
#pragma acc data present(A[0:n*n], B[0:n*n]) create(tmp[0:n][0:n]) 
    for(k = 0 ; k < Niter ; k ++) 
    {   
#pragma acc kernels loop independent
        for(i = 0; i < n; i++) 
        {   
#pragma acc loop independent
            for(j = 0 ;j < n ;j++)
            {   
#pragma acc cache(tmp[i:i][j:j]) 
                tmp[i][j] = A[i*n+j];
                B[i*n+j] = tmp[j][i];
            }   
        }                                                                                                                                                                  
    }   
}

MatColgrove · June 16, 2014, 3:38pm

Hi Jing Li,

The compiler recognizes “tmp” isn’t needed so is optimizing it away. Hence, no caching.

What type of device are you using? If it’s a Compute Capable 3.5 device, I’d use the textured memory instead. For this you don’t need the “cache” directive, rather simply add “restrict” the declaration of “A” and the compiler and hardware will manage the caching. Look for the “ld.global.nc” instruction in the PTX code to tell if LDG (i.e. texture caching) is being utilized.

Hope this helps,
Mat

% pgcc -ta=tesla:keep,cc35 test.c -Minfo=accel -c -V14.6
transpose:
      4, Generating present_or_copyout(B[1:N-2][1:M-2])
         Generating present_or_copyin(A[:N][:M])
         Generating Tesla code
      5, Loop is parallelizable
      8, Loop is parallelizable
         Accelerator kernel generated
          5, #pragma acc loop gang, vector(32) /* blockIdx.y threadIdx.y */
          8, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
% grep "ld.global.nc" test.n001.ptx
        ld.global.nc.u64        %rd9, [%rd1+8];
        ld.global.nc.u64        %rd13, [%rd3];
        ld.global.nc.u64        %rd16, [%rd3+8];
        ld.global.nc.u64        %rd19, [%rd3+16];

Jing_Li · June 17, 2014, 1:03am

Hi Mat,
That works, thanks.
I came across some other question while trying to use shared memory, hope you could enlighten me.

This is a kernel that performs gaussian blur, data type of array is unsigned char(which works just fine). But when I changed data type to int, I get error like this during execution: call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

#pragma acc data present(original_image[:cols][:rows],blur_image_gpu[:cols][:rows])
        {
            int iy, ix;
#pragma acc kernels
#pragma acc loop independent
            for (iy = 0; iy < rows; iy++)
            {
                blur_image_gpu[iy * cols] = 0;
                blur_image_gpu[iy * cols + 1] = 0;
                blur_image_gpu[((iy+1) * cols) -1] = 0;
                blur_image_gpu[((iy+1) * cols) -2] = 0;
            }

#pragma acc kernels
#pragma acc loop independent
            for (ix = 2; ix < cols-2; ix++)
            {
                blur_image_gpu[ix] = 0;
                blur_image_gpu[ix + cols] = 0;
                blur_image_gpu[ix + ((rows-1) * cols)] = 0;
                blur_image_gpu[ix + ((rows - 2) * cols)] = 0;
            }


#pragma acc kernels
#pragma acc loop independent
            for (iy = 2; iy < rows - 2; iy++)
            {
#pragma acc loop independent
                for (ix = 2; ix < cols - 2; ix++)
                {
#pragma acc cache(original_image[iy-2:iy+2][ix-2:ix+2]) 
                    blur_image_gpu[ ix + iy * cols] = f * (
                            s0 * original_image[iy * cols + ix] +
                            s1 * ( original_image[(iy) * cols + (ix - 1)] + original_image[(iy) * cols + (ix + 1)] + original_image[(iy - 1) * cols + (ix)] + original_imag
e[(iy + 1) * cols + (ix)]) +
                            s2 * (original_image[(iy - 1) * cols + (ix - 1)] + original_image[(iy - 1) * cols + (ix + 1)] + original_image[(iy + 1) * cols + (ix - 1)] + or
iginal_image[(iy + 1) * cols + (ix + 1)]) +
                            s4 * ( original_image[(iy) * cols + (ix - 2)] + original_image[(iy) * cols + (ix + 2)] + original_image[(iy - 2) * cols + (ix)] + original_imag
e[(iy + 2) * cols + (ix)]) +                  
                            s5 * (original_image[(iy - 1) * cols + (ix - 2)] + original_image[(iy - 2) * cols + (ix - 1)] + original_image[(iy - 2) * cols + (ix + 1)] + or
iginal_image[(iy - 1) * cols + (ix + 2)] + original_image[(iy + 1) * cols + (ix - 2)] + original_image[(iy + 2) * cols + (ix - 1)] + original_image[(iy + 2) * cols + (ix +
 1)] + original_image[(iy + 1) * cols + (ix + 2)]) +
                            s8 * (original_image[(iy - 2) * cols + (ix - 2)] + original_image[(iy - 2) * cols + (ix + 2)] + original_image[(iy + 2) * cols + (ix - 2)] + or
iginal_image[(iy + 2) * cols + (ix + 2)]) 
                            );
                }
            }
        }

When I tried another program hotspot (a stencil program which simulate chip temperature), I got this compilation error, which was confusing, I don’t know what I did wrong. Compilation error:
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load datatype (acc_hotspot.c: 132)
compute_tran_temp_gpu:
127, Generating present(temp[:col])
Generating present(power[:col])
Generating present(result[:col])
133, Loop is parallelizable
135, Loop is parallelizable
Accelerator kernel generated
133, #pragma acc loop gang /* blockIdx.y /
Cached references to size [3x(x+2)] block of ‘temp’
135, #pragma acc loop gang, vector(128) / blockIdx.x threadIdx.x /
199, Loop is parallelizable
201, Loop is parallelizable
Accelerator kernel generated
199, #pragma acc loop gang / blockIdx.y /
201, #pragma acc loop gang, vector(128) / blockIdx.x threadIdx.x */
PGC/x86-64 Linux 14.6-0: compilation completed with severe errors
Code:

#pragma acc data present(temp[0:row][:col], power[0:row][:col], result[0:row][:col]) 
    for (int i = 0; i < num_iterations ; i++)
    {   
        double delta;
        int r, c;
#pragma acc kernels loop independent
        for (r = 1; r < row - 1; r++) {
#pragma acc loop independent
            for (c = 1; c < col - 1; c++) {
#pragma acc cache(temp[r-1:r+1][c-1:c+1])
                /*  Corner 1    */
                if ( (r == 0) && (c == 0) ) { 
                    delta = (step / Cap) * (power[0] +
                            (temp[1] - temp[0]) / Rx +
                            (temp[col] - temp[0]) / Ry +
                            (amb_temp - temp[0]) / Rz);
                }   /*  Corner 2    */
                else if ((r == 0) && (c == col-1)) {
                    delta = (step / Cap) * (power[c] +
                            (temp[c-1] - temp[c]) / Rx +
                            (temp[c+col] - temp[c]) / Ry +
                            (amb_temp - temp[c]) / Rz);
                }   /*  Corner 3    */
                else if ((r == row-1) && (c == col-1)) {
                    delta = (step / Cap) * (power[r*col+c] + 
                            (temp[r*col+c-1] - temp[r*col+c]) / Rx + 
                            (temp[(r-1)*col+c] - temp[r*col+c]) / Ry + 
                            (amb_temp - temp[r*col+c]) / Rz);                   
                }   /*  Corner 4    */
                else if ((r == row-1) && (c == 0)) {
                    delta = (step / Cap) * (power[r*col] + 
                            (temp[r*col+1] - temp[r*col]) / Rx + 
                            (temp[(r-1)*col] - temp[r*col]) / Ry + 
                            (amb_temp - temp[r*col]) / Rz);
                }   /*  Edge 1  */
                else if (r == 0) {
                    delta = (step / Cap) * (power[c] + 
                            (temp[c+1] + temp[c-1] - 2.0*temp[c]) / Rx + 
                            (temp[col+c] - temp[c]) / Ry + 
                            (amb_temp - temp[c]) / Rz);
                }   /*  Edge 2  */
                else if (c == col-1) {
                    delta = (step / Cap) * (power[r*col+c] + 
                            (temp[(r+1)*col+c] + temp[(r-1)*col+c] - 2.0*temp[r*col+c]) / Ry + 
                            (temp[r*col+c-1] - temp[r*col+c]) / Rx + 
                            (amb_temp - temp[r*col+c]) / Rz);       
                }   /*  Edge 3  */
                else if (r == row-1) {
                    delta = (step / Cap) * (power[r*col+c] + 
                            (temp[r*col+c+1] + temp[r*col+c-1] - 2.0*temp[r*col+c]) / Rx + 
                            (temp[(r-1)*col+c] - temp[r*col+c]) / Ry + 
                            (amb_temp - temp[r*col+c]) / Rz);
                }   /*  Edge 4  */
                else if (c == 0) {
                    delta = (step / Cap) * (power[r*col] + 
                            (temp[(r+1)*col] + temp[(r-1)*col] - 2.0*temp[r*col]) / Ry + 
                            (temp[r*col+1] - temp[r*col]) / Rx + 
                            (amb_temp - temp[r*col]) / Rz);
                }   /*  Inside the chip */
                else {
                    delta = (step / Cap) * (power[r*col+c] + 
                            (temp[(r+1)*col+c] + temp[(r-1)*col+c] - 2.0*temp[r*col+c]) / Ry + 
                            (temp[r*col+c+1] + temp[r*col+c-1] - 2.0*temp[r*col+c]) / Rx + 
                            (amb_temp - temp[r*col+c]) / Rz);
                }

                /*  Update Temperatures */
                result[r*col+c] =temp[r*col+c]+ delta;
            }
        }

#pragma acc kernels loop independent
        for (r = 0; r < row; r++) {
#pragma acc loop independent
            for (c = 0; c < col; c++) {
                temp[r*col+c]=result[r*col+c];
            }
        }
    }

MatColgrove · June 17, 2014, 8:26pm

Hi Jing,

Mind sending reproducing examples to PGI Customer Service (trs@pgroup.com) and asking them to forward them to me?

The error “Illegal address during kernel execution” is typically an out-of-bounds access violation or some other illegal memory access, though what’s causing it I can’t tell from what you’ve posted.

For #2, “Unexpected load datatype”, that’s probably a compiler error but again I’d need a reproducing example to understand exactly what’s happening.

Thanks,
Mat