Managing Shared Memory

Is it possible to manage shared memory? There is a cache clause that can be used to hint the compiler. I have the following accelerator region. The kreal and kimag values of a particular point are stored in separate arrays (ktempreal, ktempimag) and used in all nterms iterations. The idea here is to move these arrays to shared memory.

!$acc region 
!$acc do kernel private (ktempreal, ktempimag) cache(ktempreal,ktempimag)
        DO p = 1,npoints            
            !store k values in ktemp array
            DO r = 1,order+1
              DO mu = 0,NDIR-1
                ktempreal(mu,r) = kreal(mu,r,p) 
                ktempimag(mu,r) = kimag(mu,r,p)
              END DO
            END DO
          ! use the ktemp values for all nterms iterations
          DO i = 1,nterms
            phase3real = 0
            phase3imag = 0
            DO r = 1,order+1
              DO mu = 0,NDIR-1
                phase3real =  phase3real - ktempimag(mu,r) * yxv(mu,r,i)
                phase3imag =  phase3imag + ktempreal(mu,r) * yxv(mu,r,i)
              END DO
            END DO
            vtxgpureal(p) = vtxgpureal(p) + phase3real
            vtxgpuimag(p) = vtxgpuimag(p) + phase3imag
          END DO
        END DO
!$acc end region

The compiler generated the following messages:

60, Generating copyin(kimag(0:ndir-1,1:order+1,1:npoints))
         Generating copyin(kreal(0:ndir-1,1:order+1,1:npoints))
         Generating copy(vtxgpuimag(0,0,0,1:npoints))
         Generating copy(vtxgpureal(0,0,0,1:npoints))
         Generating copyin(yxv(0:ndir-1,1:order+1,1:nterms))
         Generating compute capability 2.0 binary
     62, Loop is parallelizable
         Accelerator kernel generated
         62, !$acc do parallel, vector(32)
             Non-stride-1 accesses for array 'vtxgpuimag'
             Non-stride-1 accesses for array 'vtxgpureal'
             CC 2.0 : 30 registers; 4 shared, 152 constant, 0 local memory bytes; 16 occupancy
     63, Loop is parallelizable
     64, Loop is parallelizable
     69, Complex loop carried dependence of 'vtxgpureal' prevents parallelization
         Loop carried dependence of 'vtxgpureal' prevents parallelization
         Loop carried backward dependence of 'vtxgpureal' prevents vectorization
         Complex loop carried dependence of 'vtxgpuimag' prevents parallelization
         Loop carried dependence of 'vtxgpuimag' prevents parallelization
         Loop carried backward dependence of 'vtxgpuimag' prevents vectorization
     72, Loop is parallelizable
     73, Loop is parallelizable

Please advise.

Hi Karthee,

Is it possible to manage shared memory?

For this particular code, the ktempreal and ktempimag variables are local (private) arrays. Hence they will be placed in registers unless they are too large or there are too many other local variables, in which case the registers may 'spill' into global memory. Register allocation is performed by the back-end NVIDIA compiler, over which the user has no control other than limiting the maximum number of registers to use ("-ta=nvidia,maxregcount:n").

"Shared" memory is fast on-chip memory that can be accessed by all threads within a single thread block; it is typically used to cache data from global memory. CUDA Fortran requires the programmer to manage shared memory explicitly. In the PGI Accelerator model, the compiler manages the shared memory for you.
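For comparison, this is roughly what explicit shared memory management looks like in CUDA Fortran. The kernel below is only an illustrative sketch (the kernel name and the assumption that NDIR and ORDER are compile-time constants are mine, not from your code):

```fortran
! Hypothetical CUDA Fortran kernel showing an explicitly declared
! shared-memory tile. NDIR and ORDER are assumed to be parameters.
attributes(global) subroutine demo_kernel(kreal, n)
  real, device :: kreal(n)
  integer, value :: n
  ! One copy of this array exists per thread block, in on-chip
  ! shared memory, visible to all threads of that block.
  real, shared :: ktile(0:NDIR-1, ORDER+1)
  ! ... threads cooperatively load ktile from global memory,
  ! call syncthreads(), then reuse the cached values ...
end subroutine demo_kernel
```

With the PGI Accelerator directives, by contrast, there is no equivalent declaration to write: the compiler decides what to stage in shared memory.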

The idea here is to move this array to shared memory.

Provided that ktempreal and ktempimag are stored in registers (i.e. not spilled), you have essentially already achieved the same effect as shared memory, since registers and shared memory both reside in a multiprocessor's on-chip memory.

Another possibility is to have the compiler put 'kreal' and 'kimag' into shared memory. In other words, remove the temp arrays and access kreal and kimag directly. However, you'll need to move the parallel dimension (p) to the first dimension to allow for contiguous memory access.
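The restructured region would look something like the sketch below. This assumes kreal and kimag have been re-dimensioned as (npoints, 0:NDIR-1, order+1) so that consecutive threads (consecutive p) touch consecutive memory; I have not compiled this, so treat it as a starting point:

```fortran
!$acc region
!$acc do kernel
      DO p = 1,npoints
        DO i = 1,nterms
          phase3real = 0
          phase3imag = 0
          DO r = 1,order+1
            DO mu = 0,NDIR-1
              ! Temp arrays removed: kreal/kimag are read directly.
              ! With p in the first dimension the accesses are
              ! stride-1, so the compiler can coalesce them and
              ! stage the data in shared memory itself.
              phase3real = phase3real - kimag(p,mu,r) * yxv(mu,r,i)
              phase3imag = phase3imag + kreal(p,mu,r) * yxv(mu,r,i)
            END DO
          END DO
          vtxgpureal(p) = vtxgpureal(p) + phase3real
          vtxgpuimag(p) = vtxgpuimag(p) + phase3imag
        END DO
      END DO
!$acc end region
```

This should also clear the "Non-stride-1 accesses" messages the compiler reported for vtxgpureal and vtxgpuimag's neighbours, since the p index now varies fastest in memory.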

Hope this helps,
Mat