HOW TO USE'UPDATE'

teslalady · September 7, 2012, 4:03pm

I HAVE A CODE AS BELOW;

for (k=0;k<n;k++)
        {    
                temp=k;
                       for(p=k;p<n;p++)
                {
                         if( fabs(a[p*n+k])> fabs(a[temp*n+k]))
                                   temp=p;    
                }
                for(i=0;i<n;i++)
                { 
                        change=a[k*n+i];
                         a[k*n+i]=a[temp*n+i];
                         a[temp*n+i]=change;
                }    
             
                

                    if(a[k*n+k]==0)
                        return 0;
                [color=Red]for(j=k+1;j<n;j++)
                {
                        a[j*n+k]=a[j*n+k]/a[k*n+k];
                }
                         
                for(j=k+1;j<n;j++)
                {           
                        
                        for(i=k+1;i<n;i++)
                        {
                                a[j*n+i]=a[j*n+i]-a[j*n+k]*a[k*n+i];    
                        }                        
                }[/color]
                
          }

I want to ACELERATE the red area that i marked,so I oPTiMIzIED the coDE As BelOW:

for (k=0;k<n;k++)
        {    
                temp=k;
                       for(p=k;p<n;p++)
                {
                         if( fabs(a[p*n+k])> fabs(a[temp*n+k]))
                                   temp=p;   
                }
                for(i=0;i<n;i++)
                { 
                        change=a[k*n+i];
                         a[k*n+i]=a[temp*n+i];
                         a[temp*n+i]=change;
                }    
             
                

                    if(a[k*n+k]==0)
                        return 0;
                [color=Red]#pragma acc data  copy(a[:N*N])
                {        
                        #pragma acc kernels
                        {
                                #pragma acc loop independent
                                for(j=k+1;j<n;j++)
                                {
                                        a[j*n+k]=a[j*n+k]/a[k*n+k];
                                }
                        }
                         #pragma acc kernels 
                        { 
                                #pragma acc loop independent
                                for(j=k+1;j<n;j++)
                                {           
                                        #pragma acc loop independent
                                        for(i=k+1;i<n;i++)
                                        {
                                                a[j*n+i]=a[j*n+i]-a[j*n+k]*a[k*n+i];    
                                        }                        
                                }
                        }
                }[/color]
         }

it can be compiled succesfully,but it seemed be not accelerated.i found there should be somthing wrong with #pragma acc data copy(a[:N*N]) it need copy daTA For eaCh k lOOP.i know that ‘updaTE’ COuLD be used for updating data,so how can i use it in my code/[/quote]

MatColgrove · September 7, 2012, 10:24pm

Hi Sisiy,

The update directive would be used within a data region to synchronize the device and host copies of the data. Typically for this type of code I would recommend you put the data region outside so that you’re not copying “a” back and forth between the GPU and host for every iteration of the “k” loop. However, if you put the update directives both before and after the compute regions, you’re no better off then using the inner data region.

Ideally, you’d put all computation performed on “a” into a compute region. Even in this code, where you have sequential regions, it’s better to put these sequential regions on the GPU rather than copy “a” back and forth each time. To illustrate, I wrote the example bellow. Note that without the outer data region (like what you have), my runtime on the GPU was 448 seconds with 440 of those spent copying data (versus 65 seconds when compiled for the host). After moving the data region outside of the “k” loop and adding the sequential region, the GPU time dropped to 9 seconds with the data being 0.008 seconds of that.

The one caveat is that I removed the inner “if” check. While you can add an update directive here to return this element of “a”, your runtime will degrade. Also I question if this if statement is effective given it’s a comparison of absolute equality to a floating point number.

Here’s my example code:

% cat test_update_1.c

#include <stdlib.h>
#include <stdio.h>
#include <malloc.h>
#ifdef _OPENACC
#include <accelmath.h>
#endif
int main ()  {

double * a;
double change;
int i,j,k,temp,p,n;

n = 5000;

a = (double*) malloc(n*n*sizeof(double));
srand(123);
for (i = 0; i < n*n; ++i) {
   a[i] = (double) rand() / (double) RAND_MAX;
}
 

#pragma acc data copy(a[0:n*n])
{

for (k=0;k<n;k++)
        {   
                //temp=k;
#pragma acc kernels 
{
                for(p=temp=k;p<n;p++)
                {
                         if( fabs(a[p*n+k])> fabs(a[temp*n+k]))
                                   temp=p;   
                }
#pragma acc loop gang vector independent
                for(i=0;i<n;i++)
                {
                        change=a[k*n+i];
                         a[k*n+i]=a[temp*n+i];
                         a[temp*n+i]=change;
                }   
 }            
               
/*
                    if(a[k*n+k]==0)
                        break;
*/

#pragma acc kernels loop independent
                for(j=k+1;j<n;j++)
                {
                        a[j*n+k]=a[j*n+k]/a[k*n+k];
                }
                         
#pragma acc kernels loop independent
                for(j=k+1;j<n;j++)
                {           
#pragma acc loop independent
                        for(i=k+1;i<n;i++)
                        {
                                a[j*n+i]=a[j*n+i]-a[j*n+k]*a[k*n+i];   
                        }                       
                }
}
              
          }

for (i = 0; i < n*n; i+=((n/10)*n)) {
    printf ("%d: %g \n", i, a[i]);
}

free(a);
}

% pgcc -fast test_update_1.c -Minfo=accel -o test_update_gpu1.out -acc -V12.8 -Msafeptr
main:
     23, Generating copy(a[0:n*n])
     29, Generating copy(a[0:n*n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     31, Loop carried scalar dependence for 'temp' at line 33
         Scalar last value needed after loop for 'temp' at line 40
         Scalar last value needed after loop for 'temp' at line 41
         Accelerator restriction: scalar variable live-out from loop: temp
         Accelerator kernel generated
         31, CC 1.3 : 12 registers; 36 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 2 registers; 0 shared, 60 constant, 0 local memory bytes
     37, Loop is parallelizable
         Accelerator kernel generated
         37, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.3 : 11 registers; 48 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 16 registers; 0 shared, 72 constant, 0 local memory bytes
     50, Generating copy(a[0:n*n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     51, Loop is parallelizable
         Accelerator kernel generated
         51, #pragma acc loop gang /* blockIdx.x */
             CC 1.3 : 23 registers; 44 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 24 registers; 0 shared, 76 constant, 0 local memory bytes
     56, Generating copy(a[0:n*n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     57, Loop is parallelizable
     60, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
         60, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.3 : 14 registers; 52 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 20 registers; 0 shared, 68 constant, 0 local memory bytes
% time test_update_gpu1.out
0: 0.0600514 
2500000: 15.6055 
5000000: 0.469559 
7500000: 12.7688 
10000000: 5.39385 
12500000: 14.9757 
15000000: 11.6148 
17500000: 2.55971 
20000000: 10.9453 
22500000: 15.9231 

Accelerator Kernel Timing data
/proj/qa/support/sisiy/test_update_1.c
  main
    56: region entered 5000 times
        time(us): total=7,525,391 init=317 region=7,525,074
                  kernels=7,480,182
        w/o init: total=7,525,074 max=4,487 min= avg=1,505
        60: kernel launched 4999 times
            grid: [1-40x1-4999]  block: [128]
            time(us): total=7,480,182 max=4,477 min=5 avg=1,496
/proj/qa/support/sisiy/test_update_1.c
  main
    50: region entered 5000 times
        time(us): total=237,549 init=370 region=237,179
                  kernels=193,538
        w/o init: total=237,179 max=145 min= avg=47
        51: kernel launched 4999 times
            grid: [1-4999]  block: [1]
            time(us): total=193,538 max=78 min=5 avg=38
/proj/qa/support/sisiy/test_update_1.c
  main
    29: region entered 5000 times
        time(us): total=200,767 init=324 region=200,443
                  kernels=47,476
        w/o init: total=200,443 max=762 min=37 avg=40
        31: kernel launched 5000 times
            grid: [1]  block: [1]
            time(us): total=15,473 max=13 min=3 avg=3
        37: kernel launched 5000 times
            grid: [40]  block: [128]
            time(us): total=32,003 max=16 min=6 avg=6
/proj/qa/support/sisiy/test_update_1.c
  main
    23: region entered 1 time
        time(us): total=9,212,921 init=1,158,163 region=8,054,758
                  data=76,850
        w/o init: total=8,054,758 max=8,054,758 min=8,054,758 avg=8,054,758
4.651u 5.057s 0:09.87 98.2%	0+0k 0+0io 0pf+0w

Hope this helps,
Mat

Topic		Replies	Views
update directive - how to? Legacy PGI Compilers	3	2365	November 29, 2012
update vs copy Legacy PGI Compilers	1	2986	October 30, 2013
Accelerator Programming Model on NVIDIA GPUs Part 4 Legacy PGI Compilers	4	2396	June 26, 2012
update clause gives error Legacy PGI Compilers	1	1758	March 20, 2012
OpenACC declare and update directives Legacy PGI Compilers	6	10213	November 10, 2012
Error and huge slowdown from !$acc region Legacy PGI Compilers	4	2970	March 26, 2012
data region problem Legacy PGI Compilers	4	4984	August 3, 2010
update and mirror problems Legacy PGI Compilers	2	2947	November 19, 2011
Avoid reallocating memory on the GPU. Legacy PGI Compilers	11	5648	January 28, 2013
pgi13.3 PB , mirror/host data asynchronously updated ?? Legacy PGI Compilers	7	5866	May 17, 2013

HOW TO USE'UPDATE'

Related topics