Problems with the gettime() function when timing OpenACC code

Hello Mat, I tried using gettime() to compare CPU and GPU runtimes, but I found that sometimes gettime() does not give sensible timings, as shown in the following example:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
#include <openacc.h>
#include <accelmath.h>

#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*1000)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif
int main(int argc, char* argv[])
{

    int n=10000000;  /* size of the vector */
    float *a;  /* the vector */
    float *r;  /* the results */
    float *e;  /* expected results */
    float s, c;
    timestruct t1, t2, t3;
    long long cgpu, chost;
    int i, j;
    

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    for( i = 0; i < n; ++i ) a[i] = (float)(i+1) * 2.0f;
    acc_init( acc_device_nvidia );


    gettime( &t1 );

    #pragma acc loop independent
    for( i = 0; i < n; ++i ){
	s = sinf(a[i]);
	c = cosf(a[i]);
	r[i] = s*s + c*c;
    }
    
    gettime( &t2 );
    
    cgpu = usec(t1,t2);
    for( i = 0; i < n; ++i ){
	s = sinf(a[i]);
	c = cosf(a[i]);
	e[i] = s*s + c*c;
    }
    
    gettime( &t3 );
    chost = usec(t2,t3);

    printf( "%13d iterations completed\n", n );
    printf( "%13ld microseconds on GPU\n", cgpu );
    printf( "%13ld microseconds on host\n", chost );
    
    return 0;
}

Compiling and running it gives the following output:

"pwz66.c", line 31: warning: variable "j" was declared but never referenced [declared_but_not_referenced]
      int i, j;
             ^

Remark: individual warnings can be suppressed with "--diag_suppress <warning-name>"

     10000000 iterations completed
            0 microseconds on GPU
            0 microseconds on host

But if I modify the OpenACC directives, it seems to work normally:



#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
#include <openacc.h>
#include <accelmath.h>

#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*1000)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif
int main(int argc, char* argv[])
{

    int n=10000000;  /* size of the vector */
    float *a;  /* the vector */
    float *r;  /* the results */
    float *e;  /* expected results */
    float s, c;
    timestruct t1, t2, t3;
    long long cgpu, chost;
    int i, j;
    

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    for( i = 0; i < n; ++i ) a[i] = (float)(i+1) * 2.0f;
    acc_init( acc_device_nvidia );


    gettime( &t1 );

    #pragma acc kernels copyin(a[0:n])\
                create(s[0:n],c[0:n])\
                copyout(r[0:n])
    #pragma acc loop independent
    for( i = 0; i < n; ++i ){
	s = sinf(a[i]);
	c = cosf(a[i]);
	r[i] = s*s + c*c;
    }
    
    gettime( &t2 );
    
    cgpu = usec(t1,t2);
    for( i = 0; i < n; ++i ){
	s = sinf(a[i]);
	c = cosf(a[i]);
	e[i] = s*s + c*c;
    }
    
    gettime( &t3 );
    chost = usec(t2,t3);

    printf( "%13d iterations completed\n", n );
    printf( "%13ld microseconds on GPU\n", cgpu );
    printf( "%13ld microseconds on host\n", chost );
    
    return 0;
}

This gives the following result:

"pwz66.c", line 31: warning: variable "j" was declared but never referenced [declared_but_not_referenced]
      int i, j;
             ^

Remark: individual warnings can be suppressed with "--diag_suppress <warning-name>"

main:
     47, Generating copyout(r[:n]) [if not already present]
         Generating create(s) [if not already present]
         Generating copyin(a[:n]) [if not already present]
         Generating create(c) [if not already present]
         Loop is parallelizable
         Generating implicit private(s,c)
         Generating NVIDIA GPU code
         47, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

     10000000 iterations completed
        33198 microseconds on GPU
       320570 microseconds on host

I would like to know why this causes the timing error.

In the first example, the code isn’t getting offloaded. A “loop” directive can only be used within, or combined with, a compute region, i.e. a “parallel” or “kernels” directive. Otherwise it’s ignored.
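
For example, a minimal sketch of how the first version could be made to offload is the combined “parallel loop” form (reusing your a, r, n, and i, with the data clauses spelled out):

    gettime( &t1 );

    /* "parallel loop" is a combined construct: it creates the compute
       region that the bare "loop" directive was missing */
    #pragma acc parallel loop independent copyin(a[0:n]) copyout(r[0:n])
    for( i = 0; i < n; ++i ){
        float s = sinf(a[i]);
        float c = cosf(a[i]);
        r[i] = s*s + c*c;
    }

    gettime( &t2 );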

Why you’re not seeing any times, I’m not sure. Seems to be ok for me, though the times are the same since you’re not actually using the GPU:

% a.out
     10000000 iterations completed
        74205 microseconds on GPU
        74014 microseconds on host
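
The compiler feedback from your second version (“Generating NVIDIA GPU code”) already tells you at compile time that the loop was offloaded. If you also want to confirm it at runtime, one option (a small standalone sketch, not from your code) is the OpenACC acc_on_device() routine:

    #include <stdio.h>
    #include <openacc.h>

    int main( void )
    {
        int on_gpu = 0;
        /* acc_on_device() returns nonzero when called from code running on the device */
        #pragma acc parallel num_gangs(1) copyout(on_gpu)
        {
            on_gpu = acc_on_device( acc_device_not_host );
        }
        printf( "compute region ran on the GPU: %s\n", on_gpu ? "yes" : "no" );
        return 0;
    }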

While I doubt it’s happening here, dead-code elimination has the potential to remove both loops, given the results in “r” and “e” aren’t used. Hence it’s recommended to at least print one element from each array, or otherwise use the results, so the compiler doesn’t apply this optimization.
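
For example, adding something like the following before the final return (just a sketch; any use of the results works) keeps both loops live:

    /* touch one element of each result array so neither loop is dead code */
    printf( "r[0]=%f e[0]=%f  r[n-1]=%f e[n-1]=%f\n", r[0], e[0], r[n-1], e[n-1] );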

-Mat