Questions on incorrect results with OpenACC on GPU

Hello,

I am new to GPU programming and I am trying to port my code to the GPU. I started by porting one function of my code to the GPU with OpenACC, and I got wrong results.

Before entering this function, I move data explicitly to the device, and afterwards I copy some arrays back to the host:

        #pragma acc enter data copyin(sifq[0:2*nfc],sxq[0:nx*nq],sq[0:nv*nq],sdxdx[0:nx*nx*nq],sdqdx[0:nx*nv*nq], \
                                      saux[0:naux*nq],srhs[0:nv*nq],sxc[0:nx*nfc],swnc[0:(nx+1)*nfc],swxdc[0:nfc], \
                                      sauxf[0:nauxf*nfc])
        obj->compute( ics,ice, sifq, sxq,sq,sdxdx,sdqdx,saux,srhs,
                      sxc,swnc,swxdc,sauxf, nfc, nq );
        #pragma acc exit data copyout(srhs[0:nv*nq],sauxf[0:nauxf*nfc])

In this “compute” function:

 void cClass::compute( Int ics,Int ice, Int *sicq,  Real *sxq, Real *sq, Real *sdxdx, Real *sdqdx, Real *saux, Real *srhs,
                                       Real *sxc, Real *swc, Real *swxdc, Real *sauxc, Int nfc, Int nq )
{
  Real            dql[MxNVs],dqr[MxNVs];
  Real            dql0[MxNVs],dqr0[MxNVs];
  Real            auxl[MxNVs],auxr[MxNVs];
  Int             ix,iql,iqr,ia;
  Real            xn[3],wn[3];
  Real            al,unl,rl,ll1,ll3,ll4,pl,hl,fl[MxNVs],ql[MxNVs],dhl,dal;
  Real            ar,unr,rr,lr1,lr3,lr4,pr,hr,fr[MxNVs],qr[MxNVs],dhr,dar;
  Real            le1,le3,le4;
  Real            aa,a2a,ra,ha,ka,qa[MxNVs],ana[3],una,unaa,raa, la1,la4,la3,lmax,fa[MxNVs];
  Real            dw1,dw3,dw4,dw2[3],dw2a,dr,dq[MxNVs],dun,dp,dpa;
  Real            dur[3],dunr,dpr,drr,dtr,tr,dt;
  Real            dwl[MxNVs],dwr[MxNVs];
  Real            mr,ml,wl,wr,cp;
  Real            f[MxNVs], dw5[MxNVs];

  cp= some_value;
  #pragma acc parallel loop \
   private(dql0,dqr0,\
           dql,dqr,\
           auxl,auxr,\
           ix,iql,iqr,ia,\
           xn,wn,\
           al,unl,rl,ll1,ll3,ll4,pl,hl,fl,ql,dhl,dal, \
           ar,unr,rr,lr1,lr3,lr4,pr,hr,fr,qr,dhr,dar, \
           le1,le3,le4, \
           aa,a2a,ra,ha,ka,qa,ana,una,unaa,raa, la1,la4,la3,lmax,fa, \
           dw1,dw3,dw4,dw2,dw2a,dr,dq,dun,dp,dpa, \
           dur,dunr,dpr,drr,dtr,tr,dt, \
           dwl,dwr, \
           mr,ml,wl,wr,cp, \
           f, dw5 ) \
           present(sq[0:nv*nq],sauxc[0:nauxf*nfc], saux[0:naux*nq], \
                   sxq[0:nx*nq],srhs[0:nv*nq],sdqdx[0:nv*nx*nq],sicq[0:2*nfc], \
                   sxc[0:nx*nfc],swc[0:(nx+1)*nfc],swxdc[0:nfc],sdxdx[0:nx*nx*nq])

  for( Int ic=ics;ic<ice;ic++ )
  {
      //some computations
      .....
      //scatter results to srhs
      #pragma acc atomic
      srhs[ADDR_(0,iql,nq)]-= f[0];
      .....

      #pragma acc atomic
      srhs[ADDR_(0,iqr,nq)]+= f[0];
      .....
  }
}

In this function, “Real” is “double”, “Int” is “int”, and “ADDR_” is a macro function.

Since I am only modifying this function, if the whole code produces wrong results in the end, the error can only come from this function. I have tried compiling the code for the CPU with “-acc=multicore” and I get the correct results. It is only when I compile it for the GPU that the results become wrong, but they don’t change if I run the code a few times, so the “atomic” seems to be working.

I am using a V100 card and my HPC SDK version is 22.2. I compile the code with the following options: -fast -acc -Minfo=accel -gpu=cc70,nordc $(DEFS) -fPIC -Wall

Thanks for your help in advance,
Feng

Hi Feng,

Unfortunately there’s not enough information here to give a definitive answer as to why you’re getting incorrect answers. These are often caused by incorrect data synchronization, race conditions in the code, or differences in order of operations when running massively parallel.

How “wrong” are the answers? Very wrong or just slightly off?

For very wrong, try adding “-gpu=managed” so the CUDA driver will handle the data movement of allocated data. If this fixes the problem, look for missing arrays in your data region.
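
For example, with the flags you listed, that would look something like the following (I’m assuming nvc++ as the compiler driver since the code is C++; adjust to whatever driver you’re using):

    nvc++ -fast -acc -Minfo=accel -gpu=cc70,nordc,managed $(DEFS) -fPIC -Wall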

Since subtraction isn’t commutative and atomics don’t enforce order of operations, this could be an issue depending on the values.

Also, rounding error can differ depending on the order of operations. Hence if the answers are slightly wrong, this could be a factor. It may not be as prominent with multicore given there are far fewer threads.

For race conditions, look for cases where you’re missing the atomic or not privatizing an array.

Note that scalars are firstprivate by default (with a few exceptions), so including them in the “private” clause isn’t necessary. Normally it doesn’t hurt to include them in a private clause (leaving them out just saves some typing and cleans up the code a bit), but if any of the values need to be initialized, then you’ll want to use “firstprivate” instead. “firstprivate” data is initialized to the value from the host, while “private” data is not.

Finally, you shouldn’t need to use “nordc” unless you’re putting this code into a shared object or linking with a different compiler. This wouldn’t cause the wrong answers, but it does inhibit the use of device function calls and global data.

If none of these suggestions help find the issue, you’ll need to provide a reproducing example. Preferably as small as possible, but I don’t mind looking at full applications if you’re unable to reduce the example.

Hope this helps,
Mat

Hi Mat,

Many thanks for your help!

I am compiling with “nordc” because my code is a set of libraries, and they are linked dynamically. This “compute” function is one of the hot spots.

I tried your suggestions one by one, and the one that works for me is adding “managed”. So it is a data movement issue.

I believe “managed” only works for arrays that are on the heap. The function I am trying to parallelize only takes dynamic arrays from its argument list, and I’ve double-checked the arrays in the argument list of “compute” against the “enter data copyin” and “present()” clauses; I think they are all there and consistent. I am wondering what I have missed. Or is there a way to detect this?

Many thanks for your help in advance.
Feng

I just had a thought. Is “compute” called multiple times?

You only have two of the arrays in the exit data directive, so the others remain present on the device. Hence, if the function is called a second time, they won’t be copied in again. Assuming the values changed on the host, they would then be incorrect the second time through. To fix this, you’d want to put the other arrays in a “delete” clause so they are removed from the device.
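
For example, a sketch of the call site keeping your array names and sizes, copying back only what the host needs and deleting the rest:

    #pragma acc enter data copyin(sifq[0:2*nfc],sxq[0:nx*nq],sq[0:nv*nq],sdxdx[0:nx*nx*nq],sdqdx[0:nx*nv*nq], \
                                  saux[0:naux*nq],srhs[0:nv*nq],sxc[0:nx*nfc],swnc[0:(nx+1)*nfc],swxdc[0:nfc], \
                                  sauxf[0:nauxf*nfc])
    obj->compute( ics,ice, sifq, sxq,sq,sdxdx,sdqdx,saux,srhs,
                  sxc,swnc,swxdc,sauxf, nfc, nq );
    // copy back only the results the host needs
    #pragma acc exit data copyout(srhs[0:nv*nq],sauxf[0:nauxf*nfc])
    // remove the remaining arrays so the next call's copyin actually transfers fresh host data
    #pragma acc exit data delete(sifq,sxq,sq,sdxdx,sdqdx,saux,sxc,swnc,swxdc)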

Is there a way to detect this?

There’s a couple of ways:

  • Look at the compiler feedback messages (-Minfo=accel) to see if there are any implicit copies.
  • Add “default(none)” to the parallel loop. This does mean that you need to make sure to include all arrays in a data clause, but the compiler will give an error if you’re missing one.
  • Add “default(present)”. This treats any array not listed in a data clause as if it were in a “present” clause, so it must already be on the device. If it isn’t, the runtime will give an error during execution.

Personally I use “default(present)” quite often, mostly so I don’t need to type as much, but it’s also nice to use “default(none)” if you want the error to occur at compile time.
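
For example, a minimal sketch (the array “a” and size “n” are just placeholders):

    void inc(double *a, int n)
    {
        // the runtime raises a "not present" error here if a[] was never placed on the device
        #pragma acc parallel loop default(present)
        for (int i = 0; i < n; ++i)
            a[i] += 1.0;
    }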

Another thing I like to do with libraries is write a few routines which handle the creation/deletion of the arrays, as well as two others to handle the “updates”. This does mean the main program needs to call these routines, which if forgotten can lead to problems, but it allows for greater control of when and where data is synchronized with the device, giving better performance. No idea if this design is right for your library, but something to consider.
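
As a rough sketch, with hypothetical routine names and a single array (adapt to your own data structures):

    void create_device(double *arr, int n)
    {
        #pragma acc enter data copyin(arr[0:n])   // allocate on the device and copy the host values
    }

    void update_host(double *arr, int n)
    {
        #pragma acc update self(arr[0:n])         // refresh the host copy, e.g. for monitoring
    }

    void update_device(double *arr, int n)
    {
        #pragma acc update device(arr[0:n])       // push new host values to the device
    }

    void delete_device(double *arr, int n)
    {
        #pragma acc exit data delete(arr[0:n])    // remove the device copy
    }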

Hi Mat,

Hope you had a good weekend!

Thanks for your advice, it is very helpful. Yes, this function is called many times. I did not realize that “copyin” will not actually copy the data from the host to the device if that pointer is already present on the device. I have added a “delete” clause after the “copyout”, so the results now look good to me. This is a cumbersome fix, but it helps me to understand how things work.

I think I need to create functions like “create_device”, “delete_device”, “update_host(or device)” to do it in a clean way.

Once again, thanks for your help!
Feng

Yes, the “present_or” semantics are probably the most difficult thing for new OpenACC users to understand.

Basically, all “copy” clauses will only create and copy the data if it is not already present on the device. If it is present, only the reference counter is incremented.

If the arrays were created each time, you’d end up with multiple copies of the arrays on the device, just like if you malloc a pointer twice without freeing the first allocation: you’d have a memory leak and inaccessible data. Note that, just like malloc/free or new/delete, any variable in an “enter data” directive should also appear in a corresponding “exit data” directive.

I think I need to create functions like “create_device”, “delete_device”, “update_host(or device)” to do it in a clean way.

Yes, this is my preference. You could have solved the above issue using a structured data region as well, but then the data would be created and copied each time the function is called. Since data movement is often the most expensive part of using a GPU, it’s best to give the user control over when the device data is created and copied.
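
For reference, a structured data region would look something like this minimal sketch (hypothetical routine and array names):

    void scale(double *v, int n, double s)
    {
        #pragma acc data copy(v[0:n])      // v is created and copied in at the '{' ...
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                v[i] *= s;
        }                                  // ... and copied back and freed at the '}', on every call
    }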

-Mat

Hi Mat,

Thanks for your advice. It has been very helpful!

Eventually, I would like to offload the data to the GPU at the beginning, copy some data back from the device to the host during the run to monitor the simulation, and copy the relevant data back at the end of the computation. I have quite a few functions and I am porting them one by one; the one mentioned above is just one of them. For each function, I do:

        #pragma acc enter data copyin (....)
        call_my_function
        #pragma acc exit data copyout (.....)
        #pragma acc exit data delete (......)

I have been getting results consistent with my CPU results. The “copyin”, “copyout” and “delete” are intermediate steps; eventually they will be removed, and there will be a single one when I start the computation and a single one at the very end of the computation. Besides, I thought “delete” would free the memory on the device (which was created by “copyin”); am I correct?

I have another question on moving data from the host to the device. Suppose I have a 2D array A[ig][ip], where ig is in the range [0,ng) and ip is in the range [0,nbb[ig]). nbb[ig] varies with ig, so it is a 2D array with an irregular shape.

I am thinking of something like a loop:

    for (ig=0; ig<ng; ig++)
    {
        #pragma acc enter data copyin(A[ig][0:nbb[ig]])
    }

This looks odd to me and the compiler does not like it. Is there a way to copy this 2D array to the device?

Many Thanks,
Feng

First, “copyout” will delete the data as well, so here you’re actually deleting the memory twice. I don’t believe it will cause a problem since the data will most likely not be present, so the second delete will be a no-op, but it’s not needed.

This may be too much detail, but there’s a reference counter associated with each variable. Each time it enters a data region, the counter is incremented, and it is decremented upon exit of the region. It’s not until the reference counter reaches zero that the delete actually happens. So in the above example, if you had an “enter data create” before the “copyin”, then the second delete would be needed. The exception is if a “finalize” clause is added to the “exit data”, in which case the delete occurs no matter the reference count.

Also, given that device allocation is expensive, the compiler runtime doesn’t actually free the device memory. Rather, it adds it back to its device memory pool and reuses it later. There’s an environment variable that can be used to disable this, if needed.

This looks odd to me and the compiler does not like it. Is there a way to copy this 2D array to the device?

This looks ok to me given it’s a jagged array. If it were a rectangular array, you could just use “arr[:size1][:size2]”, but for jagged arrays each row needs to be created separately. You are missing the creation of A’s first dimension, but I’m not sure if you just omitted it from the snippet, or don’t include it in the actual code.

What error are you getting?

Below is an updated example of my jagged array example code from the Parallel Programming with OpenACC book. The original example no longer works due to some clarifications in the OpenACC standard and the compiler updates made to match them. Most of these compiler updates were added mid last year, so if you’re using an older compiler version, I’d suggest updating to a more recent release.

Example of creating jagged arrays:

#include <stdlib.h>
#include <stdio.h>

#ifndef N
#define N 32
#endif

double ** allocData(size_t size1, int * sizes);
void deleteData(double ** A, size_t size1);
void initData(double **A, size_t size1, int * sizes, double val);
void printData(double **A, size_t size1, int * sizes);

int main() {

    double **A, **B;
    int * sizes;
    size_t size1, i, j;
    size1 = N;

    sizes=(int*) malloc(sizeof(int)*size1);
    for (i=0; i < size1; ++i) {
       sizes[i]=i+10;
    }
#pragma acc enter data copyin(sizes[0:size1])
    A=allocData(size1,sizes);
    B=allocData(size1,sizes);
    initData(B,size1,sizes,2.5);

/* Perform the computation on the device */
#pragma acc parallel loop gang present(A,B,sizes)
    for (j=0; j < size1; ++j) {
       int size2 = sizes[j];
#pragma acc loop vector
       for (i=0; i < size2; ++i) {
          A[j][i] = B[j][i] + (double) ((j*size2)+i);
       }
    }
#ifdef _OPENACC
/* Copy back the results */
    for (j=0; j < size1; ++j) {
#pragma acc update self (A[j][0:sizes[j]])
    }
#endif

    printData(A,size1,sizes);
    deleteData(A,size1);
    deleteData(B,size1);
#pragma acc exit data delete(sizes)
    free(sizes);
    exit(0);
}

double ** allocData(size_t size1, int * sizes) {
    double ** tmp;
    int i;
    tmp = (double **) malloc(size1*sizeof(double*));
/* Create an array of pointers */
#pragma acc enter data create(tmp[0:size1])
    for (i=0; i < size1; ++i) {
       tmp[i] = (double *) malloc(sizes[i]*sizeof(double));
/* Create the vector and attach it to the pointer array */
#pragma acc enter data create(tmp[i][0:sizes[i]])
    }
    return tmp;
}

void deleteData(double ** A, size_t size1) {
    int i;
    for (i=0; i < size1; ++i) {
#pragma acc exit data delete(A[i])
       free(A[i]);
  }
#pragma acc exit data delete(A)
    free(A);
}

void initData(double **A, size_t size1, int * sizes, double val) {
    size_t i,j;
    for (j=0; j < size1; ++j) {
       int size2=sizes[j];
       for (i=0; i < size2; ++i) {
          A[j][i] = val;
       }
/* Update the device with the initial values */
#pragma acc update device(A[j][0:size2])
    }
}

void printData(double **A, size_t size1, int * sizes) {
    size_t i;
    printf("Values:\n");
    for (i=0; i < 5; ++i) {
        int last = sizes[i]-1;
        printf("A[%d][0]=%f A[%d][%d]=%f\n",i,A[i][0],i,last,A[i][last]);
    }
    printf("....\n");
    for (i=size1-5; i < size1; ++i) {
        int last = sizes[i]-1;
        printf("A[%d][0]=%f A[%d][%d]=%f\n",i,A[i][0],i,last,A[i][last]);
    }
}

Building and running the example using NV HPC v22.3:

% nvc -fast -acc -Minfo=accel  jagged.c
main:
     25, Generating enter data copyin(sizes[:size1])
     31, Generating present(A[:][:],sizes[:],B[:][:])
         Generating NVIDIA GPU code
         31, #pragma acc loop gang /* blockIdx.x */
         34, #pragma acc loop vector(128) /* threadIdx.x */
     34, Loop is parallelizable
     40, Accelerator clause: upper bound for dimension 0 of array 'A' is unknown
         Generating update self(A[:1][:sizes->])
     49, Generating exit data delete(sizes[:1])
allocData:
     59, Generating enter data create(tmp[:size1])
     63, Generating enter data create(tmp->[:sizes->])
deleteData:
     72, Generating exit data delete(A->[:1][:1])
     74, Generating exit data delete(A[:1][:1])
initData:
     86, Accelerator clause: upper bound for dimension 0 of array 'A' is unknown
         Generating update device(A[:1][:size2])
% ./a.out
Values:
A[0][0]=2.500000 A[0][9]=11.500000
A[1][0]=13.500000 A[1][10]=23.500000
A[2][0]=26.500000 A[2][11]=37.500000
A[3][0]=41.500000 A[3][12]=53.500000
A[4][0]=58.500000 A[4][13]=71.500000
....
A[27][0]=1001.500000 A[27][36]=1037.500000
A[28][0]=1066.500000 A[28][37]=1103.500000
A[29][0]=1133.500000 A[29][38]=1171.500000
A[30][0]=1202.500000 A[30][39]=1241.500000
A[31][0]=1273.500000 A[31][40]=1313.500000

Hope this helps,
Mat

Hi Mat,

The functions are called many times. If I don’t put the “#pragma acc exit data delete” after “#pragma acc exit data copyout”, I don’t get the correct results somehow. In the end, I will do those in one go only once, and put the “delete” in the destructor (I believe “finalize” might be a better option here if I want to free the memory on the device?).

I’ve tried your way of copying the jagged array to the device in my code. It seems to work well!

Once again, many thanks for your help.
Feng

But with different variables in each, correct? What I meant was that if the same variables were used in both the copyout and the delete, the delete would be redundant. I should have been clearer.

Finalize is only really needed if the variable is used multiple times on an enter data directive, but once on an exit. Something like:

#pragma acc enter data copyin(Arr[:size])   << Create and copy Arr to device, set ref count = 1
... some code ...
#pragma acc enter data copyin(Arr[:size])   << Arr is present, so no create or copy is performed, but ref count = 2
... more code ...
#pragma acc exit data delete(Arr)           << Arr ref count now = 1, not deleted unless finalize is added

The ref counter is an important feature since it allows for nesting of data regions where the same variable can be used more than once, especially given that there are four types of data regions: structured, unstructured, “declare”, and the implicit data region on compute regions.

It’s rare to find programs that use finalize since most codes correctly match enter/exit regions, but when you move to using subroutines in your library to create and delete data, you might consider using it since the user could call the create routine more than once. Doubtful, but possible, so adding finalize could add some robustness. Though, probably not necessary.

I’ve probably given too much detail and confused things, sorry.

I’ve tried your way of copying the jagged array to the device in my code. It seems to work well!

Excellent! Glad the examples I wrote ~8 years ago are still helpful.

-Mat

Hi Mat,

Yes, with different variables. I went through your comments and the details really helped me to understand it better. Regarding your previous comment on “firstprivate”:

Note that scalars are firstprivate by default (with a few exceptions), so including them in the “private” clause isn’t necessary. Normally it doesn’t hurt to include them in a private clause (leaving them out just saves some typing and cleans up the code a bit), but if any of the values need to be initialized, then you’ll want to use “firstprivate” instead. “firstprivate” data is initialized to the value from the host, while “private” data is not.

Suppose I have a function like:

void func(double var1, double var2, ........)
{
    double var3;
    var3 = some_value;
    //openacc directives start from here
}

var1, var2 and var3 are all scalars. var1 and var2 are values passed to this function, and var3 is initialized before the OpenACC directives. Do I need to include var1, var2 and var3 in “firstprivate”?

Many thanks,
Feng

Yes, or let the compiler implicitly privatize them by not including them in any clause. Just don’t include them in a “private” clause, where they would become uninitialized.

I should note that the “firstprivate” clause is only allowed on a “parallel” or “serial” construct. For “kernels”, the default is for these scalars to be included in a “copy” clause. Slightly different, but it generally has the same effect. It’s similar to passing a variable to a subroutine by value (firstprivate) or by reference (copy).
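
For example, a minimal sketch of your “func” (the array “a” and its size “n” are just placeholders I’ve added):

    void func(double var1, double var2, double *a, int n)
    {
        double var3 = var1 + var2;   // initialized on the host before the compute region
        // On "parallel", the scalars are implicitly firstprivate, so no clause is needed.
        // Listing them in "firstprivate" (not "private") also keeps the host values:
        #pragma acc parallel loop firstprivate(var1, var2, var3) present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = var1*a[i] + var2 + var3;
    }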

-Mat

Hi Mat,

Sorry for the slow reply. Thanks for your explanation!

Feng

Hi Mat,

I am trying to parallelize a nested loop and got results inconsistent with my CPU results. Could you please shed some light on where I made a mistake?

    //#pragma acc loop seq
    #pragma acc parallel loop \
     private(tmp)\
     present (isrc[0:nqdst],sdxdxsrc[0:nx*nx*nqsrc],sdqdxsrc[0:nx*nv*nqsrc],sdxdxdst[0:nx*nx*nqdst],sdqdxdst[0:nx*nv*nqdst],this)\
     default(none)
     for( iq=iqs;iq<iqe;iq++ )
    {
        i0= isrc[iq];
        i1= iq;
        for( ix=0;ix<3;ix++ )
       {   
           for( jx=0;jx<3;jx++ )
          {   
              sdxdxdst[ADDR(ix,jx,i1,nqdst)]= sdxdxsrc[ADDR(ix,jx,i0,nqsrc)];
          }
           
           for( iv=0;iv<nv;iv++ )
          {   
              sdqdxdst[ADDR(iv,ix,i1,nqdst)]= sdqdxsrc[ADDR(iv,ix,i0,nqsrc)];
          }
       }

        for( ix=0;ix<3;ix++ )
       {
           for( jx=0;jx<3;jx++ )
          {
              tmp[jx]= sdxdxdst[ADDR(ix,jx,iq,nqdst)];
          }
           coo->voffset(  f, tmp );
           for( jx=0;jx<3;jx++ )
          {
              sdxdxdst[ADDR(ix,jx,iq,nqdst)] = tmp[jx];
          }
       }

        for( ix=0;ix<3;ix++ )
       {
           for( jx=0;jx<3;jx++ )
          {
              tmp[jx]= sdxdxdst[ADDR(jx,ix,iq,nqdst)];
          }
           coo->voffset( f, tmp );
           for( jx=0;jx<3;jx++ )
          {
              sdxdxdst[ADDR(jx,ix,iq,nqdst)] = tmp[jx];
          }
       }
        for( iv=0;iv<nv;iv++ )
       {
           for( ix=0;ix<3;ix++ )
          {
              tmp[ix]= sdqdxdst[ADDR(iv,ix,iq,nqdst)];
          }
           coo->voffset( f, tmp );
           for( ix=0;ix<3;ix++ )
          {
              sdqdxdst[ADDR(iv,ix,iq,nqdst)] = tmp[ix];
          }
       }
        for( ix=0;ix<3;ix++ )
       {
           for( iv=0;iv<nvel;iv++ )
          {
              tmp[iv]= sdqdxdst[ADDR(iv,ix,iq,nqdst)];
          }
           coo->voffset( f, tmp );
           for( iv=0;iv<nvel;iv++ )
          {
              sdqdxdst[ADDR(iv,ix,iq,nqdst)] = tmp[iv];
          }
       }
    }

The function

             coo->voffset( f, tmp );

is a trivial function that does a coordinate transformation, and it is declared in the header file as:

     #pragma acc routine seq
     virtual void voffset(double f, double *var );

If I use “#pragma acc loop seq” so it runs sequentially, I get the same results as my CPU results. Once I use “parallel loop”, the results become quite different. I cannot understand why this is so.

Many thanks in advance,
Feng

Hi Feng,

Often inconsistent results are due to race conditions in the code. While I can’t tell for sure what’s wrong, I see two potential issues.

  1. Need for atomics?
       for( jx=0;jx<3;jx++ )
      {   
          sdxdxdst[ADDR(ix,jx,i1,nqdst)]= sdxdxsrc[ADDR(ix,jx,i0,nqsrc)];
      }

You don’t show what the ADDR macro is, but one possible issue is if more than one thread uses the same index. I’d assume that each thread uses its own index, but if not, you’d want to add “#pragma acc atomic update” before each of these assignments.

  2. Auto-parallelization of the inner loops

Since you don’t define the loop schedule, i.e. “gang”, “worker”, “vector”, or “seq”, on the outer loop, the compiler will implicitly parallelize the inner loops if it can. The typical strategy is to apply gang to the outermost loop and vector to the inner stride-1 loops. So what could be happening is that the inner loops are getting parallelized, and since “tmp” is private to the gang, this could cause a race condition.

The compiler will show you how it’s scheduling the loops via the compiler feedback messages, i.e. add “-Minfo=accel” to your flags. Look to see how the loops are getting scheduled, or post the output if you need help interpreting it.

If this is indeed the case, add “gang vector” to your outer loop to explicitly set the schedule and the inner loops will be run sequentially.
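
For example, using the same directive from your code with the explicit schedule added (loop body unchanged):

    #pragma acc parallel loop gang vector \
     private(tmp) \
     present(isrc[0:nqdst],sdxdxsrc[0:nx*nx*nqsrc],sdqdxsrc[0:nx*nv*nqsrc],sdxdxdst[0:nx*nx*nqdst],sdqdxdst[0:nx*nv*nqdst],this) \
     default(none)
     for( iq=iqs;iq<iqe;iq++ )
    {
        // body as before; the inner ix/jx/iv loops now run sequentially within each thread,
        // so the private "tmp" is no longer shared across parallelized inner iterations
    }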

 #pragma acc routine seq
 virtual void voffset(double f, double *var );

We don’t yet support virtual functions on the device. There have been some changes in the CUDA driver and devices that now make it possible (it wasn’t before), but this work is still ongoing and there’s no ETA for when it will be ready. It’s fine when targeting the host, but I’d suggest not using virtual functions in OpenACC compute regions until we add this support.

Note that if none of these suggestions help find the issue, I’d need a reproducing example in order to determine the cause.

-Mat

Hi Mat,

Many thanks for your help. I’ve tried your suggestions and the problem comes from the virtual function. I guess the version of “voffset” from the base class is being used; is that why it produces the wrong results?

Regarding your comments on the loops, I have uploaded a sample code which tries to parallelize this function. Just out of curiosity, what is the proper way to parallelize it? For the moment, I only use the simple “#pragma acc parallel loop”.

Thanks,
Feng

main.cpp (4.8 KB)
Makefile (269 Bytes)

Sorry, I’m not sure. Since we don’t support virtual functions in device code, I don’t use them.

I have uploaded a sample code which tries to parallel this function. just for my curiosity, what is the proper way to parallel this function? for the moment, I only use the simple “#pragma acc parallel loop”

This should be fine. I still might add “gang vector” on the parallel loop. We can’t auto-parallelize the inner loops right now because of the computed index, so “vector” won’t be applied to them; but in case that changes, you wouldn’t want the very small 3x3 inner loops parallelized anyway. Also, the stride-1 dimension, i.e. “k” in the ADDR macro, corresponds to the “iq” index, so you do want to make sure “vector” is on the “iq” loop.

The code will fail the present check when offloading, given you’re missing an outer data region. Also, performance will be poor given the loop trip count is only 20. This simply isn’t enough work for a GPU, and the data movement cost will dominate performance. Ideally “nq” would be in the thousands, if not tens of thousands.

-Mat

Hi Mat,

Thanks for your advice. It has always been very helpful! I was taking the shortcut of compiling the sample code with “-gpu=managed”.

Thanks,
Feng

Sorry, I didn’t look at your makefile, just the code.

Hi Mat,

Thanks for your help. I have managed to port my library to the GPU and the speed-up is very promising.

I noticed that, for small test cases (degrees of freedom (DOF) around a few thousand), the CPU and GPU results are (almost) identical. This is what I have been checking while porting the code step by step, to make sure I get the right results on the GPU.

With the same code, if I now try a much larger case, whose DOF is more than one million, I start seeing small differences between the CPU and GPU results. Does this kind of behavior make sense to you?

Thanks,
Feng