Avoid reallocating memory on the GPU.

JPMN · November 4, 2012, 2:33pm

Hello.

I’m trying to parallelize a program using OpenACC (pgcc).

My program calls a certain function (let’s call it function1()) many times. This function always uses the same amount of memory each time it’s called.

Assume that function1() is something like:

void function1(){

#pragma acc data copyin(ArraysIn[sizeOfArrays]), copyout(ArrayOut[sizeOfArray])

#pragma acc kernels
for(){
    #pragma acc kernels
    for(){
        #pragma acc kernels
        for(){
            /*Do some work here*/
        }
    }
}
}

Since this function is called several times, each time I call it I’m allocating the arrays I need on the GPU. I want to allocate these arrays only once and reuse them so that I don’t need to keep spending time allocating again.

How can I do this, having in mind that the compiler can’t determine the size of the arrays automatically?

Will something like this work?

main{
#pragma acc data create(ArraysIn[sizeOfArrays], ArrayOut[sizeOfArray])

for(){
function1();
}
}

And then change function1() to:

void function1(){

#pragma acc update device(ArraysIn[sizeOfArrays])

#pragma acc kernels
for(){
    #pragma acc kernels
    for(){
        #pragma acc kernels
        for(){
            /*Do some work here*/
        }
    }
}

#pragma acc update host(copyout(ArrayOut[sizeOfArray]))
}

Will this work? Will it work even if this call (#pragma acc data create(ArraysIn[sizeOfArrays], ArrayOut[sizeOfArray])) is made from some other function call before function1()?

MatColgrove · November 5, 2012, 11:34pm

Will this work?

This is the right idea. You just need to add a data region in function1 telling the runtime to look up where the array is on the device (via the “present” clause).

You can use the “update” in combination with the “present” clause. (Note that I made a few other small corrections)

void function1(ArraysIn, ArraysOut){

#pragma acc data present(ArraysIn[0:sizeOfArrays],ArraysOut[0:sizeOfArrays])
{
#pragma acc update device(ArraysIn[0:sizeOfArrays])

#pragma acc kernels
for(){
    for(){
        for(){
            /*Do some work here*/
        }
    }
}

#pragma acc update host(ArraysOut[0:sizeOfArrays])
}
}

You can use “pcreate” (present or create) instead of “present” if you don’t know whether this routine is called from within another data region. With just “present”, you’d get a runtime error if the data isn’t already on the device. With “pcreate”, the arrays are created if not found.
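To make the whole pattern concrete, here is a minimal sketch under some assumptions: scale_by_two and run_many are hypothetical names, the arrays are plain float buffers, and the clause is spelled out as present_or_create (pcreate is its short form). If the file is compiled without OpenACC enabled, the pragmas are ignored and the code simply runs on the host.

```c
/* Hypothetical worker: writes 2*in[i] into out[i]. */
void scale_by_two(float *in, float *out, int n)
{
    /* Reuse the device copies if an enclosing data region already made them;
       otherwise allocate them just for this region. */
    #pragma acc data present_or_create(in[0:n], out[0:n])
    {
        /* Host -> device: the contents change on every call. */
        #pragma acc update device(in[0:n])

        #pragma acc kernels loop independent
        for (int i = 0; i < n; i++)
            out[i] = 2.0f * in[i];

        /* Device -> host: fetch the result. */
        #pragma acc update host(out[0:n])
    }
}

/* Hypothetical driver: one allocation on the device, many calls. */
float run_many(float *in, float *out, int n, int calls)
{
    float sum = 0.0f;

    /* The device buffers are allocated once here and live for the whole loop. */
    #pragma acc data create(in[0:n], out[0:n])
    {
        for (int c = 0; c < calls; c++) {
            for (int i = 0; i < n; i++)
                in[i] = (float)(c + i);   /* new host data for each call */
            scale_by_two(in, out, n);
            sum += out[n - 1];
        }
    }
    return sum;
}
```

Inside run_many the worker finds the buffers already present, so no allocation happens per call. Called on its own, the same worker would allocate them itself.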

Hope this helps,
Mat

JPMN · January 21, 2013, 1:15pm

I’m sorry to bother you again with the same topic, but it took me some time to be able to actually test this and this doesn’t seem to work at all.

This compiles correctly but when I try to run it, I get the following:

FATAL ERROR: data in PRESENT clause was not found: name=clP

clP is one of the arrays I need to allocate on the GPU.

The problem here seems to be that the main function is in one .c file and function1 is in another .c file. Also, I don’t know if this has anything to do with it, but function1 is not called directly from the main function. Main calls functionA, functionA calls functionB and functionB calls function1.

This looks something like this:

void functionB(){function1();}

void functionA(){functionB();}

int main(int argc, char* argv[]){
for(){functionA();}
}

And function1 is what I mentioned previously.

What am I doing wrong here?

If needed I can post the actual code for these functions.

MatColgrove · January 21, 2013, 3:04pm

Hi JPMN,

You just need to add back in the outer data region surrounding the for loop in the main routine to have the arrays copied over to the device before “present” looks them up.

Mat

JPMN · January 21, 2013, 3:23pm

Hi.

First and foremost thank you for the quick reply :).

I’m not sure I understand your answer, though. There was a typo in the main function I presented in my last post; the actual function is:

int main(int argc, char* argv[]){

#pragma acc data create(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats], clP[4*numChars*numGammaCats], tiPA[16*numGammaCats], clA[4*numChars*numGammaCats], clP[4*numChars*numGammaCats])
{ 
for(){functionA();}
}
}

And then, in function1, I have the present clause followed by an update device directive to transfer the data into these arrays.

Are you saying that I need to have the data copied into the device before I can do the present clause? If that’s the case that would ruin this code because function1 always uses the same type of arrays with the same size but the data on the arrays are different from function call to function call.

MatColgrove · January 21, 2013, 8:22pm

Are you saying that I need to have the data copied into the device before I can do the present clause?

It has to be allocated on the device, so using “create” is fine. Alternatively, you can use “present_or_copy” (or “pcopy”) instead of “present”. “pcopy” will look to see if the data’s already been allocated on the device, otherwise it’s allocated and copied over.

If that’s the case that would ruin this code because function1 always uses the same type of arrays with the same size but the data on the arrays are different from function call to function call.

Note that the arrays in the “present” clause don’t have to be the same array each time function1 is called. Nor do the arrays need to start at the beginning. For example, you could create a large block of memory on the device and then pass pointers into this block and have present find the proper device data. So long as the entire array is on the device, “present” will allow it.
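As a rough illustration of that last point, the sketch below creates one large block on the device and then hands pointers into the middle of it to a worker that uses “present”. The names sum_chunk and sum_all_chunks, and the layout of equal-sized chunks, are assumptions made for the example. Without an OpenACC compiler the pragmas are ignored and the code runs on the host.

```c
/* Hypothetical worker: sums len floats starting at chunk.
   chunk may point anywhere inside a larger block that is on the device. */
float sum_chunk(float *chunk, int len)
{
    float total = 0.0f;

    /* The lookup is done by host address, so the pointer name is irrelevant. */
    #pragma acc data present(chunk[0:len])
    {
        #pragma acc update device(chunk[0:len])

        #pragma acc kernels loop reduction(+:total)
        for (int i = 0; i < len; i++)
            total += chunk[i];
    }
    return total;
}

/* Hypothetical driver: pool holds nchunks consecutive chunks of len floats. */
float sum_all_chunks(float *pool, int nchunks, int len)
{
    float grand = 0.0f;

    /* One device allocation covering the whole block. */
    #pragma acc data create(pool[0:nchunks*len])
    {
        for (int c = 0; c < nchunks; c++)
            grand += sum_chunk(pool + c * len, len);  /* pointer into the block */
    }
    return grand;
}
```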

Mat

JPMN · January 22, 2013, 3:56pm

I understand all that, but the fact is that the present clause still doesn’t find the arrays I created earlier with the data create clause. I already had that clause in my code when I asked this question. With all those clauses I still get the error:

FATAL ERROR: data in PRESENT clause was not found: name=clP

And I really don’t know why. I’ll post the actual code:

My main function is something like this:

int main(int argc, char * argv[]){
/*Other code here*/
int         numGammaCats=(&modelSettings[0])->numGammaCats, numChars=(&modelSettings[0])->numChars;

/*Other code here*/

#pragma acc data create(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats], clP[4*numChars*numGammaCats], tiPA[16*numGammaCats], clA[4*numChars*numGammaCats])
    {
for (chn=0; chn<numLocalChains; chn++)
		{
		
        
		curLnL[chn] = LogLike(chn);
		curLnPr[chn] = LogPrior(chn);
		for (i=0; i<numCurrentDivisions; i++)
			{
			if (modelSettings[i].gibbsGamma == YES)
				curLnL[chn] += GibbsSampleGamma (chn, i, seed);
			}
		}
    }
}

Where LogLike is the function I referred to as functionA.
This function calls another function called LaunchLogLikeForDivision.
This function calls one of two very similar functions, which I referred to as function1.

These are the two functions:

int CondLikeDown_NUC4_OpenACC (TreeNode *p, int division, int chain)

{
	int				c, h, i, j, k, shortCut, *lState=NULL, *rState=NULL;
	CLFlt			/**clL, *clR, *clP,*/ *pL, *pR/*, *tiPL, *tiPR*/;
	ModelInfo		*m;
    
    struct timeval start, stop;
    
    
    /*OpenACC auxiliary variables*/
    int numGammaCats=0, numChars=0/*, tiP_size=0, cl_size=0*/;
	
	m = &modelSettings[division];
    
	/* flip space so that we do not overwrite old cond likes */
	FlipCondLikeSpaceOACC (m, chain, p->index);
	
	/* find conditional likelihood pointers */
	clL = m->condLikes[m->condLikeIndex[chain][p->left->index ]];
	clR = m->condLikes[m->condLikeIndex[chain][p->right->index]];
	clP = m->condLikes[m->condLikeIndex[chain][p->index       ]];
	
	/* find transition probabilities */
	pL = m->tiProbs[m->tiProbsIndex[chain][p->left->index ]];
	pR = m->tiProbs[m->tiProbsIndex[chain][p->right->index]];
    
    /* find likelihoods of site patterns for left branch if terminal */
	shortCut = 0;

    

		case 0:
            //Count time
            gettimeofday(&start, NULL);
            numGammaCats=m->numGammaCats;
            numChars=m->numChars;
            /*tiP_size=16*numGammaCats*sizeof(CLFlt);
            cl_size=4*numChars*numGammaCats*sizeof(CLFlt);*/
	    tiPL = pL;
	    tiPR = pR;
            
            /*Check if we need memory allocation on the GPU*/
            if (firstTime==1)
                printf("firstTime -> TRUE\n");
            else
                printf("firstTime -> FALSE\n");
            
            
        
        #pragma acc data present(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats], clP[4*numChars*numGammaCats])
        {
        #pragma acc update device(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats])

        #pragma acc kernels loop independent gang(numGammaCats)
	    for (k=0; k<numGammaCats; k++)
        {
            #pragma acc loop independent gang(numChars/numGammaCats) vector(NTHREADS)/*vector(128)*/
            for (c=0; c<numChars; c++)
            {
                #pragma acc loop independent vector(4) /*gang(numGammaCats), vector(128)*/
                for(i=0; i < 4; i++){
                    register int indice = k*numChars*4+c*4;
                    register int indice2 = k*16+i+4;
                    clP[indice+i]  =  (tiPL[indice2+AA]*clL[indice+A] + tiPL[indice2+AC]*clL[indice+C] + tiPL[indice2+AG]*clL[indice+G] + tiPL[indice2+AT]*clL[indice+T])*(tiPR[indice2+AA]*clR[indice+A] + tiPR[indice2+AC]*clR[indice+C] + tiPR[indice2+AG]*clR[indice+G] + tiPR[indice2+AT]*clR[indice+T]);
                    
                }
            }
        }
            
            #pragma acc update host(clP[4*numChars*numGammaCats])
        }
        
            firstTime=0;
            printf("firstTime= %d", firstTime);
            gettimeofday(&stop,NULL);
            timersub(&stop, &start, &stop);
            timeradd(&accumulator, &stop, &accumulator);
    
    return NO_ERROR;
	
}

And:

int CondLikeRoot_NUC4_OpenACC (TreeNode *p, int division, int chain)

{
	int				a, c, h, i, j, k, shortCut, *lState=NULL, *rState=NULL, *aState=NULL;
	CLFlt			/**clL, *clR, *clP, *clA,*/ *pL, *pR, *pA/*, *tiPL, *tiPR, *tiPA*/;
	ModelInfo		*m;
    
    /*OpenACC auxiliary variables*/
    int numGammaCats=0, numChars=0;
    
    struct timeval start, stop;
	
	m = &modelSettings[division];
    
	/* flip state of node so that we are not overwriting old cond likes */
	FlipCondLikeSpaceOACC (m, chain, p->index);
	
	/* find conditional likelihood pointers */
	clL = m->condLikes[m->condLikeIndex[chain][p->left->index ]];
	clR = m->condLikes[m->condLikeIndex[chain][p->right->index]];
    clP = m->condLikes[m->condLikeIndex[chain][p->index       ]];
    clA = m->condLikes[m->condLikeIndex[chain][p->anc->index  ]];
    
	/* find transition probabilities (or calculate instead) */
	pL = m->tiProbs[m->tiProbsIndex[chain][p->left->index ]];
	pR = m->tiProbs[m->tiProbsIndex[chain][p->right->index]];
	pA = m->tiProbs[m->tiProbsIndex[chain][p->index       ]];
    
	
	shortCut = 4;
  
            //Count time
            gettimeofday(&start, NULL);
            numGammaCats=m->numGammaCats;
            numChars=m->numChars;
            tiPL = pL;
            tiPR = pR;
            tiPA = pA;
            #pragma acc data present(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats], tiPA[16*numGammaCats], clA[4*numChars*numGammaCats], clP[4*numChars*numGammaCats])
            {
            #pragma acc update device(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats], tiPA[16*numGammaCats], clA[4*numChars*numGammaCats])
            
            #pragma acc kernels loop independent gang(numGammaCats)
            for (k=0; k<numGammaCats; k++)
			{
                #pragma acc loop independent gang(numChars/numGammaCats) vector(NTHREADS)/*vector(128)*/
                for (c=0; c<numChars; c++)
			    {
                    #pragma acc loop independent vector(4) /*gang(numGammaCats), vector(128)*/
                    for (i=0; i<4; i++) {
                        register int indice = k*numChars*4+c*4;
                        register int indice2 = k*16+i+4;
                        clP[indice+i] =   (tiPL[indice2+AA]*clL[indice+A] + tiPL[indice2+AC]*clL[indice+C] + tiPL[indice2+AG]*clL[indice+G] + tiPL[indice2+AT]*clL[indice+T])
                        *(tiPR[indice2+AA]*clR[indice+A] + tiPR[indice2+AC]*clR[indice+C] + tiPR[indice2+AG]*clR[indice+G] + tiPR[indice2+AT]*clR[indice+T])
                        *(tiPA[indice2+AA]*clA[indice+A] + tiPA[indice2+AC]*clA[indice+C] + tiPA[indice2+AG]*clA[indice+G] + tiPA[indice2+AT]*clA[indice+T]);
                    }
				}
			}
            #pragma acc update host(clP[4*numChars*numGammaCats])
            }
            gettimeofday(&stop,NULL);
            timersub(&stop, &start, &stop);
            timeradd(&accumulator, &stop, &accumulator);
            
        
    
	return NO_ERROR;
	
}

These last two functions are in the mbopenacc.c file; main, LogLike and LaunchLogLikeForDivision are in a separate file. The arrays *clL, *clR, *clP, *tiPL, *tiPR, *clA, *tiPA are global variables declared in the mbopenacc.h file.

What am I doing wrong?

MatColgrove · January 23, 2013, 3:23pm

Hi JPMN,

clP and the other pointers point into larger arrays. Since “present” just looks up where the device arrays are located by association with the host arrays, once you re-assign the pointers they are no longer associated with the earlier “create”. Hence what needs to be created on, or copied to, the device are the “m->condLikes” and “m->tiProbs” arrays.

Mat

JPMN · January 24, 2013, 1:07pm

Hi mkcolg.

Thank you very much for your answer. So the problem here is that I can’t re-assign the pointer that I passed in the create clause.

Is there any way in OpenACC to tell the compiler to transfer an array named X on the CPU to those smaller arrays on the GPU (clP, etc.)?

Or is the solution to pass the entire larger array at the beginning and then tell the compiler which part of the array it’s going to access with the present clause? For example, if I wanted to access only positions 30 to 40, I would do something like:

#pragma acc data present(tiProbs[30:40])

If this is the case, would I have to change the code in CondLikeDown_NUC4_OpenACC and CondLikeRoot_NUC4_OpenACC? I mean, the computations inside the for loop with these names would no longer work, right?

clP[indice+i]  =  (tiPL[indice2+AA]*clL[indice+A] + tiPL[indice2+AC]*clL[indice+C] + tiPL[indice2+AG]*clL[indice+G] + tiPL[indice2+AT]*clL[indice+T])*(tiPR[indice2+AA]*clR[indice+A] + tiPR[indice2+AC]*clR[indice+C] + tiPR[indice2+AG]*clR[indice+G] + tiPR[indice2+AT]*clR[indice+T]);

Would I have to change the names of clP and the other pointers?
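A side note on the subarray notation used above: in OpenACC for C and C++, a subarray is written [start:length], not [first:last], so elements 30 through 40 inclusive would be [30:11]. The sketch below shows this with hypothetical helper names (range_length, negate_range). It wraps its own outer data region so that it stands alone, and it runs on the host if the pragmas are ignored.

```c
/* Number of elements in the inclusive index range [first, last]. */
int range_length(int first, int last)
{
    return last - first + 1;
}

/* Hypothetical example: negate a[first..last] on the device. */
void negate_range(float *a, int n, int first, int last)
{
    int count = range_length(first, last);

    /* The whole array is placed on the device by the outer region ... */
    #pragma acc data copy(a[0:n])
    {
        /* ... and the inner region refers to part of it as [start:length]. */
        #pragma acc data present(a[first:count])
        {
            #pragma acc kernels loop independent
            for (int i = first; i <= last; i++)
                a[i] = -a[i];
        }
    }
}
```

The loop body still indexes a[i] exactly as before; the subarray only describes which part of the host array the region refers to.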

MatColgrove · January 24, 2013, 9:40pm

Ok, let’s try to back up a bit since you’re not quite getting this.

At the time you use the data region in the main routine, you allocate memory on the device and also associate a host pointer (and its range) with the device pointer (and its range). When you use the present clause, the host pointer is used to determine which device pointer to use.

So what you need to do is allocate the larger arrays on the device in the main data region. You can then assign the local host pointers (clP) to a location in the large arrays. You would then put “clP” in the present clause with its range. Now clP’s host pointer is used to find the same location in the copy of the large array on the device.
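One way to picture the association is as a table that maps host address ranges to device allocations. The following is only a toy model in plain C and not how any particular runtime is implemented. PresentEntry and present_lookup are made-up names, and the “device” memory is just another host buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model: one entry per array named in a create/copy clause. */
typedef struct {
    uintptr_t host_base;  /* address of the first host byte covered   */
    size_t    bytes;      /* number of bytes covered                  */
    char     *dev_base;   /* start of the matching "device" buffer    */
} PresentEntry;

/* Return the device address matching the host range [host_ptr, host_ptr + bytes),
   or NULL if no entry covers the whole range (the "not found" case). */
char *present_lookup(const PresentEntry *table, size_t n,
                     const void *host_ptr, size_t bytes)
{
    uintptr_t h = (uintptr_t)host_ptr;

    for (size_t i = 0; i < n; i++) {
        const PresentEntry *e = &table[i];
        if (h >= e->host_base &&
            bytes <= e->bytes &&
            h - e->host_base <= e->bytes - bytes) {
            /* Same offset into the device buffer as into the host array. */
            return e->dev_base + (h - e->host_base);
        }
    }
    return NULL;
}
```

In this model a pointer into the middle of a registered array resolves to the same offset in the device buffer, while a pointer to memory that was never registered resolves to nothing, which corresponds to the runtime error seen earlier in the thread.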

Mat

JPMN · January 25, 2013, 4:25pm

Ok, let’s see if I got it this time.

I need to allocate the larger arrays in the main part and I’ve got this:

int main(int argc, char * argv[]){ 
/*Other code here*/ 
int         numGammaCats=(&modelSettings[0])->numGammaCats, numChars=(&modelSettings[0])->numChars;
ModelInfo		*m = &modelSettings[0];
int tiProbs_x=m->numTiProbs, tiProbs_y=m->tiProbLength, condLikes_x=m->numCondLikes, condLikes_y=m->condLikeLength; 

/*Other code here*/ 

tiProbs_gpu=m->tiProbs;
condLikes_gpu=m->condLikes;
#pragma acc data copyin(tiProbs_gpu[tiProbs_x][tiProbs_y]), copy(condLikes_gpu[condLikes_x][condLikes_y])
    { 
for (chn=0; chn<numLocalChains; chn++) 
      { 
       
        
      curLnL[chn] = LogLike(chn); 
      curLnPr[chn] = LogPrior(chn); 
      for (i=0; i<numCurrentDivisions; i++) 
         { 
         if (modelSettings[i].gibbsGamma == YES) 
            curLnL[chn] += GibbsSampleGamma (chn, i, seed); 
         } 
      } 
    } 
}

Then I should assign the clP and the other variables to parts of the larger arrays, so I do this:

/* find conditional likelihood pointers */
	clL = condLikes_gpu[m->condLikeIndex[chain][p->left->index ]];
	clR = condLikes_gpu[m->condLikeIndex[chain][p->right->index]];
	clP = condLikes_gpu[m->condLikeIndex[chain][p->index       ]];
	
	/* find transition probabilities */
	pL = tiProbs_gpu[m->tiProbsIndex[chain][p->left->index ]];
	pR = tiProbs_gpu[m->tiProbsIndex[chain][p->right->index]];

And the two functions look like this:

int CondLikeDown_NUC4_OpenACC (TreeNode *p, int division, int chain) 

{ 
   int            c, h, i, j, k, shortCut, *lState=NULL, *rState=NULL; 
   CLFlt         /**clL, *clR, *clP,*/ *pL, *pR/*, *tiPL, *tiPR*/; 
   ModelInfo      *m; 
    
    struct timeval start, stop; 
    
    
    /*OpenACC auxiliary variables*/ 
    int numGammaCats=0, numChars=0/*, tiP_size=0, cl_size=0*/; 
    
   m = &modelSettings[division]; 
    
   /* flip space so that we do not overwrite old cond likes */ 
   FlipCondLikeSpaceOACC (m, chain, p->index); 
    
   /* find conditional likelihood pointers */
	clL = condLikes_gpu[m->condLikeIndex[chain][p->left->index ]];
	clR = condLikes_gpu[m->condLikeIndex[chain][p->right->index]];
	clP = condLikes_gpu[m->condLikeIndex[chain][p->index       ]];
	
	/* find transition probabilities */
	pL = tiProbs_gpu[m->tiProbsIndex[chain][p->left->index ]];
	pR = tiProbs_gpu[m->tiProbsIndex[chain][p->right->index]];
    
    /* find likelihoods of site patterns for left branch if terminal */ 
   shortCut = 0; 

    

      case 0: 
            //Count time 
            gettimeofday(&start, NULL); 
            numGammaCats=m->numGammaCats; 
            numChars=m->numChars; 
            /*tiP_size=16*numGammaCats*sizeof(CLFlt); 
            cl_size=4*numChars*numGammaCats*sizeof(CLFlt);*/ 
       tiPL = pL; 
       tiPR = pR; 
            
            /*Check if we need memory allocation on the GPU*/ 
            if (firstTime==1) 
                printf("firstTime -> TRUE\n"); 
            else 
                printf("firstTime -> FALSE\n");
            
            
        
        #pragma acc data present(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats], clP[4*numChars*numGammaCats])
        {
        #pragma acc kernels loop independent gang(numGammaCats) 
       for (k=0; k<numGammaCats; k++) 
        { 
            #pragma acc loop independent gang(numChars/numGammaCats) vector(NTHREADS)/*vector(128)*/ 
            for (c=0; c<numChars; c++) 
            { 
                #pragma acc loop independent vector(4) /*gang(numGammaCats), vector(128)*/ 
                for(i=0; i < 4; i++){ 
                    register int indice = k*numChars*4+c*4; 
                    register int indice2 = k*16+i+4; 
                    clP[indice+i]  =  (tiPL[indice2+AA]*clL[indice+A] + tiPL[indice2+AC]*clL[indice+C] + tiPL[indice2+AG]*clL[indice+G] + tiPL[indice2+AT]*clL[indice+T])*(tiPR[indice2+AA]*clR[indice+A] + tiPR[indice2+AC]*clR[indice+C] + tiPR[indice2+AG]*clR[indice+G] + tiPR[indice2+AT]*clR[indice+T]); 
                    
                } 
            } 
        } 
            
        } 
        
            firstTime=0; 
            printf("firstTime= %d", firstTime); 
            gettimeofday(&stop,NULL); 
            timersub(&stop, &start, &stop); 
            timeradd(&accumulator, &stop, &accumulator); 
    
    return NO_ERROR; 
    
}

And this:

int CondLikeRoot_NUC4_OpenACC (TreeNode *p, int division, int chain) 

{ 
   int            a, c, h, i, j, k, shortCut, *lState=NULL, *rState=NULL, *aState=NULL; 
   CLFlt         /**clL, *clR, *clP, *clA,*/ *pL, *pR, *pA/*, *tiPL, *tiPR, *tiPA*/; 
   ModelInfo      *m; 
    
    /*OpenACC auxiliary variables*/ 
    int numGammaCats=0, numChars=0; 
    
    struct timeval start, stop; 
    
   m = &modelSettings[division]; 
    
   /* flip state of node so that we are not overwriting old cond likes */
   FlipCondLikeSpaceOACC (m, chain, p->index); 
    
  /* find conditional likelihood pointers */
  clL = condLikes_gpu[m->condLikeIndex[chain][p->left->index ]];
  clR = condLikes_gpu[m->condLikeIndex[chain][p->right->index]];
  clP = condLikes_gpu[m->condLikeIndex[chain][p->index       ]];
  clA = condLikes_gpu[m->condLikeIndex[chain][p->anc->index  ]];
    
	/* find transition probabilities (or calculate instead) */
  pL = tiProbs_gpu[m->tiProbsIndex[chain][p->left->index ]];
  pR = tiProbs_gpu[m->tiProbsIndex[chain][p->right->index]];
  pA = tiProbs_gpu[m->tiProbsIndex[chain][p->index       ]];
    
    
   shortCut = 4; 
  
            //Count time 
            gettimeofday(&start, NULL); 
            numGammaCats=m->numGammaCats; 
            numChars=m->numChars; 
            tiPL = pL; 
            tiPR = pR; 
            tiPA = pA; 
            #pragma acc data present(tiPL[16*numGammaCats], clL[4*numChars*numGammaCats], tiPR[16*numGammaCats], clR[4*numChars*numGammaCats], tiPA[16*numGammaCats], clA[4*numChars*numGammaCats], clP[4*numChars*numGammaCats]) 
            {
            
            #pragma acc kernels loop independent gang(numGammaCats) 
            for (k=0; k<numGammaCats; k++) 
         { 
                #pragma acc loop independent gang(numChars/numGammaCats) vector(NTHREADS)/*vector(128)*/ 
                for (c=0; c<numChars; c++) 
             { 
                    #pragma acc loop independent vector(4) /*gang(numGammaCats), vector(128)*/ 
                    for (i=0; i<4; i++) { 
                        register int indice = k*numChars*4+c*4; 
                        register int indice2 = k*16+i+4; 
                        clP[indice+i] =   (tiPL[indice2+AA]*clL[indice+A] + tiPL[indice2+AC]*clL[indice+C] + tiPL[indice2+AG]*clL[indice+G] + tiPL[indice2+AT]*clL[indice+T]) 
                        *(tiPR[indice2+AA]*clR[indice+A] + tiPR[indice2+AC]*clR[indice+C] + tiPR[indice2+AG]*clR[indice+G] + tiPR[indice2+AT]*clR[indice+T]) 
                        *(tiPA[indice2+AA]*clA[indice+A] + tiPA[indice2+AC]*clA[indice+C] + tiPA[indice2+AG]*clA[indice+G] + tiPA[indice2+AT]*clA[indice+T]); 
                    } 
            } 
         }
            } 
            gettimeofday(&stop,NULL); 
            timersub(&stop, &start, &stop); 
            timeradd(&accumulator, &stop, &accumulator); 
            
        
    
   return NO_ERROR; 
    
}

However I still get the runtime error:

FATAL ERROR: data in PRESENT clause was not found: name=clP

So what am I doing wrong or not getting this time?
By the way, the variables condLikes_gpu and tiProbs_gpu are globals declared in the file mbopenacc.h.

MatColgrove · January 28, 2013, 7:07pm

So what am I doing wrong or not getting this time?

You have the right idea. My guess is that in the outer main data region, the “tiProbs_gpu” and “condLikes_gpu” data are copies of the member arrays from the first element of the “modelSettings” array. However, in your routines, you’re using the arrays from “modelSettings[division]”. Unless “division” is zero, this is not the same data that was copied in the data region.

What would work well here would be an unstructured data region. So instead of a well-defined start and stop of the region, you could call some initialize routine in a loop and use a data region for every struct element’s member arrays. Since only the addresses are associated, the name of the particular array would be irrelevant. However, this feature is still just a proposal in the OpenACC 2.0 specification (http://www.openacc.org/sites/default/files/Proposed%20Additions%20for%20OpenACC%202.pdf) so it won’t be implemented until later this year.
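For reference, here is a sketch of what that could look like, assuming the enter data and exit data directives as they were adopted in OpenACC 2.0. The Model struct and the models_* functions are hypothetical stand-ins, not the real ModelInfo code. Without an OpenACC compiler the pragmas are ignored and everything runs on the host.

```c
#include <stdlib.h>

typedef struct {      /* hypothetical stand-in for one struct element */
    float *probs;     /* member array that should live on the device */
    int    len;
} Model;

/* Allocate every member array on the host and create it on the device.
   Returns 0 on success, -1 if a host allocation fails. */
int models_init(Model *models, int n, int len)
{
    for (int d = 0; d < n; d++) {
        models[d].probs = NULL;
        models[d].len = 0;
    }
    for (int d = 0; d < n; d++) {
        float *p = malloc((size_t)len * sizeof *p);
        if (p == NULL)
            return -1;
        models[d].probs = p;
        models[d].len = len;
        /* The device copy stays alive until a matching exit data. */
        #pragma acc enter data create(p[0:len])
    }
    return 0;
}

/* Multiply one model's array by factor on the device. */
void models_scale(Model *m, float factor)
{
    float *p = m->probs;   /* local alias: only the address matters */
    int len = m->len;

    #pragma acc data present(p[0:len])
    {
        #pragma acc update device(p[0:len])

        #pragma acc kernels loop independent
        for (int i = 0; i < len; i++)
            p[i] *= factor;

        #pragma acc update host(p[0:len])
    }
}

/* Release the device and host copies of every member array. */
void models_free(Model *models, int n)
{
    for (int d = 0; d < n; d++) {
        float *p = models[d].probs;
        int len = models[d].len;
        if (p == NULL)
            continue;
        #pragma acc exit data delete(p[0:len])
        free(p);
        models[d].probs = NULL;
    }
}
```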

In the meantime, you may need to take the performance hit and copy the member arrays each time they are needed or reorganize your data into coalesced blocks (i.e. convert your array of structs to a struct of arrays.)
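A minimal sketch of the reorganization idea, assuming each struct element owns one float array: the arrays are packed back to back into a single contiguous block with a table of offsets. PackedArrays, pack_arrays and division_ptr are hypothetical names chosen for this example.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    float *pool;     /* one contiguous block holding every division's array */
    int   *offset;   /* offset[d] = index in pool where division d starts   */
    int    ndiv;     /* number of divisions                                 */
    int    total;    /* total number of floats in pool                      */
} PackedArrays;

/* Copy ndiv separate arrays (lengths in lens) into one block.
   Returns 0 on success, -1 if allocation fails. */
int pack_arrays(float *const *arrays, const int *lens, int ndiv, PackedArrays *out)
{
    int total = 0;
    for (int d = 0; d < ndiv; d++)
        total += lens[d];

    out->pool = malloc((size_t)total * sizeof *out->pool);
    out->offset = malloc((size_t)ndiv * sizeof *out->offset);
    if (out->pool == NULL || out->offset == NULL) {
        free(out->pool);
        free(out->offset);
        out->pool = NULL;
        out->offset = NULL;
        return -1;
    }

    int pos = 0;
    for (int d = 0; d < ndiv; d++) {
        out->offset[d] = pos;
        memcpy(out->pool + pos, arrays[d], (size_t)lens[d] * sizeof(float));
        pos += lens[d];
    }
    out->ndiv = ndiv;
    out->total = total;
    return 0;
}

/* Host pointer to the start of division d inside the block. */
float *division_ptr(const PackedArrays *pk, int d)
{
    return pk->pool + pk->offset[d];
}
```

With this layout, a single data region over pool[0:total] in main covers every division, and a pointer returned by division_ptr can be used in a “present” clause regardless of which division it belongs to.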

Mat
