Using a structure inside a for loop with OpenACC

Hi all,

I’m new to OpenACC. I’m currently trying to use a structure inside an OpenACC for loop, and I get these errors:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

My code:

typedef struct tagMuHistogram
{
	int		nWidth;
	int		nHeight;
	int		nOffset;
	int		mushift;
	int		dim;
	float	**muData;
} MU_HIST;

typedef struct {
	int imageW, imageH, imageMarginSize;

	int calcAddr( int x, int y ) {
		return ((y-imageMarginSize) * (imageW-imageMarginSize*2) + (x-imageMarginSize));
	}

	void setHogPattern( int x, int y, int val ) {
		pHogPatterns[ calcAddr(x,y) ] = val;
	}

	void setHogValue( int x, int y, int val ) {
		pHogValues[ calcAddr(x,y) ] = val;
	}
} MakeMuHistContext;

void calLBP(IplImage *img, LPMU_HIST muHist, int offset, ADJUSTMKHIST adjust, MakeMuHistContext *pCtx, int musize)
{
unsigned char	*imgPtr, *imgAbovePtr, *imgBelowPtr;
float ** muDATA = muHist->muData;

#pragma acc data copy(muDATA[0:muHist->nWidth * muHist->nHeight][0:muHist->dim], img->imageData[0:c*(h*w)]) create(imgPtr[0:1], imgAbovePtr[0:1], imgBelowPtr[0:1])
{
    #pragma acc parallel loop private(lbpVal, histVal, x,y,dim,imgHeightMinOne, imgWidthMinOne, musize)
    for (y = MARGIN_SIZE; y < img->height - MARGIN_SIZE; y++)
    {
        //#pragma acc parallel loop
        for (x = MARGIN_SIZE; x < img->width - MARGIN_SIZE; x++,imgPtr++)
        {
            imgAbovePtr  = imgPtr - ws;
            imgBelowPtr  = imgPtr + ws;
            
            ....

            muDATA[muIdx][voteIdx] += (float)histVal;

            //if( pCtx!=(MakeMuHistContext *)NULL ) {
            //     pCtx->setHogPattern( x,y, lbpVal );
            //     pCtx->setHogValue( x,y, histVal );
            //}
        }
    }
}
}

Please help me:

  1. copy(muDATA[0:muHist->nWidth * muHist->nHeight][0:muHist->dim]) ===> Is it wrong?
    If I replace muDATA[muIdx][voteIdx] += (float)histVal; with int tmp = muDATA[muIdx][voteIdx];, the code runs, and I don’t know why.

  2. How to copy pCtx, MakeMuHistContext to GPU?

Thank you!

Hi Tony Nguyen,

Can you post or send to PGI Customer Service (trs@pgroup.com) a complete reproducing example? There’s a lot of missing information in this snippet, so it’s difficult to determine the exact problem. I’ll do my best to at least highlight some concerns.

  1. copy(muDATA[0:muHist->nWidth * muHist->nHeight][0:muHist->dim]) ===> Is it wrong?

Syntactically it seems ok, though I don’t know if the values are correct since you don’t show the size of the muHist->muData array.

If I replace muDATA[muIdx][voteIdx] += (float)histVal; with int tmp = muDATA[muIdx][voteIdx];, the code runs. I don’t know why.

I’m not sure either; from what you show, I don’t see anything wrong. However, you don’t show how the index variables are computed. Are they shared or private? Also, can they take the same values on different iterations of the loops? If so, you have a potential race condition and may need to use an “atomic” directive so the values are updated correctly.

How to copy pCtx, MakeMuHistContext to GPU?

Since MakeMuHistContext has a fixed size (i.e., there are no dynamic data members), you should be able to just copy it to the device via its pointer, similar to how you copy imgPtr. However, you may also need to decorate the member functions with an OpenACC routine directive if the compiler doesn’t implicitly create the device routines for you.

What I’m not sure about are the “pHogPatterns” and “pHogValues” arrays. Are these global arrays, or data members that you didn’t list? If they are global arrays, then you need to use a “declare” directive to create the global data on the device as well. If they are data members, then my previous statement is incorrect: for pCtx, you need to perform a manual deep copy of the dynamic data members.

You may consider compiling with “-ta=tesla:managed” so the compiler will use CUDA Unified Memory. All dynamically allocated data will then be managed by the CUDA runtime, which creates a virtual memory address accessible on both the host and device, so you don’t need to worry about deep copies or data movement of your pointers. You can then go back later and manually manage the data once you have the code running correctly.
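For example (the source file name here is hypothetical):

```shell
# "tesla", not "telsa"; "managed" turns on CUDA Unified Memory,
# and -Minfo=accel shows what the compiler parallelized.
pgcc -fast -ta=tesla:managed -Minfo=accel calLBP.c -o calLBP
```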

Note that from what you show, you have a loop dependency on “imgPtr”, “imgAbovePtr”, and “imgBelowPtr”. Parallelizing these loops as written will cause a data race. You’ll definitely want to privatize “imgAbovePtr” and “imgBelowPtr” instead of making them shared: remove them from the data region and put them in the private clause instead. You don’t show how “imgPtr” is initialized, so I’m not sure of the best path forward for it. If it’s initialized within the “y” loop, then you can simply privatize it; that means the inner loop can’t be parallelized, but that won’t affect correctness, only performance. If “imgPtr” is initialized before the “y” loop, then neither loop can be parallelized, since “imgPtr” needs to be incremented sequentially. Instead, you’ll want to compute the temporary image pointer from the loop indices as offsets from the base imgPtr.

-Mat

Hi Mat Colgrove,

I sent my code to trs@pgroup.com.
Please help me check it.

Can you show me a sample of using OpenACC with a double pointer (float **var), as in my code?

Thank you so much!!!

I’m trying a simpler piece of code, similar to the code above, but it also fails with an error.

typedef struct tagMuHistogram
{
	int		nWidth;
	int		nHeight;
	int		nOffset;
	int		mushift;
	int		dim;
	float	**muData;
}
MU_HIST,  *LPMU_HIST;


bool allocMuHist(int width, int height, int dim, LPMU_HIST muHist, int mushift)
{
	char errorMsgBuf[512];
	
	muHist->nWidth  = width  >> mushift;
	muHist->nHeight = height >> mushift;
	muHist->mushift = mushift;
	muHist->nOffset = 0;

	int		i;
	int		muArea = muHist->nWidth * muHist->nHeight;

	muHist->muData = (float**)malloc(sizeof(float*) * muArea);
	
	for( i = 0; i < muArea; i++ ){
		muHist->muData[i] = (float*)malloc(sizeof(float) * dim);
	}

	return true;
}

void clearMuHist(int dim, LPMU_HIST	muHist)
{
	int	i;
	int muArea = muHist->nWidth * muHist->nHeight;
	for( i = 0; i < muArea; i++ ){
		memset(muHist->muData[i], 0, sizeof(float) * dim);
	}
	muHist->nOffset = 0;
	return;
}

int main()
{
    int w = img->width;
    int h = img->height;
    int c = img->nChannels;
    int ws = img->widthStep;
    float ** muDATA = muHist->muData;
    int wHist = muHist->nWidth;
    int hHist = muHist->nHeight;
    int dHist = muHist->dim;
    allocMuHist(nWidthFit, nHeightFit, muHist->dim, muHist, mushift);
    clearMuHist(muHist->dim, muHist); 

    
#pragma acc data create(muHist[0:1]) copy(muHist->muData[0:wHist * hHist][0:dHist], img->imageData[0:c*(h*w)]) 
{
    #pragma acc parallel loop
    for (y = 5; y < 50; y++)
    {
        // #pragma acc parallel loop
        for (x = 5; x < 50; x++)
        {
           muHist->muData[7][38] = 0.23;
            
        }
    }
}
    #pragma endregion
    std::cout << "EXIT FOR LOOP" << std::endl;
}

Hi Tony Nguyen,

I’ve responded to your note to PGI Customer Service; however, the code you sent was incomplete. I did my best to answer (much the same as what I noted above), but I would need a complete reproducing example to offer better advice.

The code example above is also incomplete, missing declarations for several variables as well as the include headers, so it does not compile even without OpenACC. Could you please repost a complete example which reproduces the issue?

Thanks,
Mat