Errors: memory copy idiom and unsupported local variables

Hi,

When I compile my program with the PGI 11.9 compiler, I get the following messages:

Compiling /home/rengan/clever/clever_pgi/sources/funcs.c
clever_process:
94, Memory copy idiom, loop replaced by call to __c_mcopy4
getNeighbors:
177, Memory copy idiom, loop replaced by call to __c_mcopy4
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported local variable (/home/rengan/clever/clever_pgi/sources/funcs.c: 218)
findRegions:
221, Loop is parallelizable
Accelerator scalar kernel generated
Accelerator kernel generated
221, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
225, Loop carried scalar dependence for ‘bestRep’ at line 233
Scalar last value needed after loop for ‘bestRep’ at line 239
Scalar last value needed after loop for ‘bestRep’ at line 242
Loop carried scalar dependence for ‘bestDist’ at line 233
Accelerator restriction: scalar variable live-out from loop: bestRep
Inner sequential loop scheduled on accelerator
PGC/x86-64 Linux 11.9-0: compilation completed with warnings

The code around line 94 is:

            currentReps = (int*)malloc(sizeof(int)*512);
94:       for(j=0; j<512; j++)
                     currentReps[j] = bestReps[j];

The code around line 177 is:

               if(delRep < tRepSize - 1)
                {
177:            for(j = delRep; j < tRepSize-1; j++)
                    {
                        nebRepIDs[nebID*512+j] = nebRepIDs[nebID*512+j+1];
                    }
                }

These are just ordinary for loops, and there are no such messages when I compile the serial code, which has no acc directives.

The findRegions() function is:

void findRegions(struct dataPt *datasetBegin, int dataSetSize, int* neighboringTable, int* h_solutionSize, int i, clusterID_type* clusterIDs)
{
       int i_dataPoint, i_rep;
       struct dataPt *pt1, *pt2;
218: #pragma acc region copyin(datasetBegin[0:dataSetSize-1], neighboringTable[0:4096*512-1], h_solutionSize)
     {
         #pragma acc for parallel 
         for(i_dataPoint = 0; i_dataPoint < dataSetSize; i_dataPoint++)
         {
              int bestRep = -1;
              float bestDist = -1;
              for(i_rep = 0; i_rep<h_solutionSize[i]; i_rep++)
              {
                   pt1 = datasetBegin + i_dataPoint;
                   pt2 = datasetBegin + neighboringTable[512*i+i_rep];
                   float dx = pt1->x - pt2->x;
                   float dy = pt1->y - pt2->y;
                   float d = (dx*dx)+(dy*dy);
                   if(d<bestDist || bestRep==-1)
                   {
                        bestRep = i_rep;
                        bestDist = d;
                    }
                 }
                 clusterIDs[i_dataPoint] = bestRep;
                 struct dataPt *tmpPt;
                 tmpPt = datasetBegin + i_dataPoint;
                 tmpPt->clusterID = bestRep;
         }
     }
}

The definitions of struct dataPt and cluster_IDs are as follows:

typedef unsigned char clusterID_type;
struct dataPt {
    float x, y;
    int z;
    clusterID_type clusterID;
};
clusterID_type* cluster_IDs;

Here, does the “Unsupported local variable” refer to the variable datasetBegin, which is a pointer to struct dataPt?

Some compilation flags in Makefile are:
CC = pgcc
CFLAGS = -ta=nvidia,time -Minfo -Msafeptr
OPT = -O3
LD = pgcc
LDFLAGS = -ta=nvidia,time -Minfo -Msafeptr

Thanks for your help.

Hi Silence,

The “Memory copy idiom” messages are just letting you know that the compiler has optimised these sections of host code by replacing your for loops with a call to a fast memory-copy routine (__c_mcopy4). They are informational and are not causing any problems.
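For reference, the loop at line 94 copies a contiguous block of ints, so the rewrite is roughly equivalent to a single memcpy. A minimal host-only sketch (stand-in data, not your program):

#include <stdlib.h>
#include <string.h>

int main(void)
{
    int bestReps[512];
    int *currentReps = (int*)malloc(sizeof(int)*512);
    int j;

    for(j = 0; j < 512; j++)
        bestReps[j] = j;          /* stand-in data */

    /* same effect as: for(j=0; j<512; j++) currentReps[j] = bestReps[j]; */
    memcpy(currentReps, bestReps, 512*sizeof(int));

    free(currentReps);
    return 0;
}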

Your accelerator region has a couple of issues that are causing it to fail. The main one is that pointer manipulation, while being actively worked on, is not yet supported. The “Unsupported local variable” is the local declaration of the tmpPt struct pointer. Also, you may need to give the size of the h_solutionSize array in the copyin clause.

Give the following code a try to see if it works for you:

% cat test.c
typedef unsigned char clusterID_type;
struct dataPt {
    float x, y;
    int z;
    clusterID_type clusterID;
};
clusterID_type* cluster_IDs; 

void findRegions(struct dataPt *datasetBegin, int dataSetSize, int* neighboringTable, int* h_solutionSize, int i, clusterID_type* clusterIDs)
{
       int i_dataPoint, i_rep;
#pragma acc region copyin(datasetBegin[0:dataSetSize-1], neighboringTable[0:4096*512-1], h_solutionSize[0:dataSetSize-1])
     {
         #pragma acc for parallel
         for(i_dataPoint = 0; i_dataPoint < dataSetSize; i_dataPoint++)
         {
              int bestRep = -1;
              float bestDist = -1;
              for(i_rep = 0; i_rep<h_solutionSize[i]; i_rep++)
              {
                   float dx = datasetBegin[i_dataPoint].x - datasetBegin[neighboringTable[512*i+i_rep]].x;
                   float dy = datasetBegin[i_dataPoint].y - datasetBegin[neighboringTable[512*i+i_rep]].y;
                   float d = (dx*dx)+(dy*dy);
                   if(d<bestDist || bestRep==-1)
                   {
                        bestRep = i_rep;
                        bestDist = d;
                    }
                 }
                 clusterIDs[i_dataPoint] = bestRep;
         }
     }
}
  • Mat

Thanks very much. It compiles successfully now, but I get a segmentation fault at run time. Can we use the PGI debugger PGDBG to debug such a directive-based program?

Can we use the PGI debugger PGDBG to debug such a directive-based program?

Not yet. We’re working on adding debugging support to CUDA Fortran, but adding it to directives will be a lot harder. The major obstacle is that a considerable amount of transformation occurs, which makes it difficult to correlate the generated code back to the original source. It would be equivalent to debugging highly optimised assembly code, which for most users would be extremely difficult. We’re working on it, though.

So to debug this code, I’d start with the host version. Make sure there are no out-of-bounds accesses, uses of uninitialized memory, etc. These errors may or may not cause problems in host code, but they will in GPU code. I find valgrind a useful tool here.
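For example, here’s a minimal sketch (illustrative only, not taken from your program) of the kind of off-by-one heap write that often runs cleanly on the host but that valgrind will flag as an invalid write:

#include <stdlib.h>

int main(void)
{
    int n = 512;
    int *reps = (int*)malloc(sizeof(int)*n);
    int j;

    /* off-by-one: j <= n writes reps[n], one element past the allocation;
       valgrind reports this as an invalid write of size 4 */
    for(j = 0; j <= n; j++)
        reps[j] = j;

    free(reps);
    return 0;
}

Running the host build under “valgrind ./a.out” will point at the offending line.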

Next, look at your data copies. Since it’s a segv, most likely one of the array bounds is wrong. Take a look at the -Minfo=accel messages and check whether the bounds the compiler is using match how the array was declared.
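For example, here’s a hedged sketch (illustrative names, not your code) of bounds that match the host allocations, using the same [lower:upper] form as your region; if either upper bound went past n-1, the copy would read or write beyond the host array and could segfault:

#include <stdlib.h>

/* ids[k] gets 1 when dist[k] is below a cutoff; both arrays hold n elements,
   so both bounds run 0 .. n-1, matching their allocations exactly */
void flag_close(float *dist, unsigned char *ids, int n)
{
    int k;
#pragma acc region copyin(dist[0:n-1]) copyout(ids[0:n-1])
    {
        #pragma acc for parallel
        for(k = 0; k < n; k++)
            ids[k] = (dist[k] < 1.0f) ? 1 : 0;
    }
}

int main(void)
{
    int n = 4096;
    float *dist = (float*)malloc(sizeof(float)*n);
    unsigned char *ids = (unsigned char*)malloc(sizeof(unsigned char)*n);
    int k;

    for(k = 0; k < n; k++)
        dist[k] = (float)k / n;

    flag_close(dist, ids, n);

    free(dist);
    free(ids);
    return 0;
}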

If you were getting a CUDA launch error, then it would more likely be an error in the kernel itself. In that case, start commenting out lines until the error goes away. This should isolate the line that’s causing the problem and at least give you an idea of why the error is occurring.

  • Mat