Problem using barrier: PGC-F-0155-Illegal context for barrier

Greetings PGI,

I am trying to launch a matrix multiplication kernel on a GPU (K20c) from several CPU cores.
I want all the CPU threads to first copy their data to GPU memory, synchronize, and then launch their kernels at the same time. Here is the code:

#include <math.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int main(int argc, char *argv[])
{
	if(argc != 3) {
    		fprintf(stderr,"Use: %s size nIter\n",argv[0]);
    		return -1;
  	}
  	int Dim=atoi(argv[1]);
  	int itr=atoi(argv[2]);
  	if(itr <= 0) {
    		fprintf(stderr,"%s: Invalid nIter (%d)\n",argv[0],itr);
    		return -1;
  	}

      #pragma omp parallel
      {
      	int i=0,j=0,k=0,c=0,d=0;
      	float Matrix1[Dim][Dim], Matrix2[Dim][Dim],Result_ACC[Dim][Dim],Result_OMP[Dim][Dim],Diff=0,sum=0;
      	double tstart, tstop;
      	for (i=0;i<Dim;i++){
      	   /* ... */
      	}

        #pragma acc data pcopyin(Dim,Matrix1[0:Dim][0:Dim], Matrix2[0:Dim][0:Dim])

        #pragma omp barrier

        tstart = omp_get_wtime();

        #pragma acc kernels loop create(Result_ACC[0:Dim][0:Dim],sum)
        for (c = 0; c < Dim; c++) {
           for (d = 0; d < Dim; d++) {
              sum = 0;
              for (k = 0; k < Dim; k++) {
                 sum = sum + Matrix1[c][k]*Matrix2[k][d];
              }
              Result_ACC[c][d] = sum;
           }
        }

        tstop = omp_get_wtime();
        printf ("Thread %d: %G seconds on GPU.\n",omp_get_thread_num(), tstop-tstart);
      }

      return 0;
}

As you can see, I first copy all the data to the GPU (#pragma acc data pcopyin(Dim,Matrix1[0:Dim][0:Dim], Matrix2[0:Dim][0:Dim])), then sync all the threads (#pragma omp barrier) and start my timer. When I compile the code, I get the error PGC-F-0155-Illegal context for barrier (main.c: 41).
The idea is to keep the data transfer overhead out of my timing and to make sure that all threads launch their kernels at the same time.
Basically, I am trying to figure out how the kernel scheduler works. Does it run the kernels launched from different CPU threads in parallel? Specifically, I would like to know how the gangs and vectors are mapped to the SMX units. Any reference with details on kernel scheduling would be a great help.

Thanks a bunch,

Hi Ali,

Let me send this to engineering (TPR#22009) to see if it’s something we can support. Right now, we expect a barrier to follow a host statement and not another pragma. The workaround is to add a bit of host code before the barrier. For example:

        tstart = 0;
        #pragma omp barrier
        tstart = omp_get_wtime();


Sometimes it takes an extra set of eyes.

The problem here is that the “acc data” region needs to be a structured block; otherwise it applies only to the single statement that follows it, which in your code is the barrier. So the better fix is to put “{}” around the data region.

     #pragma acc data pcopyin(Dim,Matrix1[0:Dim][0:Dim], Matrix2[0:Dim][0:Dim])
     {
        #pragma omp barrier
     .... loop ....
     }
- Mat
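
For anyone finding this thread later, here is a minimal sketch of the braced data region described above, pulled out into a function. The function name timed_matmul, the helper wall_seconds, the flat row-major pointer layout, and the copyout clause are assumptions made for illustration and are not from the original post. It assumes a compiler that accepts an OpenMP barrier inside an OpenACC data region, as described in the reply above. Without OpenMP or OpenACC enabled, the pragmas are ignored and the loop simply runs on the host.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: wall-clock time with OpenMP, CPU time otherwise. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical sketch: multiply two n x n row-major matrices and return the
 * seconds spent in the compute loop only. If this is called from inside an
 * OpenMP parallel region, EVERY thread of the team must call it, because it
 * contains a barrier. */
double timed_matmul(int n, const float *restrict a, const float *restrict b,
                    float *restrict result)
{
    double tstart, tstop;
    assert(n > 0);

    /* The braces make the data region a structured block, so it covers the
     * barrier, the timer calls, and the loop, not just the next statement. */
    #pragma acc data copyin(a[0:n*n], b[0:n*n]) copyout(result[0:n*n])
    {
        /* Inputs are on the device here; wait for the other threads. */
        #pragma omp barrier
        tstart = wall_seconds();

        #pragma acc kernels loop independent
        for (int c = 0; c < n; c++) {
            #pragma acc loop independent
            for (int d = 0; d < n; d++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += a[c * n + k] * b[k * n + d];
                result[c * n + d] = sum;
            }
        }

        tstop = wall_seconds();
    }   /* copyout of result happens here, after the timer has stopped */

    return tstop - tstart;
}
```

Each thread would call timed_matmul from inside its "#pragma omp parallel" block with its own matrices. Because the timer starts after the barrier and stops before the closing brace, neither the copy to the device nor the copy back should be counted in the returned time.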

Thanks a lot.