License issue when using pgi/20.4 compiler

Hi,

Recently I tried to reproduce the GPU performance from a previous paper but I had some issues with it. One suspicion I had was the compiler version (the paper used pgi/20.4 while I am using nvhpc/23.7 now). When I installed pgi/20.4 and tried to use it, I got the following error:

NOTE: your trial license has expired.  To obtain a permanent license, contact sales@pgroup.com, your local PGI reseller, or for more purchase options, go to:      http://www.pgroup.com/purchase/index.htm If you need additional time to evaluate the PGI compilers and tools to make a purchasing decision, send a request for an evaluation extension to sales@pgroup.com.

I understand that the PGI compiler used to require a paid license and that NVHPC is now free. Is there a way I could bypass this PGI license issue here?

Thanks,
Jian

Hi Jian,

Is there a way I could bypass this PGI license issue here?

You can’t bypass the license, and I no longer have the ability to create temporary licenses.

Licenses for the legacy PGI compilers are available for folks on HPC Compiler Support (See the bottom of this page for details: High Performance Computing HPC SDK | NVIDIA Developer). I’d suggest sending a note to Enterprise Services (enterpriseservices@nvidia.com) to see if they can help.

Recently I tried to reproduce the GPU performance from a previous paper but I had some issues with it.

I might be able to help here. What are the issues?

-Mat

Hi Mat,

Thanks for your quick reply. I will forward the instructions to our HPC team.

About my issue: the code is written in C++ and uses OpenACC for GPU offload. In our paper years ago, the OpenACC code (compiled with pgi/20.4) ran in about 5 minutes for a test case on a single NVIDIA V100 GPU. Now the same OpenACC code (compiled with nvhpc/23.7) takes about 1.2 hours for the same test case on a single NVIDIA V100 GPU. Note that many other dependent libraries (including CUDA) have also changed, so my suspicion about the compiler version could simply be a red herring, but I want to rule out some potential culprits first. Sorry, I am still at an early stage of debugging this issue, so I can’t provide much information yet. Let me know if you are aware of any known performance hit between the legacy PGI compiler and the recent NVHPC compiler. Thanks.

With that dramatic of a difference, the code is likely not being offloaded at all, or some kernels aren’t getting parallelized and are running serially on the device.

I’d check the compiler feedback messages by adding “-Minfo=accel” and make sure everything is getting parallelized. If you can share the source, I can take a look as well.
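For example, with a trivial kernel like the sketch below (purely illustrative, not your code), the feedback shows whether GPU code was generated and how each loop was scheduled; any loop reported as “#pragma acc loop seq” runs sequentially on the device:

    // Illustrative sketch only: compile with "nvc++ -acc -Minfo=accel"
    // (or "pgc++ -ta=tesla -Minfo=accel" for the legacy PGI compilers)
    // and check the feedback for "Generating ... GPU code" plus a
    // gang/vector schedule for the loop.
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0);
        double *pa = a.data(), *pb = b.data();

        // -Minfo=accel should report a gang/vector schedule for this loop
        #pragma acc parallel loop copy(pa[0:n]) copyin(pb[0:n])
        for (int i = 0; i < n; ++i)
            pa[i] += 2.0 * pb[i];

        return 0;
    }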

Thanks Mat. I will check the -Minfo=accel output as suggested and see if I find anything suspicious later.

Hi Mat,

Thanks for your suggestions about obtaining pgi/20.4. It turns out that my previous performance issue came from a user error, and I was able to reproduce the paper’s GPU performance with pgi/20.4.

Unfortunately, the nvhpc/23.7 compiler still runs about 2x slower than pgi/20.4 for the same OpenACC code on an NVIDIA V100 GPU. Below is an example of a GPU kernel that performs poorly with nvhpc/23.7:

    #pragma acc parallel loop gang vector collapse(2) vector_length(32) private(tmp,kB,xk,k,l,m,iIndex,jIndex,kIndex) //[5.0.1]
    for (iIndex = 0; iIndex < iDim; iIndex++) {
      for (jIndex = 0; jIndex < jDim; jIndex++) {
        #pragma acc loop vector
        for (k = 0; k < kDim; k++) {
          kIndex = INDEX(iIndex, jIndex, k, iDim, jDim, varDim, var);
          kB[k] = Bstate[kIndex];
        }

        // Multiply by gamma
        for (m = 0; m < kRankVar; m++) {
          //bk[m] = 0;
          tmp = 0;
          for (k = 0; k < kDim; k++) {
            tmp += kGamma[var][kDim * m + k] * kB[k];
          }
          // Solve for A's using compact storage
          for (l = -1; l >= -(kLDim - 1); l--) {
            if ((m + l >= 0) and ((m * kLDim - l) >= 0)) {
              tmp -= kL[var][m * kLDim - l] * xk[m + l];
            }
          }
          xk[m] = tmp / kL[var][m * kLDim];
        }
        for (k = kRankVar - 1; k >= 0; k--) {
          tmp = xk[k];
          for (l = 1; l <= (kLDim - 1); l++) {
            if ((k + l < kRankVar) and (((k + l) * kLDim + l) < kRankVar * kLDim)) {
              tmp -= kL[var][(k + l) * kLDim + l] * xk[k + l];
            }
          }
          xk[k] = tmp / kL[var][k * kLDim];
        }

        #pragma acc loop vector
        for (k = 0; k < kDim; k++) {
          // Multiply by gammaT
          tmp = 0;
          for (m = 0; m < kRankVar; m++) {
            tmp += kGamma[var][kDim * m + k] * xk[m];
          }
          kIndex = INDEX(iIndex, jIndex, k, iDim, jDim, varDim, var);
          Astate[kIndex] = tmp;
        }
      }
    }

Thanks for the suggestion about the -Minfo=accel option. The output from pgi/20.4 looks like:

    841, Generating Tesla code
        841, #pragma acc loop gang collapse(2) /* blockIdx.x */
        842,   /* blockIdx.x collapsed */
        844, #pragma acc loop vector(32) /* threadIdx.x */
        850, #pragma acc loop seq
        853, #pragma acc loop vector(32) /* threadIdx.x */
             Generating implicit reduction(+:tmp)
        857, #pragma acc loop vector(32) /* threadIdx.x */
             Generating implicit reduction(+:tmp)
        864, #pragma acc loop seq
        866, #pragma acc loop vector(32) /* threadIdx.x */
             Generating implicit reduction(+:tmp)
        875, #pragma acc loop vector(32) /* threadIdx.x */
        878, #pragma acc loop seq

while the compilation output from nvhpc/23.7 looks like:

    841, Generating implicit firstprivate(kRankVar)
         Generating NVIDIA GPU code
        841, #pragma acc loop gang, vector(32) collapse(2) /* blockIdx.x threadIdx.x */
        842,   /* blockIdx.x threadIdx.x collapsed */
        844, #pragma acc loop seq
        850, #pragma acc loop seq
        853, #pragma acc loop seq
        857, #pragma acc loop seq
        864, #pragma acc loop seq
        866, #pragma acc loop seq
        875, #pragma acc loop seq
        878, #pragma acc loop seq

It turns out that for this piece of code, nvhpc/23.7 applies vector to the outermost two (collapsed) loops, while pgi/20.4 applies vector to the innermost loops. I guess this is where the performance difference comes from. Technically I think I should not write the code this way (I should probably put vector either at the outermost or innermost loop only), so the performance difference may be an implementation deficiency on my side rather than a compiler issue.

Technically this code is not OpenACC compliant, as vector loops can’t be nested. In 20.4 it looks like we were ignoring the outer “vector” and applying the inner one, but since then we’ve been working towards being more standard compliant, so 23.7 uses the outer vector and ignores the inner ones.

Hence to replicate the schedule being used in 20.4, you likely just need to remove the “vector” from the “parallel loop gang” directive on the outer loop.
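In other words, the outer directive would look something like this (an untested sketch; the loop bodies stay exactly as in the kernel you posted):

    // Sketch: drop "vector" from the outer gang loop so the inner
    // "acc loop vector" directives take effect again, as in 20.4.
    #pragma acc parallel loop gang collapse(2) vector_length(32) private(tmp,kB,xk,k,l,m,iIndex,jIndex,kIndex)
    for (iIndex = 0; iIndex < iDim; iIndex++) {
      for (jIndex = 0; jIndex < jDim; jIndex++) {

        #pragma acc loop vector
        for (k = 0; k < kDim; k++) {
          // ... same body as the kernel above ...
        }

        // ... remaining inner loops unchanged ...
      }
    }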

(I should probably put vector either at the outermost or innermost loop only)

And that would be the correct way to code this. Though given the vector_length is only 32, what they may have intended was to use “gang worker” on the outer loop so more CUDA threads are used per block. “worker” maps to the y dimension of a thread block while “vector” maps to the x dimension. So if num_workers is 4, you’d get 128 threads per block. I’m not sure it would be more performant, but you might try it.
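As a rough sketch of that alternative (untested; num_workers(4) is just the example value from above, and the loop bodies are unchanged):

    // Sketch: "worker" maps to threadIdx.y and "vector" to threadIdx.x,
    // so num_workers(4) * vector_length(32) = 128 threads per block.
    #pragma acc parallel loop gang worker collapse(2) num_workers(4) vector_length(32) private(tmp,kB,xk,k,l,m,iIndex,jIndex,kIndex)
    for (iIndex = 0; iIndex < iDim; iIndex++) {
      for (jIndex = 0; jIndex < jDim; jIndex++) {

        #pragma acc loop vector
        for (k = 0; k < kDim; k++) {
          // ... same body as the original kernel ...
        }

        // ... remaining inner loops unchanged ...
      }
    }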