Hi Mat,
Thanks for your suggestions about obtaining pgi/20.4. It turns out that my previous performance issue came from a user error, and I can now reproduce the expected GPU performance with pgi/20.4.
Unfortunately, the same OpenACC code compiled with nvhpc/23.7 still runs about 2x slower than with pgi/20.4 on an NVIDIA V100 GPU. Below is an example of a GPU kernel that performs poorly when built with nvhpc/23.7:
#pragma acc parallel loop gang vector collapse(2) vector_length(32) private(tmp,kB,xk,k,l,m,iIndex,jIndex,kIndex) //[5.0.1]
for (iIndex = 0; iIndex < iDim; iIndex++) {
  for (jIndex = 0; jIndex < jDim; jIndex++) {
    #pragma acc loop vector
    for (k = 0; k < kDim; k++) {
      kIndex = INDEX(iIndex, jIndex, k, iDim, jDim, varDim, var);
      kB[k] = Bstate[kIndex];
    }
    // Multiply by gamma
    for (m = 0; m < kRankVar; m++) {
      //bk[m] = 0;
      tmp = 0;
      for (k = 0; k < kDim; k++) {
        tmp += kGamma[var][kDim * m + k] * kB[k];
      }
      // Solve for A's using compact storage
      for (l = -1; l >= -(kLDim - 1); l--) {
        if ((m + l >= 0) and ((m * kLDim - l) >= 0)) {
          tmp -= kL[var][m * kLDim - l] * xk[m + l];
        }
      }
      xk[m] = tmp / kL[var][m * kLDim];
    }
    for (k = kRankVar - 1; k >= 0; k--) {
      tmp = xk[k];
      for (l = 1; l <= (kLDim - 1); l++) {
        if ((k + l < kRankVar) and (((k + l) * kLDim + l) < kRankVar * kLDim)) {
          tmp -= kL[var][(k + l) * kLDim + l] * xk[k + l];
        }
      }
      xk[k] = tmp / kL[var][k * kLDim];
    }
    #pragma acc loop vector
    for (k = 0; k < kDim; k++) {
      // Multiply by gammaT
      tmp = 0;
      for (m = 0; m < kRankVar; m++) {
        tmp += kGamma[var][kDim * m + k] * xk[m];
      }
      kIndex = INDEX(iIndex, jIndex, k, iDim, jDim, varDim, var);
      Astate[kIndex] = tmp;
    }
  }
}
Thanks for the suggestion about the -Minfo=accel option. The output from pgi/20.4 looks like:
841, Generating Tesla code
841, #pragma acc loop gang collapse(2) /* blockIdx.x */
842, /* blockIdx.x collapsed */
844, #pragma acc loop vector(32) /* threadIdx.x */
850, #pragma acc loop seq
853, #pragma acc loop vector(32) /* threadIdx.x */
Generating implicit reduction(+:tmp)
857, #pragma acc loop vector(32) /* threadIdx.x */
Generating implicit reduction(+:tmp)
864, #pragma acc loop seq
866, #pragma acc loop vector(32) /* threadIdx.x */
Generating implicit reduction(+:tmp)
875, #pragma acc loop vector(32) /* threadIdx.x */
878, #pragma acc loop seq
while the compilation output from nvhpc/23.7 looks like:
841, Generating implicit firstprivate(kRankVar)
Generating NVIDIA GPU code
841, #pragma acc loop gang, vector(32) collapse(2) /* blockIdx.x threadIdx.x */
842, /* blockIdx.x threadIdx.x collapsed */
844, #pragma acc loop seq
850, #pragma acc loop seq
853, #pragma acc loop seq
857, #pragma acc loop seq
864, #pragma acc loop seq
866, #pragma acc loop seq
875, #pragma acc loop seq
878, #pragma acc loop seq
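(For reference, I generate this compiler feedback with command lines along these lines; the source file name here is just a placeholder and my actual build adds other flags:)
pgc++ -acc -ta=tesla:cc70 -Minfo=accel -O3 -c kernel.cpp   # pgi/20.4
nvc++ -acc -gpu=cc70 -Minfo=accel -O3 -c kernel.cpp        # nvhpc/23.7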
It turns out that for this kernel, nvhpc/23.7 applies vector parallelism to the outermost two (collapsed) loops and runs all of the inner loops sequentially, while pgi/20.4 applies only gang to the collapsed outer loops and vectorizes the inner k loops. I suspect this is where the performance difference comes from. Admittedly, I should not have written the directives this way: the outer parallel loop requests gang vector while the inner loops also request vector, and the two compilers resolve that conflict differently. I should probably request vector at either the outermost or the innermost level only, so the performance gap may be a deficiency in my implementation rather than a compiler issue. The sketch below shows the kind of change I have in mind.
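For example, here is a rough sketch of the restructured directives; it is untested, the loop bodies are elided (they are the same as above), and I have not yet checked whether it actually recovers the pgi/20.4 performance under nvhpc/23.7:

// Sketch only (untested): gang on the collapsed outer loops, vector only on
// the innermost k loops, roughly the gang/vector split that pgi/20.4 chose.
#pragma acc parallel loop gang collapse(2) vector_length(32) private(tmp,kB,xk,k,l,m,iIndex,jIndex,kIndex)
for (iIndex = 0; iIndex < iDim; iIndex++) {
  for (jIndex = 0; jIndex < jDim; jIndex++) {
    #pragma acc loop vector
    for (k = 0; k < kDim; k++) {
      // ... load kB[k] from Bstate, same body as above ...
    }
    // Keep the compact forward/back solves sequential within each gang,
    // which is what both compilers already report for these loops.
    #pragma acc loop seq
    for (m = 0; m < kRankVar; m++) {
      // ... multiply by gamma and forward-solve into xk[m], same body as above ...
    }
    #pragma acc loop seq
    for (k = kRankVar - 1; k >= 0; k--) {
      // ... back-solve for xk[k], same body as above ...
    }
    #pragma acc loop vector
    for (k = 0; k < kDim; k++) {
      // ... multiply by gammaT and store Astate[kIndex], same body as above ...
    }
  }
}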