Hi,
I am seeing a performance decrease (about 1.4x) with one of my codes when going from PGI 11.10 to 12.1.
Looking at the compiler feedback for the most time-consuming kernel, I can see that the new version seems to use registers differently:
PGI 11.10:
977, Loop is parallelizable
Accelerator kernel generated
977, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
Cached references to size [5] block of 'tinc'
Using register for 'tfh'
Using register for 'tfm'
Using register for 'fr_land'
Using register for 't_g'
Using register for 'lo_ice'
Using register for 'h_ice'
Using register for 'gz0'
Using register for 'hhl'
Non-stride-1 accesses for array 'grad'
Using register for 'z0m'
Using register for 'tcm'
Using register for 'tp'
Using register for 'lcircterm'
Using register for 'd_pat'
Using register for 'l_pat'
Using register for 'lay'
Using register for 'ps'
Using register for 'qd'
Using register for 'ql'
Using register for 'pr'
Using register for 'frc'
Using register for 'src'
Using register for 'qc'
Non-stride-1 accesses for array 'tinv'
CC 1.3 : 124 registers; 60 shared, 1792 constant, 168 local memory bytes; 9% occupancy
CC 2.0 : 63 registers; 44 shared, 1688 constant, 0 local memory bytes; 25% occupancy
PGI 12.1:
977, Loop is parallelizable
Accelerator kernel generated
977, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
Cached references to size [5] block of 'tinc'
Non-stride-1 accesses for array 'grad'
Non-stride-1 accesses for array 'tinv'
CC 1.3 : 124 registers; 60 shared, 1792 constant, 168 local memory bytes; 9% occupancy
CC 2.0 : 63 registers; 44 shared, 1688 constant, 0 local memory bytes; 25% occupancy
Note that the “Non-stride-1 accesses for array ‘grad’” message should not be an issue here, as this is just a private coefficient array with 4 elements.
Any suggestions on how I could re-activate the previous optimization?
Thanks,
Xavier
Hi Xavier,
Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me? I’ll need the code in order to investigate what’s happening.
Thanks,
Mat
Hi Mat,
I did send a reproducing example to trs@pgroup.com one week ago. Did you receive it?
Xavier
Hi Xavier,
Sorry about that. They got it but didn’t forward it on to me.
I took a look at the code and it appears to me that the performance difference is being caused by the CUDA version being used. We switched from using CUDA 3.2 to CUDA 4.0 as the default device tool chain. I show the following kernel times for the loop at line 977 (Times are in microseconds).
17957 11.10 with CUDA 3.2 (default)
29402 11.10 with CUDA 4.0 (-ta=nvidia,4.0)
28076 12.2 with CUDA 4.0 (default)
17921 12.2 with CUDA 3.2 (-ta=nvidia,3.2)
I also looked at the PGI generated CUDA kernels and see only minor differences. We’ll need to contact NVIDIA since it seems to be an issue with their back end tools… Do you mind if we share your code with them?
FYI, this issue is being tracked as TPR#18489.
Note that CUDA 3.2 does not ship with PGI 2012 so I needed to add a soft link from the “$PGI/2011/cuda/3.2/” directory to “$PGI/2012/cuda/”.
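For reference, the link can be created with something along these lines (a sketch only; adjust the paths for your actual installation):

ln -s $PGI/2011/cuda/3.2 $PGI/2012/cuda/3.2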
Hi,
Sure you can share with them the test code.
Also, there was a second question in this e-mail; would it be possible to have a comment on it?
Thanks,
Xavier
Hi Xavier,
Also, there was a second question in this e-mail; would it be possible to have a comment on it?
I’m assuming you mean the question about the non-stride-1 messages for grad. Here the minfo is correct: for private arrays, one large block of memory is allocated and then partitioned across threads, i.e. thread 0 gets the first four elements, thread 1 the next four, and so forth. Thus the non-stride-1 accesses.
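To make the layout concrete, here is a rough sketch of the addressing involved (the buffer and variable names are illustrative only, not what the compiler actually emits):

! grad_buf: one global array holding every thread's private copy of grad(4).
! Block layout (what is generated): thread tid owns a contiguous chunk,
!   grad_buf(tid*4 + i), i = 1..4
! so adjacent threads in a warp touch addresses 4 elements apart: non-stride-1.
! An interleaved layout,
!   grad_buf((i-1)*nthreads + tid + 1)
! would put adjacent threads on adjacent elements: stride-1 (coalesced).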
I had not really questioned this until now, but since the compiler controls how grad is created, can’t it make the accesses contiguous? I just talked with Michael Wolfe and he agreed to investigate whether this can be improved. I have added TPR#18490 to track this request.
Best Regards,
Mat
Hi Mat,
No, this is fine; the non-stride-1 access is probably not a big issue here.
The question relates to the e-mail I sent you with this code:
- I had an old open TPR#18188 from some time last year with the same test code. If I try to compile this code with -Mcuda, the compiler crashes during compilation with the following error:
PGF90-F-0701-Error reading temp file - nmptrs (/project/s83/lapixa/GPU/tmp/PGI/turb_standalone_2/src/turb_standalone.f90)
PGF90/x86-64 Linux 12.1-0: compilation aborted
Is there any update about TPR#18188?
Thanks,
Xavier
Is there any update about TPR#18188?
It looks like this may have been fixed in 12.2. However, I need to check with our engineers why the TPR is still open. There may be additional issues that they are working on.
With 11.10, I see the original ICE:
pgf90 -ta=nvidia -Mcuda -V11.10 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90 -o ../obj/turb_standalone.o
PGF90-S-0000-Internal compiler error. size_of:bad dtype 1892 (turb_standalone.f90: 1021)
PGF90-S-0000-Internal compiler error. size_of: bad dtype 697 (turb_standalone.f90: 1021)
With 12.1, I see the “temp” file error:
pgf90 -ta=nvidia -Mcuda -V12.1 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90 -o ../obj/turb_standalone.o
PGF90-F-0701-Error reading temp file - lab (turb_standalone.f90)
PGF90/x86-64 Linux 12.1-0: compilation aborted
With 12.2, the code compiles:
pgf90 -ta=nvidia -Mcuda -V12.2 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90 -o ../obj/turb_standalone.o
pgf90 -ta=nvidia -Mcuda -V12.2 -o turb_standalone_1 ../obj/*.o
Hi,
Is there any news concerning the performance issue reported in the first post here?
I took a look at the code and it appears to me that the performance difference is being caused by the CUDA version being used. We switched from using CUDA 3.2 to CUDA 4.0 as the default device tool chain. I show the following kernel times for the loop at line 977 (Times are in microseconds).
17957 11.10 with CUDA 3.2 (default)
29402 11.10 with CUDA 4.0 (-ta=nvidia,4.0)
28076 12.2 with CUDA 4.0 (default)
17921 12.2 with CUDA 3.2 (-ta=nvidia,3.2)
I also looked at the PGI generated CUDA kernels and see only minor differences. We’ll need to contact NVIDIA since it seems to be an issue with their back end tools… Do you mind if we share your code with them?
Did you have a chance to share the code I sent you with NVIDIA, and to get some feedback?
Best regards,
Xavier
Hi Xavier,
Using CUDA 4.1, the time reduces to 23011 microseconds. Not quite as good as CUDA 3.2, but better than 4.0.
Though, in looking at the code, I think this might be a good candidate for the OpenACC Parallel construct. The PGI Accelerator Model, as well as the OpenACC Kernels construct, only works well on tightly nested loops. Since this code has conditional code surrounding the inner loops, the inner loops can’t be parallelized. However, OpenACC “Parallel” will allow you to use a “gang” (in CUDA terms, a block) to parallelize the outer loop and a “vector” (in CUDA terms, the threads in a block) to parallelize the inner loops.
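As a rough illustration, the schedule I have in mind looks something like this (the routine, array names, and loop bounds are made up for the sketch, not taken from your code):

subroutine sketch(a, b, mask, ni, nj)
  integer, intent(in) :: ni, nj
  real, intent(inout) :: a(ni,nj)
  real, intent(in)    :: b(ni,nj)
  logical, intent(in) :: mask(nj)
  integer :: i, j
  !$acc parallel loop gang
  do j = 1, nj                  ! outer loop mapped to gangs (CUDA blocks)
    if (mask(j)) then           ! conditional code around the inner loop is allowed here
      !$acc loop vector
      do i = 1, ni              ! inner loop mapped to vector lanes (CUDA threads)
        a(i,j) = a(i,j) + b(i,j)
      end do
    end if
  end do
end subroutine sketch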
“Parallel” is still in development but let me check where we’re at and I’ll see what I can do with your code.
Hi Mat,
Thanks for the reply. However, I tried on our latest install,
PGI 12.4 + CUDA 4.2, and I got:
_977 : 28962 us (measured with the NVIDIA profiler)
Which is better, but still much slower than what I used to get with 3.2. Which version of the PGI compiler did you use to get this 23011 us?
Concerning your second remark, I am not so sure that one wants to put the vector on the inner loop, as it is not the stride-1 index (for kernel l_911).
For other codes with a similar loop structure, it has so far always been faster to have the gang and vector on the outer loop (using the parallel construct with the Cray compiler), the outer loop corresponding here to the stride-1 index, as in the sketch below. Unfortunately, I don’t yet have an OpenACC version of this code.
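To be concrete, what I mean is a schedule along these lines (just a sketch with made-up names, not my actual code):

subroutine sketch(a, b, ni, nk)
  integer, intent(in) :: ni, nk
  real, intent(inout) :: a(ni,nk)
  real, intent(in)    :: b(ni,nk)
  integer :: i, k
  !$acc parallel loop gang vector
  do i = 1, ni                  ! outer loop over the stride-1 index: gang and vector both here
    do k = 1, nk                ! inner loops run sequentially within each thread
      a(i,k) = a(i,k) + b(i,k)
    end do
  end do
end subroutine sketch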
Best regards,
Xavier
Hi Xavier,
Which version of the PGI compiler did you use to get this 23011 us?
12.4 but the difference is that I used CUDA 4.1. With CUDA 4.2, I also see the 29000.
Concerning your second remark, I am not so sure that one wants to put the vector on the inner loop, as it is not the stride-1 index (for kernel l_911).
I understand, but this is solvable.