Hi,
I am seeing a performance decrease (about 1.4x) with one of my codes when going from PGI 11.10 to 12.1.
Looking at the compiler feedback for the most time-consuming kernel, I can see that the new version seems to use registers differently:
PGI 11.10:
977, Loop is parallelizable
Accelerator kernel generated
977, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
Cached references to size [5] block of 'tinc'
Using register for 'tfh'
Using register for 'tfm'
Using register for 'fr_land'
Using register for 't_g'
Using register for 'lo_ice'
Using register for 'h_ice'
Using register for 'gz0'
Using register for 'hhl'
Non-stride-1 accesses for array 'grad'
Using register for 'z0m'
Using register for 'tcm'
Using register for 'tp'
Using register for 'lcircterm'
Using register for 'd_pat'
Using register for 'l_pat'
Using register for 'lay'
Using register for 'ps'
Using register for 'qd'
Using register for 'ql'
Using register for 'pr'
Using register for 'frc'
Using register for 'src'
Using register for 'qc'
Non-stride-1 accesses for array 'tinv'
CC 1.3 : 124 registers; 60 shared, 1792 constant, 168 local memory bytes; 9% occupancy
CC 2.0 : 63 registers; 44 shared, 1688 constant, 0 local memory bytes; 25% occupancy
PGI 12.1:
977, Loop is parallelizable
Accelerator kernel generated
977, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
Cached references to size [5] block of 'tinc'
Non-stride-1 accesses for array 'grad'
Non-stride-1 accesses for array 'tinv'
CC 1.3 : 124 registers; 60 shared, 1792 constant, 168 local memory bytes; 9% occupancy
CC 2.0 : 63 registers; 44 shared, 1688 constant, 0 local memory bytes; 25% occupancy
Note that the “Non-stride-1 accesses for array ‘grad’” message should not be an issue here, as this is just a private coefficient array with 4 elements.
Any suggestions on how I could re-activate the previous optimization?
Thanks,
Xavier
Hi Xavier,
Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me? I’ll need the code in order to investigate what’s happening.
Thanks,
Mat
Hi Mat,
I did send a reproducing example to trs@pgroup.com one week ago. Did you receive it?
Xavier
Hi Xavier,
Sorry about that. They got it but didn’t forward it on to me.
I took a look at the code and it appears to me that the performance difference is being caused by the CUDA version being used. We switched from using CUDA 3.2 to CUDA 4.0 as the default device tool chain. I show the following kernel times for the loop at line 977 (Times are in microseconds).
17957 11.10 with CUDA 3.2 (default)
29402 11.10 with CUDA 4.0 (-ta=nvidia,4.0)
28076 12.2 with CUDA 4.0 (default)
17921 12.2 with CUDA 3.2 (-ta=nvidia,3.2)
I also looked at the PGI generated CUDA kernels and see only minor differences. We’ll need to contact NVIDIA since it seems to be an issue with their back end tools… Do you mind if we share your code with them?
FYI, this issue is being tracked as TPR#18489.
Note that CUDA 3.2 does not ship with PGI 2012 so I needed to add a soft link from the “$PGI/2011/cuda/3.2/” directory to “$PGI/2012/cuda/”.
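For reference, the link can be created with something along these lines (a sketch only; adjust the paths for your actual installation):

ln -s $PGI/2011/cuda/3.2 $PGI/2012/cuda/3.2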
Hi,
Sure you can share with them the test code.
Also, there was a second question in this e-mail; would it be possible to have a comment on it?
Thanks,
Xavier
Hi Xavier,
Also, there was a second question in this e-mail; would it be possible to have a comment on it?
I’m assuming you mean the question about the non-stride-1 messages for grad. Here the minfo is correct: for private arrays, one large block of memory is allocated and then partitioned across threads, i.e. thread 0 gets the first four elements, thread 1 the next four, and so forth. Thus the non-stride-1 accesses.
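To make the layout concrete, here is a rough sketch of the addressing involved (the buffer and variable names are illustrative only, not what the compiler actually emits):

! grad_buf: one global array holding every thread's private copy of grad(4).
! Block layout (what is generated): thread tid owns a contiguous chunk,
!   grad_buf(tid*4 + i), i = 1..4
! so adjacent threads in a warp touch addresses 4 elements apart: non-stride-1.
! An interleaved layout,
!   grad_buf((i-1)*nthreads + tid + 1)
! would put adjacent threads on adjacent elements: stride-1 (coalesced).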
I had not really questioned this until now, but since the compiler controls how grad is created, can’t it make the accesses contiguous? I just talked with Michael Wolfe and he agreed to investigate whether this can be improved. I have added TPR#18490 to track this request.
Best Regards,
Mat
Hi Mat,
No, this is fine; the non-stride-1 access is probably not a big issue here.
The question relates to the e-mail I sent you with this code:
- I had an old open TPR#18188 from some time last year with the same test code. If I try to compile this code with -Mcuda, the compiler crashes during compilation with the following error:
PGF90-F-0701-Error reading temp file - nmptrs (/project/s83/lapixa/GPU/tmp/PGI/turb_standalone_2/src/turb_standalone.f90)
PGF90/x86-64 Linux 12.1-0: compilation aborted
Is there any update about TPR#18188?
Thanks,
Xavier
Is there any update about TPR#18188?
It looks like this may have been fixed in 12.2. However, I need to check with our engineers why the TPR is still open. There may be additional issues that they are working on.
With 11.10, I see the original ICE:
pgf90 -ta=nvidia -Mcuda -V11.10 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90 -o ../obj/turb_standalone.o
PGF90-S-0000-Internal compiler error. size_of:bad dtype 1892 (turb_standalone.f90: 1021)
PGF90-S-0000-Internal compiler error. size_of: bad dtype 697 (turb_standalone.f90: 1021)
With 12.1, I see the “temp” file error:
pgf90 -ta=nvidia -Mcuda -V12.1 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90 -o ../obj/turb_standalone.o
PGF90-F-0701-Error reading temp file - lab (turb_standalone.f90)
PGF90/x86-64 Linux 12.1-0: compilation aborted
With 12.2, the code compiles:
pgf90 -ta=nvidia -Mcuda -V12.2 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90 -o ../obj/turb_standalone.o
pgf90 -ta=nvidia -Mcuda -V12.2 -o turb_standalone_1 ../obj/*.o
Hi,
Is there any news concerning the performance issue reported in the first post here?
I took a look at the code and it appears to me that the performance difference is being caused by the CUDA version being used. We switched from using CUDA 3.2 to CUDA 4.0 as the default device tool chain. I show the following kernel times for the loop at line 977 (Times are in microseconds).
17957 11.10 with CUDA 3.2 (default)
29402 11.10 with CUDA 4.0 (-ta=nvidia,4.0)
28076 12.2 with CUDA 4.0 (default)
17921 12.2 with CUDA 3.2 (-ta=nvidia,3.2)
I also looked at the PGI generated CUDA kernels and see only minor differences. We’ll need to contact NVIDIA since it seems to be an issue with their back end tools… Do you mind if we share your code with them?
Did you have a chance to share the code I sent you with NVIDIA, and to get some feedback?
Best regards,
Xavier
Hi Xavier,
Using CUDA 4.1, the time reduces to 23011 microseconds. Not quite as good as CUDA 3.2, but better than 4.0.
Though, in looking at the code, I think this might be a good candidate for the OpenACC Parallel construct. The PGI Accelerator Model, as well as the OpenACC Kernels construct, only works well on tightly nested loops. Since this code has conditional code surrounding the inner loops, the inner loops can’t be parallelized. However, OpenACC “Parallel” will allow you to use a “gang” (in CUDA terms, a block) to parallelize the outer loop and a “vector” (in CUDA terms, the threads in a block) to parallelize the inner loops.
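As a rough illustration, the schedule I have in mind looks something like this (the routine, array names, and loop bounds are made up for the sketch, not taken from your code):

subroutine sketch(a, b, mask, ni, nj)
  integer, intent(in) :: ni, nj
  real, intent(inout) :: a(ni,nj)
  real, intent(in)    :: b(ni,nj)
  logical, intent(in) :: mask(nj)
  integer :: i, j
  !$acc parallel loop gang
  do j = 1, nj                  ! outer loop mapped to gangs (CUDA blocks)
    if (mask(j)) then           ! conditional code around the inner loop is allowed here
      !$acc loop vector
      do i = 1, ni              ! inner loop mapped to vector lanes (CUDA threads)
        a(i,j) = a(i,j) + b(i,j)
      end do
    end if
  end do
end subroutine sketch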
“Parallel” is still in development but let me check where we’re at and I’ll see what I can do with your code.
Hi Mat,
Thanks for the reply. However, I tried on our latest install,
PGI 12.4 + CUDA 4.2, and I got:
_977 : 28962 us (measured with the NVIDIA profiler)
Which is better, but still much slower than what I used to get with 3.2. Which version of the PGI compiler did you use to get this 23011 us?
Concerning your second remark, I am not so sure that one wants to put the vector on the inner loop, as it is not the stride-1 index (for kernel l_911).
For other codes with a similar loop structure, it has so far always been faster to have the gang and vector on the outer loop (using the parallel construct with the Cray compiler), the outer loop corresponding here to the stride-1 index, as in the sketch below. Unfortunately, I don’t yet have an OpenACC version of this code.
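To be concrete, what I mean is a schedule along these lines (just a sketch with made-up names, not my actual code):

subroutine sketch(a, b, ni, nk)
  integer, intent(in) :: ni, nk
  real, intent(inout) :: a(ni,nk)
  real, intent(in)    :: b(ni,nk)
  integer :: i, k
  !$acc parallel loop gang vector
  do i = 1, ni                  ! outer loop over the stride-1 index: gang and vector both here
    do k = 1, nk                ! inner loops run sequentially within each thread
      a(i,k) = a(i,k) + b(i,k)
    end do
  end do
end subroutine sketch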
Best regards,
Xavier
Hi Xavier,
Which version of the PGI compiler did you use to get this 23011 us?
12.4 but the difference is that I used CUDA 4.1. With CUDA 4.2, I also see the 29000.
Concerning your second remark, I am not so sure that one wants to put the vector on the inner loop, as it is not the stride-1 index (for kernel l_911).
I understand, but this is solvable.