PGI Accelerator on NVIDIA S1070 and S2050 Fermi

Dear Mat,

We have some code being accelerated on an NVIDIA S1070 GPU (compute capability 1.3) using PGI 10.9 accelerator directives (we are not using PGI 11.1 because it gives us internal compiler errors for some reason; the code compiles fine with 10.9).

The NVIDIA driver we are using is for CUDA 3.1, since the PGI 10.9 manual does not yet certify the compiler to run on a CUDA 3.2 driver.

The code runs fine and produces correct results on the S1070 GPU.

Next we wanted to run the code on an S2050 GPU (Fermi, compute capability 2.0). We noticed that, by default, the compiler creates a compute capability 1.3 binary even though we are running on a Fermi. To override the default, we use the cc20 flag (“-ta=nvidia,time,cuda3.0,cc20”), which produces a compute capability 2.0 binary.

If we don’t use the cc20 flag on Fermi, the code produces totally bizarre results.

When we use the cc20 flag on Fermi, the code runs but produces results that are a bit off from what we are expecting (the run on the S1070 produces the correct expected results, though).

Are there any special settings needed on the PGI side to get things working correctly on Fermi, other than compiling with the cc20 option?

I found this PGI posting by Michael Wolfe and that’s where I got the cc20 option from:
http://www.pgroup.com/lit/articles/insider/v2n2a1.htm

We have NVIDIA involved in this as well but no answers so far.

Thank you for your help.

Mohamad Sindi

sindimo,

In my experience, when the GPU results are a bit off, the first thing to try is adding the nofma option to your -ta/-Mcuda option list. I’ve found that, on Teslas at least, nofma helps with accuracy, though at the cost of some performance. (NB: I don’t have access to a Fermi plus PGI compiler yet, so I’m not sure whether nofma has as large an effect, if any, there.) I’m also not sure what effect -Kieee would have on your code, but you might want to try it as well.

Also, the example -ta settings you list include “cuda3.0”. Is there a reason you are using that rather than “cuda3.1”? If nothing else, I seem to recall the cuda3.1 PTX assembler is faster, which is nice.

If these don’t work, real/PGI Mat will know more.

Matt

Thanks Matt for your feedback.

We already tried the nofma option (we looked it up in the PGI manual) and it didn’t help, and we’re already using the -Kieee flag during compilation. Using nofma also slowed the run down a bit.

As for 3.0 vs. 3.1: the PGI compiler seems to misbehave during compilation when I use cuda3.1 with cc20, while cuda3.0 works better. See below: the registers, shared memory, etc. are reported as zeros with 3.1, while with 3.0 the compiler shows the correct values. Both binaries still run and produce the same results, but the 3.0 run seems slightly faster than 3.1 for some reason.

3.0

Accelerator kernel generated
        278, !$acc do vector(32)
        283, !$acc do parallel
             Cached references to size [32] block of 'jeven'
             Cached references to size [32] block of 'jodd'
             CC 2.0 : 62 registers; 1028 shared, 960 constant, 592 local memory bytes; 16 occupancy

3.1

   Accelerator kernel generated
        278, !$acc do vector(32)
        283, !$acc do parallel
             Cached references to size [32] block of 'jeven'
             Cached references to size [32] block of 'jodd'
             CC 2.0 : 0 registers; 0 shared, 0 constant, 0 local memory bytes; 16 occupancy

Thanks

Hi Mohamad Sindi,

It sounds like there are multiple points of failure, so it would be best if you could send a report with a reproducing example to PGI Customer Service (trs@pgroup.com). Ask them to forward the mail to me.

We noticed that when you compile the code, the compiler by default creates a binary that is compute capability 1.3 even though we are running on a Fermi.

The default is to produce multiple compute capabilities. It should have produced cc13 and cc20.

Thanks,
Mat