PGI Accelerator on NVIDIA S1070 and S2050 Fermi

Dear Mat,

We have some code being accelerated on an NVIDIA S1070 GPU (compute capability 1.3) using PGI 10.9 accelerator directives (we are not using PGI 11.1 because it gives us internal compiler errors for some reason; the code compiles fine with 10.9).

The NVIDIA driver we are using is for CUDA 3.1, since the PGI 10.9 manual does not yet certify the compiler to run on a CUDA 3.2 driver.

The code runs fine and produces correct results on the S1070 GPU.

Next we wanted to run the code on an S2050 GPU (Fermi, compute capability 2.0). We noticed that, by default, the compiler creates a compute capability 1.3 binary even though we are running on a Fermi. To override the default, we use the cc20 flag (“-ta=nvidia,time,cuda3.0,cc20”), which produces a compute capability 2.0 binary.

If we don’t use the cc20 flag on Fermi, the code produces totally bizarre results.

When we use the cc20 flag on Fermi, the code runs but produces results that are a bit off from what we are expecting (the run on the S1070 produces the correct expected results, though).

Are there any special settings needed on the PGI side to get things working correctly on Fermi, other than compiling with the cc20 option?

I found this PGI posting by Michael Wolfe and that’s where I got the cc20 option from:
http://www.pgroup.com/lit/articles/insider/v2n2a1.htm

We have NVIDIA involved in this as well but no answers so far.

Thank you for your help.

Mohamad Sindi

sindimo,

In my experience, when the GPU results are a bit off, the first thing to try is adding the nofma option to your -ta/-Mcuda option list. I’ve found that, on Teslas at least, nofma helps with accuracy, though at the cost of some performance. (NB: I don’t have access to a Fermi plus PGI compiler yet, so I’m not sure whether nofma has as large an effect, if any, there.) I’m also not sure what effect -Kieee would have on your code, but you might want to try it as well.

Also, the example -ta settings you list include “cuda3.0”. Is there a reason you are using that rather than “cuda3.1”? If nothing else, I seem to recall the cuda3.1 PTX assembler is faster, which is nice.

If these don’t work, real/PGI Mat will know more.

Matt

Thanks Matt for your feedback.

We already tried the nofma option (we looked it up in the PGI manual) and it didn’t help, and we’re already using the -Kieee flag during compilation. Using nofma also slowed the run down a bit.

As for 3.0 vs. 3.1: the PGI compiler seems to misbehave during compilation when I use cuda3.1 with cc20, while cuda3.0 works better. See below: the registers, shared memory, etc. are reported as zeros with 3.1, while with 3.0 the compiler shows the correct values. Both binaries still run and produce the same results, but the 3.0 run seems slightly faster than 3.1 for some reason.

3.0

Accelerator kernel generated
        278, !$acc do vector(32)
        283, !$acc do parallel
             Cached references to size [32] block of 'jeven'
             Cached references to size [32] block of 'jodd'
             CC 2.0 : 62 registers; 1028 shared, 960 constant, 592 local memory bytes; 16 occupancy

3.1

   Accelerator kernel generated
        278, !$acc do vector(32)
        283, !$acc do parallel
             Cached references to size [32] block of 'jeven'
             Cached references to size [32] block of 'jodd'
             CC 2.0 : 0 registers; 0 shared, 0 constant, 0 local memory bytes; 16 occupancy

Thanks

Hi Mohamad Sindi,

It sounds like there are multiple points of failure, so it would be best if you could send a report with a reproducing example to PGI Customer Service (trs@pgroup.com). Ask them to forward the mail to me.

We noticed that when you compile the code, the compiler by default creates a binary that is compute capability 1.3 even though we are running on a Fermi.

The default is to produce multiple compute capabilities. It should have produced cc13 and cc20.

Thanks,
Mat