PGI 14.1 and K20x Cards: Best -Mcuda flag to use?

In my investigations of using a K20x card, I’ve found the best compilation strategy is:

-Mcuda=nofma,5.0,cc35,ptxinfo -Mcuda=maxregcount:72

When I do this, the code compiles and generates:

pgfortran -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -Kieee -tp=sandybridge-64 -Mcuda=nofma,5.0,cc35,ptxinfo -Mcuda=maxregcount:72  -DNITERS=6 -DBIG -DGPU_PRECISION=8 -c src/sorad.F90
ptxas info    : 3248 bytes gmem, 576 bytes cmem[3]
ptxas info    : Compiling entry function 'soradmod_sorad_' for 'sm_35'
ptxas info    : Function properties for soradmod_sorad_
    18704 bytes stack frame, 1168 bytes spill stores, 1336 bytes spill loads
ptxas info    : Used 72 registers, 344 bytes cmem[0], 280 bytes cmem[2]

leading to these timers:

 ----- Timings ----- 
 Time in Milliseconds 
    Total :   2831.156 +/-      3.349
   Kernel :   2425.110 +/-      1.852
Data Xfer :    382.362 +/-      2.245

I hit upon 72 registers as the best-performing number, so I’ve been forcing that with maxregcount.

I then decided to check whether 14.1 has better or newer settings, to wit:

$ pgfortran -Mcuda=help
...
    emu             Enable emulation mode
    tesla           Compile for Tesla architecture
    tesla+          Compile for Tesla architecture and above
    cc1x            Compile for compute capability 1.x
    cc1+            Compile for compute capability 1.x and above
    fermi           Compile for Fermi architecture
    fermi+          Compile for Fermi architecture and above
    cc2x            Compile for compute capability 2.x
    cc2+            Compile for compute capability 2.x and above
    kepler          Compile for Kepler architecture
    kepler+         Compile for Kepler architecture and above
    cc3x            Compile for compute capability 3.x
    cc3+            Compile for compute capability 3.x and above
    ...

I see that cc35 isn’t here (although it is in the man page), so I wondered is cc35 discouraged? I tried compiling with -Mcuda=kepler, thinking it might detect the card correctly, but I got this:

pgfortran -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -Kieee -tp=sandybridge-64 -Mcuda=nofma,5.0,kepler,ptxinfo -Mcuda=maxregcount:72  -DNITERS=6 -DBIG -DGPU_PRECISION=8 -c src/sorad.F90
ptxas warning : Too big maxrregcount value specified 72, will be ignored
ptxas info    : 3248 bytes gmem, 576 bytes cmem[3]
ptxas info    : Compiling entry function 'soradmod_sorad_' for 'sm_30'
ptxas info    : Function properties for soradmod_sorad_
    18768 bytes stack frame, 1656 bytes spill stores, 1844 bytes spill loads
ptxas info    : Used 63 registers, 344 bytes cmem[0], 284 bytes cmem[2]
...
 ----- Timings ----- 
 Time in Milliseconds 
    Total :   3166.526 +/-      1.995
   Kernel :   2492.151 +/-      1.438
Data Xfer :    650.974 +/-      2.208

As you can see, it targeted cc30 (sm_30) rather than cc35, which led to slower timings. This makes sense, I suppose, since Kepler covers cc30 as well as cc35, but I had thought “kepler” might notice a cc35 card and target it.
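The ptxas warning in the sm_30 build is consistent with the per-thread register limits NVIDIA documents for these targets: 63 registers for compute capability 3.0 and 255 for 3.5. A minimal sketch of that check follows; the table and the helper name effective_reg_cap are illustrative assumptions, not part of any PGI or NVIDIA tool, and only the two targets seen in this thread are listed.

```python
# Per-thread register limits from NVIDIA's compute-capability tables.
# Only the two targets discussed in this thread are listed (assumption:
# extend the table yourself for other architectures).
MAX_REGS_PER_THREAD = {"sm_30": 63, "sm_35": 255}


def effective_reg_cap(target, requested):
    """Return (cap, honored): the register cap in effect for `target`
    and whether the requested maxregcount fits within its limit."""
    limit = MAX_REGS_PER_THREAD[target]
    if requested > limit:
        # Mirrors the ptxas warning: the request is ignored and the
        # architecture limit applies instead.
        return limit, False
    return requested, True


for target in ("sm_30", "sm_35"):
    cap, honored = effective_reg_cap(target, 72)
    print(f"{target}: cap={cap} registers, maxregcount honored={honored}")
```

This matches the two ptxinfo reports: 72 registers used when targeting sm_35, and 63 (with more spill traffic) when targeting sm_30.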

Also, as a note, I’m using CUDA 5.0 here because CUDA 5.5 leads to worse performance:

pgfortran -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -Kieee -tp=sandybridge-64 -Mcuda=nofma,5.5,cc35,ptxinfo -Mcuda=maxregcount:72  -DNITERS=6 -DBIG -DGPU_PRECISION=8 -c src/sorad.F90
ptxas info    : 3259 bytes gmem, 576 bytes cmem[3]
ptxas info    : Compiling entry function 'soradmod_sorad_' for 'sm_35'
ptxas info    : Function properties for soradmod_sorad_
    18584 bytes stack frame, 572 bytes spill stores, 544 bytes spill loads
ptxas info    : Used 72 registers, 344 bytes cmem[0], 276 bytes cmem[2]
...
 ----- Timings ----- 
 Time in Milliseconds 
    Total :   3501.825 +/-      8.523
   Kernel :   2824.677 +/-      8.268
Data Xfer :    653.633 +/-      0.822

Looks like 5.5 does the “spill” heuristics differently (fewer spill bytes, yet a slower kernel)…and in a bad way for me. Hmm…
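To put the three runs side by side, here is a small sketch that computes each run's slowdown relative to the CUDA 5.0, cc35 build. The numbers are copied from the timer output above; the run labels and the helper slowdown_pct are made up for illustration.

```python
# Timings in milliseconds, copied from the three runs in this post.
runs = {
    "5.0, cc35": {"total": 2831.156, "kernel": 2425.110, "xfer": 382.362},
    "5.0, kepler": {"total": 3166.526, "kernel": 2492.151, "xfer": 650.974},
    "5.5, cc35": {"total": 3501.825, "kernel": 2824.677, "xfer": 653.633},
}


def slowdown_pct(run, baseline, key):
    """Percentage by which run[key] exceeds baseline[key]."""
    return 100.0 * (run[key] / baseline[key] - 1.0)


baseline = runs["5.0, cc35"]
for name, run in runs.items():
    parts = ", ".join(
        f"{key} {slowdown_pct(run, baseline, key):+6.1f}%"
        for key in ("total", "kernel", "xfer")
    )
    print(f"{name:12s} {parts}")
```

On these numbers the kernel is only about 3% slower for the sm_30 build and about 16% slower with CUDA 5.5; a large share of the difference in the totals comes from the data transfer time, which is roughly 70% higher in both slower runs.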

Thanks,
Matt

Hi Matt,

“I see that cc35 isn’t here (although it is in the man page), so I wondered is cc35 discouraged?”

The x in “cc3x” is meant to indicate “insert number here”, e.g. “cc30” or “cc35”. I’ll see if we can make this clearer. So yes, cc35 is still supported, and encouraged if you have a CC 3.5 device.

Why CUDA 5.5 is giving slower performance is a mystery, though.

  • Mat