Different performance for cc20 and cc13


I noticed divergent results when compiling for compute capability 2.0 versus compute capability 1.3. According to the compiler feedback, the cc13 build uses more shared memory than the cc20 build, and its GPU occupancy is also higher.

pgi cc20:
63 registers, 56 shared, 228 constant, 0 local memory bytes, 33% occupancy

pgi cc13:
32 registers, 256 shared, 52 constant, 40 local memory bytes, 50% occupancy

The code is a 4th-order isotropic (ISO) stencil, written in Fortran.

Size    pgi_sm13 (s)    pgi_sm20 (s)
200     1.12235         1.38163
400     5.19504         6.02635
512     10.29905        11.59021
600     21.73567        25.39310
650     26.29589        30.57195

Is this a known behavior?


This is not a known behavior. If possible would you be able to send us an
example of this code so that we can investigate further?


I have tried creating a minimal working example using exactly the same directives as in the original code. Here are the compilation messages for cc13 and cc20:

CC 1.3: 21 registers, 2168 shared, 12 constant, 0 local memory bytes, 50% occupancy

CC 2.0: 38 registers, 2064 shared, 120 constant, 0 local memory bytes, 33% occupancy
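These occupancy figures can be reproduced with a back-of-the-envelope estimate. The sketch below (Python, with a hypothetical helper name) assumes a 512-thread block (matching the vector(512) clause in the code), a 16384-register file and 1024-thread limit per SM on cc1.3, and a 32768-register file and 1536-thread limit per SM on cc2.0; it ignores register allocation granularity and shared-memory limits:

```python
def reg_limited_occupancy(regs_per_thread, regfile_per_sm,
                          max_threads_per_sm, block_size=512):
    # Threads whose registers fit in the register file...
    threads = regfile_per_sm // regs_per_thread
    # ...rounded down to whole thread blocks.
    blocks = threads // block_size
    resident = min(blocks * block_size, max_threads_per_sm)
    return resident / max_threads_per_sm

# cc1.3: 21 registers/thread, 16384 registers and 1024 threads per SM
print(reg_limited_occupancy(21, 16384, 1024))   # -> 0.5 (50%)
# cc2.0: 38 registers/thread, 32768 registers and 1536 threads per SM
print(reg_limited_occupancy(38, 32768, 1536))   # -> 0.3333... (33%)
```

Under these assumptions, the register usage on either chip leaves room for only one resident 512-thread block per SM, which matches the reported 50% (512/1024) and 33% (512/1536).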

I can provide the actual code if needed; I just have to extract it and make it independently compilable. If this example does not serve the purpose, I will do that on Monday. Thanks for all the help. I am using CUDA 4.2 and PGI 12.3. Here is the entire code:

PROGRAM simpleFD25

        INTEGER                         :: nx, ny, nz           !grid points and stencil order
        REAL, DIMENSION(5)              :: c
        REAL                            :: time1, time2
        INTEGER                         :: i,j,k,l
        REAL, ALLOCATABLE               :: u(:,:,:)
        REAL, ALLOCATABLE               :: r(:,:,:)
        !$acc mirror(r)

        !prompt user to enter input
        WRITE(*,'(A)',ADVANCE="NO") "Enter NX NY NZ: "
        READ(*,*) nx, ny, nz

        ALLOCATE (u(0:nx,0:ny,0:nz), r(0:nx,0:ny,0:nz))
        u = 0.; r = 0.
        c = (/1.,1.,1.,1.,1./)
        FORALL (i=1:nx, j=1:ny, k=1:nz)
                u(i,j,k) = REAL(i+j+k)/(nx+nz+ny)
        END FORALL

        CALL cpu_time(time1)

        !$acc data region copyin(c,u)
        !$acc region
        DO l=1,4
        !$acc do parallel(32) unroll(2)
                DO i=4,nx-4
                !$acc do parallel(64)
                        DO j=4,ny-4
                        !$acc do vector(512)
                                DO k=4,nz-4
                                        r(i,j,k) = c(5) * u(i,j,k) + ( c(l) * u(i+l,j,k) + c(l) * u(i-l,j,k) )       &
                                        + ( c(l) * u(i,j+l,k) + c(l) * u(i,j-l,k) )                                  &
                                        + ( c(l) * u(i,j,k+l) + c(l) * u(i,j,k-l) )
                                END DO
                        END DO
                END DO
        END DO
        !$acc end region
        !$acc update host(r)
        !$acc end data region

        CALL cpu_time(time2)
        WRITE(*,*) "Time taken = ", (time2 - time1), "secs"

        DEALLOCATE(u, r)

END PROGRAM simpleFD25

  • Sayan

For now, I think this example will suffice.


Perhaps this divergence is because of the different front ends used in the previous and latest versions of NVCC. An NVIDIA employee on the forums suggested that I use “-ftz=true -prec-div=false -prec-sqrt=false” to bring my sm_20 program's performance closer to that of sm_13. Can you please let me know how I can pass NVCC options through PGI?


There is no way to pass these NVCC options through the PGI drivers. However, I believe we support the options you are trying to use. Pass the following suboptions to the -ta=nvidia option:

  • fastmath: enable the fast math library, which includes faster but lower-precision implementations of certain math functions

  • flushz: enable flush-to-zero mode for floating-point operations on the GPU
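A compile line using these suboptions might look like the following sketch (the source file name is hypothetical; -Minfo=accel prints the accelerator feedback quoted earlier):

```
pgfortran -ta=nvidia,cc20,fastmath,flushz -Minfo=accel simpleFD25.f90 -o simpleFD25
```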

Thank you very much for your advice. I do see a noticeable difference in performance when I use these options.


May I ask why it is not possible to pass NVCC options through the PGI driver? I think it could increase the flexibility of the compiler.