Performance question: PGI vs GCC vs Intel C++

Hello:

I’m new to the PGI compilers and I’m testing the Community Edition with the file shown at the end of this message. It is a rough matrix multiplication implementation in C, intended only for testing purposes. My computer is a Lenovo W540 laptop running Debian, with a 4-core Intel Core i7-4800MQ 2.7 GHz processor. I ran the program both serially and with OpenMP on all 4 cores (without hyperthreading), using the following compilers:

GCC 6.3.0, using the flags -O3 and -fopenmp for parallel execution
ICC 14.0.3, using the flags -O3 and -openmp for parallel execution
PGCC 16.10, using the flags -fast and -mp for parallel execution

Now, the times in seconds I’ve obtained (average of three executions):

            serial         4 cores        speedup
gcc        34.1            20.1             1.70
icc        35.9            10.5             3.42
pgcc       75.0            21.4             3.50

I’m a bit confused by the PGI timings. While the serial times for gcc and icc are almost the same, the PGI serial time is more than twice as long. However, the PGI OpenMP speedup is the best of the three compilers.

I’ve obtained almost the same times with PGI using the optimization flags -O2 and -fast -Mipa=inline,fast. Is this behavior in serial execution normal with PGI? Should I use any other optimization flags?

Thanks

#include<stdio.h>
#include<stdlib.h>
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#define SIZE 2000
int main()
{
    size_t i=0,j=0,k=0;
    size_t posA=0,posB=0,posC=0;
    timestruct t1,t2;
    double* A=(double*)malloc(SIZE*SIZE*sizeof(double));
    double* B=(double*)malloc(SIZE*SIZE*sizeof(double));
    double* C=(double*)malloc(SIZE*SIZE*sizeof(double));
    double sum=0.0;
    for(i=0;i<SIZE*SIZE;i++)
    {
        A[i] = (double)i+10.0;
        B[i] = (double)i+10.0;
    }
    gettime(&t1);
#pragma omp parallel for default(none) \
 shared(C,A,B) \
 private(i,j,posC,sum,k,posA,posB)
    for(i=0;i<SIZE;i++)
    {
        for(j=0;j<SIZE;j++)
        {
            posC = i*SIZE+j;
            sum = 0.0;
            for(k=0;k<SIZE;k++)
            {
                posA = i*SIZE+k;
                posB = k*SIZE+j;
                sum += A[posA]*B[posB];
            }
            C[posC] = sum;
        }
    }
    gettime(&t2);
    printf("Time spent: %.5lf seconds\n",(double)usec(t1,t2)/1000000.0);
    free(A);
    free(B);
    free(C);
    return 0;
}

Hello,

One of the tricks used by compiler companies is to not perform the
calculations if they are never checked after the operation. Simple benchmarks
run much faster when this happens.

So if you are going to test for performance, check the answers to keep the
compilers honest.

Matmul always makes OpenMP look good, because matmuls are embarrassingly parallel.

Matmuls can be very fast with Fortran-coded routines, and very, very fast if
you call tuned math library routines like dgemm that run on the multiple cores of the
CPU. The Intel MKL is very good for this.

All these are nice, but to see truly outstanding performance, try running the matmul on a GPU. OpenACC will look very close to the OpenMP directives
you have used, but the result will run on the CPU (-ta=multicore) or a GPU
(-ta=tesla) or either (-ta=multicore,tesla) and can take advantage of the thousands of cores on the GPU. Even faster will be calling dgemm from the
CUDA library (cuBLAS), which is specially tuned for GPU execution.

If you want to compare performance today, you need to check results and
have very large data sets to see what is possible with home systems
now with GPUs.

dave

Thank you for your answer.

I’ve modified the code (it can be seen at the end of this message) to check the results and to use a different B matrix in the second product, but the timings are the same as in the previous example: PGI is about 2x slower than GCC and Intel in serial, and with OpenMP it is also slower than Intel (and similar to GCC), although its OpenMP speedup is the best.

Usually I use OpenBLAS, ATLAS or MKL for matrix multiplication, but I did this simple test to check the performance of the PGI Community Edition (although I understand that performance can vary depending on the program being compiled and run).

#include<stdio.h>
#include<stdlib.h>
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#define SIZE 2000
int main()
{
    size_t i=0,j=0,k=0;
    timestruct t1,t2;
    double* A=(double*)malloc(SIZE*SIZE*sizeof(double));
    double* B=(double*)malloc(SIZE*SIZE*sizeof(double));
    double* B1=(double*)malloc(SIZE*SIZE*sizeof(double));
    double* C=(double*)malloc(SIZE*SIZE*sizeof(double));
    double* D=(double*)malloc(SIZE*SIZE*sizeof(double));
    double sum=0.0;
    for(i=0;i<SIZE*SIZE;i++){
        A[i] = (double)i+10.0;
        B[i] = (double)i+10.0;
        B1[i] = (double)i+10.0;
    }
#pragma omp parallel for default(none) shared(C,A,B) private(i,j,sum,k)
    for(i=0;i<SIZE;i++){
        for(j=0;j<SIZE;j++){
            sum = 0.0;
            for(k=0;k<SIZE;k++){
                sum += A[i*SIZE+k]*B[k*SIZE+j];
            }
            C[i*SIZE+j] = sum;
        }
    }
    gettime(&t1);
#pragma omp parallel for default(none) shared(D,A,B1) private(i,j,sum,k)
    for(i=0;i<SIZE;i++){
        for(j=0;j<SIZE;j++){
            sum = 0.0;
            for(k=0;k<SIZE;k++){
                sum += A[i*SIZE+k]*B1[k*SIZE+j];
            }
            D[i*SIZE+j] = sum;
        }
    }
    gettime(&t2);
    printf("Time spent: %.5lf seconds\n",(double)usec(t1,t2)/1000000.0);
    for(i=0;i<SIZE*SIZE;i++){
        if(C[i]!=D[i]){
            printf("Different result!\n");
        }
    }
    free(A);
    free(B);
    free(B1);
    free(C);
    free(D);
    return 0;
}

I have no argument with your results. They make sense.
The community edition is the same as the professional one in 16.10.

If I were looking for the best performance on the Intel CPU for toy benchmarks
like a 2000 X 2000 real*8 matmul, I would probably choose Intel compilers.

Assuming you intend to develop/improve a real application with significant
data sets requiring a great deal of calculations, I would first try to get the
entire application running with one of the compilers. Once it is running, I would
look to where the most work is happening, and add OpenMP directives to
parallelize sections for the CPU. Alternatively, I would use OpenACC directives
to run parallel sections on either multiple cores of the CPU, or on one or more GPUs.

Once the code is running serially, each parallel section you enable with
directives can result in a working program that uses CPUs and GPUs, or
you can compile the program ignoring the directives and you should have
the original program back. This gives an incremental development path in which
each step can be sanity-checked against the original working code, enabling only one more parallel section on each iteration.

You always have a working program, and as you add directives you can verify correctness with each new section of code
moved to parallel execution.

You can also learn that moving small data sets to and from GPUs can be slower than running on a single CPU without data movement.

Good luck with what you are doing. There is a lot to discover about performance and development.

dave

Thank you for your answer.

I’ve also run several tests with the software described in http://www.sciencedirect.com/science/article/pii/S0098300415300200 which can be downloaded from https://bitbucket.org/jgpallero/rls and, for gcc and icc, I get results similar to the ones obtained for the matrix multiplication example. In this case I ran the tests on Debian as well, but with a 4-core Intel Core i5-2500 3.3 GHz processor. As input I used the file called ‘eurasia’, which can be downloaded from https://bitbucket.org/jgpallero/rls/downloads/ Once compiled, the execution command was

./bin/rls -t2 -g …/eurasia > …/eur_simplified

The times obtained (in seconds) were:

            serial         4 cores        speedup
gcc        48.8            28.5             1.71
icc        38.9            26.2             1.48
pgcc       Inf             Inf

The serial programs can be built with make (gcc by default), make CC=icc and make CC=pgcc. For the OpenMP version, the option PAROMP=s should be added in all cases.

As can be seen, the icc serial version is faster than the gcc one, but in this case the gcc OpenMP speedup is better than the icc one. The PGI behavior, however, is very strange: in both the serial and OpenMP versions the program never finishes. I can’t understand this behavior and I don’t know whether I’ve selected the wrong flags for the compilation steps (-c99 -Minform=warn -fast -Msmartalloc).

Could anyone check the software on another processor, please?

Thanks

Hi,

Have you tried using the flags to target the exact CPU you are using?
[e.g. for Intel, -xHost; for GNU, -march=native (I am not sure of the PGI flag)].

It could be that the Intel and GNU compilers are using AVX2 vectorization and PGI is not because it was not given the right flag.

AVX2 registers are 256 bits wide, so they hold 4 doubles instead of the 2 that fit in a 128-bit SSE2 register: a factor of “2”…

Just a thought.

Me again,

Just to mention that I have run a very large PCG code that runs over 20% faster in serial mode with PGI than Intel, so it could just be very problem dependent.