Optimized multi-threaded code is very slow


OK, you may have noticed that I keep posting messages about OpenMP. In a previous message I reported a problem with OpenMP that I have since resolved, and my code (which is built with PVF) now runs normally. I build with optimization enabled; these are the command-line options:

-Bstatic -Mbackslash -mp -I"c:\program files\pgi\win64\18.4\include" -I"C:\Program Files\PGI\Microsoft Open Tools 14\include" -I"C:\Program Files (x86)\Windows Kits\10\Include\shared" -I"C:\Program Files (x86)\Windows Kits\10\Include\um" -fast -O3 -tp=haswell-64,penryn-64,p7-64,sandybridge-64,px-64 -Minform=warn

I ran a relatively large problem, and I am confident that my code is well suited to OpenMP multi-threaded execution (I expect a speedup, and I do see one compared to the 1-thread case).

I was surprised to see that my program took 6 minutes to complete with 16 threads, while the exact same program, built with Intel’s compiler (at optimization level O2), takes about 25 seconds! I would not expect the PVF-built version to be roughly 15 times slower than the Intel-built version.

More importantly, the PVF-built program is SHOCKINGLY SLOW in read/write operations. However, I do not think that this is the main reason the program is so much slower.

I would be grateful for any thoughts on what might be causing the compiler to build a program that is so much slower than the corresponding Intel-built version, and on what I could do to speed up the read/write operations.

Thank you in advance for your help.

It is hard to know for sure. One possibility: I believe Intel defaults to using all available cores, whereas PGI OpenMP defaults to one core unless you set the OMP_NUM_THREADS environment variable or make an equivalent API call in your program. So check that.

As for the file I/O slowness, please let us know what type of file I/O you are doing. Fortran? formatted/unformatted/namelist…


Intel does not use all possible cores. By default, my code uses the number of threads set in the corresponding environment variable on my system. At the beginning of my program, I call omp_set_num_threads() and set the number of threads to the value I want (specifically, a value that I specify in my input file). I also monitor my system during runtime, and I see that the same number of threads is used with both the Intel and PGI compilers.
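For anyone who wants to confirm the thread count from inside the program rather than from a system monitor, here is a minimal sketch. It is written in C only to keep it short; the function name `actual_thread_count` is hypothetical, and the OpenMP runtime calls it uses (`omp_set_num_threads`, `omp_get_num_threads`) have the same names in the Fortran API. It requests a thread count and then reports how many threads actually form the team inside a parallel region.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: request `requested` threads, then return the size of
   the team that actually executes a parallel region. If the file is compiled
   without OpenMP enabled, the region is serial and the result is 1. */
int actual_thread_count(int requested)
{
    int team_size = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
    #pragma omp parallel
    {
        /* One thread records the team size; the implicit barrier at the end
           of `single` makes the value visible after the region. */
        #pragma omp single
        team_size = omp_get_num_threads();
    }
#else
    (void)requested;  /* unused when OpenMP is disabled */
#endif
    return team_size;
}
```

Calling this once right after the input file is read, and printing the result, would show whether both builds really run with the requested number of threads. A result of 1 from a build that was supposed to be parallel suggests the OpenMP flag was not applied to that file.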

The only (far-fetched) possibility that I have come up with is that PGI becomes much slower on cores with hyperthreading (which is the case on my system).

As for I/O, I have a combination of formatted/unformatted, with the majority of writes being formatted…
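To illustrate why the formatted/unformatted distinction matters for speed, here is a rough C analogue (not the Fortran from my program; the function names are hypothetical). The text path converts every value to decimal digits, much as a formatted WRITE does, while the binary path copies the bytes as they are, much as an unformatted WRITE does. Each function returns how much it wrote, so the two can be timed and compared on the same array.

```c
#include <stdio.h>
#include <stddef.h>

/* Text path: each value is converted to a 24-character field plus a newline.
   Returns the number of characters written, or -1 on failure. */
long write_formatted(const char *path, const double *data, size_t n)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) return -1;
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        int written = fprintf(f, "%24.16e\n", data[i]);
        if (written < 0) { fclose(f); return -1; }
        total += written;
    }
    return fclose(f) == 0 ? total : -1;
}

/* Binary path: the array is written in a single call with no conversion.
   Returns the number of bytes written, or -1 on failure. */
long write_unformatted(const char *path, const double *data, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t items = fwrite(data, sizeof(double), n, f);
    if (fclose(f) != 0 || items != n) return -1;
    return (long)(items * sizeof(double));
}
```

With this field width the text file takes 25 characters per value against 8 bytes for a typical double, and every value also pays for a number-to-text conversion. Timing both functions on the same data gives a rough idea of how much of an I/O slowdown could come from formatting alone.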