I use PGI Visual Fortran 13.9 with OpenMP to accelerate my program (the -mp option is enabled). On a Core 2 T5750 with 2 cores I obtain a 1.6x speedup. However, when I switch to an i7-4710HQ or an i7-4770, each with 4 cores, the speedup is only 1.2-1.5x.
When I compile the same code with Intel Visual Fortran 2013, the OpenMP version achieves a 4x speedup on the i7 CPUs.
It would be difficult to answer your question without seeing a snippet of the code using OpenMP. There are many possible reasons for the lack of scaling, but it is certainly possible that the PGI runtime is handling your OpenMP code less optimally than the Intel compiler does. Could you provide a code snippet to help us see exactly what might be going on here?
One thing I have observed about your code is that the running time is very short. If I compile it without the -mp flag, forcing single-threaded mode, it runs in less than a second:
cparrott@galaxy $ time ./app
(output deleted)
real 0m0.069s
user 0m0.059s
sys 0m0.004s
Now if I recompile it with -mp and run it with OMP_NUM_THREADS=1, it actually slows down slightly, probably due to the fixed overhead of the OpenMP support code in the PGI runtime library:
Note the wall times here - 0.34 seconds vs. 0.069 seconds without OpenMP. It is very difficult to gauge any meaningful scaling behavior with such short running times.
I'm sorry, there may be some trouble with the code I sent you previously. The exe needs to read some *.csv files as input data. The previous version used a fixed path; since the directory is different on your PC, the exe cannot open the csv files, so the command window may close immediately.
I have sent you a revised version that reads the csv data files from the release directory and changes the current directory automatically. Please test using the revised version.
I am still looking at the performance of your code.
However, I did observe one semantic problem with it:
When I compiled your code at -O0 -mp, I got a crash. It turns out you were accessing some shared variables inside a subroutine in a non-thread-safe manner. Probably the easiest fix is to change init_genrand() as follows:
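For readers of this thread without the attachment, here is a minimal sketch of the kind of change I mean: passing the generator state in as arguments, so each thread updates only its own copy instead of touching module-level shared variables. The interface and names (seed, mt, mti) are assumptions based on the standard MT19937 reference code; the actual routine in SceuaFunction.f90 may differ.

```fortran
! Sketch only: thread-safe variant of init_genrand() that takes its
! state as dummy arguments instead of using shared module variables.
! The 0:623 extent follows the declarations discussed in this thread.
subroutine init_genrand(seed, mt, mti)
    implicit none
    integer, intent(in)    :: seed
    integer, intent(inout) :: mt(0:623)   ! one thread's state vector
    integer, intent(inout) :: mti         ! one thread's state index
    integer :: i

    mt(0) = seed
    do i = 1, 623
        ! standard MT19937 initialization recurrence (relies on
        ! wrap-around integer arithmetic, as most Fortran MT ports do)
        mt(i) = 1812433253 * ieor(mt(i-1), ishft(mt(i-1), -30)) + i
    end do
    mti = 624
end subroutine init_genrand
```

With this shape, each thread can pass its own column of a state array and there is no write contention between threads.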
I profiled your application, and a few things stand out. You may perhaps want to revisit some design choices in your code, and look for ways to better optimize it.
This profile accounts for nearly 90% of the running time of your application:
It looks like sceuafunction_xajsimulate2_ is the main computational portion of your application. Note that only around 23% of the running time (877 seconds) of the code is spent here. If the remaining portion of the total aggregate running time were spent in other functions in your app, this would not be a big deal. However, as you will observe below, this is not the case - over 60% of the running time of your code is spent in the PGI runtime, for various reasons.
Note how much time is spent in the PGI runtime doing memory management: look at the functions __hpf_dealloc03, save_alloc, __alloc04, use_alloc, reuse_alloc, etc. This adds up to a significant chunk of running time in your application - somewhere around 40%, or roughly 1550 seconds.
I suspect what may be happening is that on every iteration of your main computational loop, you are dynamically allocating and deallocating data structures (memory) used in the computation. You may want to consider optimizing your loop by eliminating any unnecessary allocations and deallocations here. For example, is it possible to allocate the memory only once before the first loop iteration, and then reuse it on subsequent iterations? Then deallocate it after the final iteration? This would eliminate all these potentially unnecessary allocation and deallocation calls, which appear to be slowing down your performance.
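As a rough illustration of the restructuring I have in mind - the array names and sizes here (work, n, niter) are placeholders, not identifiers from your actual code:

```fortran
! Sketch only: hoist allocation out of the hot loop so the allocator
! is hit once, not once per iteration.
program reuse_workspace
    implicit none
    real, allocatable :: work(:)
    integer :: iter, n, niter

    n = 1000
    niter = 100000

    ! Instead of allocate(work(n)) ... deallocate(work) inside every
    ! iteration, allocate once up front and reuse the same memory:
    allocate(work(n))
    do iter = 1, niter
        work = 0.0      ! reinitialize in place instead of reallocating
        ! ... computation that fills and consumes work ...
    end do
    deallocate(work)
end program reuse_workspace
```

If a size can change between iterations, you can still allocate once at the maximum size, or reallocate only when the required size actually grows.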
Your code spends around 19% of the total time in OpenMP barriers (the _mp_barrier_tw function in the runtime). This usually happens at the end of parallel loops: threads that finish their share of the iterations spin-wait here until the remaining threads have also finished. This may indicate poor load balancing, where some iterations have much more work to do than others. Or it could be a side effect of the allocation/deallocation behavior observed above. It's a bit hard to say, as I didn't profile your code on a per-thread basis, but this might give you some ideas.
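If the barrier time does turn out to be a load-balancing problem rather than allocator overhead, one low-effort experiment is a dynamic schedule on the main loop. A sketch with placeholder names (n, do_uneven_work) - not your actual code:

```fortran
! Sketch only: a dynamic schedule hands out iterations to threads as
! they become free, which helps when iteration costs vary widely.
! The chunk size (here 1) is a tuning knob worth experimenting with.
!$omp parallel do schedule(dynamic, 1)
do i = 1, n
    call do_uneven_work(i)   ! placeholder for the real loop body
end do
!$omp end parallel do
```

The default static schedule assigns equal-sized blocks of iterations up front, so a few expensive iterations can leave most threads idling at the barrier.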
I think it is not a problem or a bug; there is no data race, and we don't need to add critical sections here. In source file SceuaFunction.f90, at line 111, I declared mt(0:623,0:npt-1) and mti(npt), where npt is the number of threads. At lines 156 and 158 of SceuaFunction.f90, I declared mt(0:623,0:ngs-1) and mti(ngs), where ngs is the number of threads.
At lines 121 and 123 of SceuaFunction.f90, when calling init_genrand and grnd, I pass only one column (mt(:,i-1)) and one element (mti(i)) of the shared arrays mt and mti to the corresponding thread. The same pattern appears at lines 169, 199, 243, and so on. Therefore, each thread reads and writes only its own column and element of the shared arrays, so there is no data race.
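For reference, the calling pattern I described looks roughly like this - a sketch only, with the seed expression and loop bounds as placeholders:

```fortran
! Sketch only: each iteration (one per thread) passes its own column
! of mt and its own element of mti, so no two threads share RNG state.
!$omp parallel do
do i = 1, npt
    call init_genrand(4357 + i, mt(:, i-1), mti(i))
end do
!$omp end parallel do
```

Since Fortran arrays are column-major, mt(:,i-1) is a contiguous slice, so it can be passed to an explicit-shape dummy argument without the compiler creating a temporary copy.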
I do agree with your advice. There are many allocate and deallocate calls in the parallel region, and I think the code would run faster if I followed your suggestion. I also remember that when I use benchmark functions that do not allocate and deallocate in the parallel region, I obtain very good speedup - usually 6-7x on a 4-core i7 CPU.
Unfortunately, following your advice would require revising too much code. Additionally, for some arrays in the parallel region I do not know the sizes in advance, so I have to allocate them dynamically.
Still, I accept your advice and agree with your analysis. Thank you very much for your hard work!
By the way, because the Intel compiler gives a 4x speedup on 4-core i7 CPUs, and revising the code for PGI would take me too much time, I won't revise the code now; I will do so when I have enough time. I have decided to use the Intel-compiled version.
I am also writing a CUDA Fortran version of this program; when I finish it, I will compare the speedup of the CUDA version against the Intel CPU serial and OpenMP versions.
Glad I could help. I will note your observation about runtime performance of dynamic memory allocation vs. the Intel compiler as a potential future RFE. Perhaps there is something we could do better on our end here.