Fortran with OpenMP: almost no speedup

Hello!

I use PGI Visual Fortran 13.9 and OpenMP to accelerate my program (compiled with the -mp option). On a Core 2 Duo T5750 with 2 cores I obtain a 1.6x speedup. However, when I switch to an i7-4710HQ or an i7-4770, each with 4 cores, the speedup is only 1.2-1.5x.

When I compile the same code with Intel Visual Fortran 2013, the OpenMP version achieves a 4x speedup on the i7 CPUs.

Why might that be? Thank you!

Nightwish

Hi Nightwish,

It would be difficult to answer your question without seeing a snippet of the code that uses OpenMP. There are many possible reasons for the lack of scaling, but it is certainly possible that the PGI runtime is handling your OpenMP code less efficiently than the Intel compiler does. Could you provide a code snippet to help us see exactly what might be going on here?

Thanks,

+chris

Thank you, chris!

Could you give me an e-mail address so that I can send you the source code?

Thanks,

Nightwish

Hi Nightwish,

You can send code to PGI Customer Service (trs@pgroup.com) and then ask them to forward it on to Chris.

  • Mat

I have sent the code to that e-mail address.

Thank you!

Nightwish

Hi Nightwish,

I received your code, thanks.

One thing I have observed about your code is that its running time is very short. If I compile it without the -mp flag, forcing single-threaded mode, it runs in well under a second:

cparrott@galaxy $ time ./app

(output deleted)

real 0m0.069s
user 0m0.059s
sys 0m0.004s

Now if I recompile with -mp and run with OMP_NUM_THREADS=1, it actually runs noticeably slower, probably due to the fixed-cost overhead of the OpenMP support code in the PGI runtime library:


cparrott@galaxy $ OMP_NUM_THREADS=1 time ./app

(output deleted)

2.92user 0.01system 0:00.34elapsed 844%CPU (0avgtext+0avgdata 15104maxresident)k
0inputs+0outputs (0major+1466minor)pagefaults 0swaps

Note the wall times here: 0.34 seconds with OpenMP vs. 0.069 seconds without. It is very difficult to gauge any meaningful scaling behavior with such short running times.

Do you have another example which runs longer?
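
For reference, something along these lines (a made-up kernel, not your code, timed with omp_get_wtime() from omp_lib) is roughly the scale of test that makes a 1-thread vs. 4-thread comparison meaningful:

program scaling_check
  use omp_lib
  implicit none
  integer, parameter :: n = 50000000
  integer :: i
  real(8) :: s, t0, t1

  s  = 0.0d0
  t0 = omp_get_wtime()
!$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + sin(dble(i)) * cos(dble(i))   ! stand-in for real work
  end do
!$omp end parallel do
  t1 = omp_get_wtime()

  print *, 'sum     =', s
  print *, 'elapsed =', t1 - t0, 'seconds on', omp_get_max_threads(), 'threads'
end program scaling_check

Anything that keeps the timed region busy for a second or more per run will do; the point is simply that the work has to dwarf the runtime's fixed startup cost.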

Best regards,

+chris

Hi chris!

I’m sorry, there may have been a problem with the code I sent you previously. The executable needs to read some *.csv files as input data, and the previous version used a fixed path. Since the directory is different on your PC, the executable cannot open the csv files, so the command window probably closed almost immediately.

I have sent you a revised version that reads the csv data files from the release directory and changes the current directory automatically. Please test with the revised version.

Thank you very much!

Nightwish

Hi Nightwish,

I received your updated code, thanks for sending it.

I will look at it and follow up when I have more information.

Thanks,

+chris

Nightwish,

I am still looking at the performance of your code.

However, I did observe one semantic problem with it:

When I compiled your code at -O0 -mp, I got a crash. It turns out you were accessing some shared variables inside a subroutine in a non-threadsafe manner. Probably the easiest fix is to change init_genrand() as follows:

!$omp critical
mt(0) = seed
!$omp end critical
latest = seed
DO mti = 1, n-1
  latest = IEOR( latest, ISHFT( latest, -30 ) )
  latest = latest * 1812433253 + mti
  !$omp critical
  mt(mti) = latest
  !$omp end critical
END DO

I will follow up further as I know more.

Best regards,

+chris

Hi Nightwish,

I profiled your application, and a few things stand out. You may want to revisit some design choices in your code and look for ways to optimize it further.

This profile accounts for nearly 90% of the running time of your application:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 22.64    877.00   877.00   860317     0.00     0.00  sceuafunction_xajsimulate2_
 19.17   1619.41   742.41                             _mp_barrier_tw
 14.43   2178.39   558.98                             __hpf_dealloc03
  7.50   2468.89   290.50                             save_alloc
  7.00   2740.15   271.26                             __alloc04
  6.60   2995.99   255.83                             use_alloc
  2.44   3090.50    94.51                             __fmth_i_dlog
  2.11   3172.13    81.63                             reuse_alloc
  1.84   3243.43    71.30                             _mp_p2
  1.42   3298.45    55.02                             __fmth_i_exp
  1.37   3351.64    53.19                             pgf90_alloc04
  1.02   3391.12    39.48                             pgf90_alloc04_chk

Some thoughts:

  1. It looks like sceuafunction_xajsimulate2_ is the main computational portion of your application. Note that only around 23% of the running time (877 seconds) is spent there. If the remainder of the total running time were spent in other functions in your app, this would not be a big deal. However, as you will see below, this is not the case: over 60% of the running time of your code is spent in the PGI runtime, for various reasons.

  2. Note how much time is spent in the PGI runtime doing memory management: look at the functions __hpf_dealloc03, save_alloc, __alloc04, use_alloc, reuse_alloc, etc. This adds up to a significant chunk of running time in your application - somewhere around 40%, or roughly 1550 seconds.

I suspect that on every iteration of your main computational loop, you are dynamically allocating and deallocating the data structures (memory) used in the computation. Consider optimizing the loop by eliminating any unnecessary allocations and deallocations. For example, is it possible to allocate the memory once before the first loop iteration, reuse it on subsequent iterations, and deallocate it only after the final iteration? That would eliminate all of these potentially unnecessary allocation and deallocation calls, which appear to be dragging down your performance (see the sketch after this list).

  3. Your code is spending around 19% of the total time in OpenMP barriers (the _mp_barrier_tw function in the runtime). This usually happens at the end of parallel loops, as threads finish: threads that complete their share of the loop iterations spin-wait here until the other threads have also finished. This may indicate poor load balancing, where certain iterations have much more work to do than others, or it could be a side effect of the allocation/deallocation pattern observed above. It’s a bit hard to say, as I didn’t profile your code on a per-thread basis, but this might give you some ideas.
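
To illustrate the kind of restructuring I have in mind, here is a minimal sketch (the program, array names, and sizes are invented for illustration, not taken from your code) of per-thread workspace that is allocated once, reused on every iteration, and freed only after the loop is done:

program reuse_workspace
  implicit none
  integer, parameter :: niter = 100000, nwork = 1000
  real(8), allocatable :: work(:)
  real(8) :: total
  integer :: i, j

  total = 0.0d0
!$omp parallel private(work, j) reduction(+:total)
  allocate(work(nwork))            ! one allocation per thread, not one per iteration
!$omp do
  do i = 1, niter
     do j = 1, nwork
        work(j) = dble(i + j)      ! stand-in for the real computation
     end do
     total = total + sum(work)
  end do
!$omp end do
  deallocate(work)                 ! one deallocation per thread
!$omp end parallel

  print *, 'total =', total
end program reuse_workspace

With this structure the allocator is called a handful of times instead of once per iteration, which should take most of the __alloc04 / __hpf_dealloc03 / save_alloc time out of the profile.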

Hope this helps.

Best regards,

+chris

Thank you, chris, for your kind reply!

I do not think this is a problem or a bug; there is no data race, so we do not need to add critical sections here. At line 111 of SceuaFunction.f90 I declare mt(0:623,0:npt-1) and mti(npt), where npt is the number of threads. At lines 156 and 158 of SceuaFunction.f90 I declare mt(0:623,0:ngs-1) and mti(ngs), where ngs is the number of threads.

At lines 121 and 123 of SceuaFunction.f90, when calling init_genrand and grnd, I pass only one column (mt(:,i-1)) and one element (mti(i)) of the shared arrays mt and mti to the corresponding thread. The same pattern appears at lines 169, 199, 243, and so on. Each thread therefore reads and writes only its own column and element of the shared arrays, so there is no data race.
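
To make the layout clearer, here is a minimal, self-contained sketch of the indexing pattern I mean (it is not my actual generator code, just an illustration of one column and one element per thread):

program per_thread_columns
  use omp_lib
  implicit none
  integer, parameter :: n = 624
  integer, allocatable :: mt(:,:), mti(:)
  integer :: npt, i, k

  npt = omp_get_max_threads()
  allocate(mt(0:n-1, 0:npt-1), mti(npt))

!$omp parallel do private(k)
  do i = 1, npt
     mti(i) = i                     ! thread i touches only its own element of mti
     do k = 0, n-1
        mt(k, i-1) = k + i          ! and only its own column of mt
     end do
  end do
!$omp end parallel do

  print *, 'column sums:', (sum(mt(:, i-1)), i = 1, npt)
end program per_thread_columns

Because no two iterations ever write to the same column or element, the shared arrays can stay shared without any critical section.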

Nightwish

Thank you very much for your kind reply!

I agree with your advice. There are many allocate and deallocate statements in the parallel region, and I think the code would run faster if I followed your suggestion. I also remember that when I use benchmark functions that do not allocate or deallocate inside the parallel region, I obtain very good speedup - usually 6-7x on a 4-core i7.

Unfortunately, following your advice would require revising a great deal of code. In addition, for some arrays in the parallel region I do not know the sizes in advance, so I have to allocate them dynamically.

In any case, I accept your advice and agree with your analysis. Thank you very much for your hard work!

By the way, since the Intel compiler already gives a 4x speedup on the 4-core i7 CPUs, and revising the code for PGI would take me too much time right now, I will not revise it now but will do so when I have enough time. For the moment I have decided to use the Intel-compiled version.

I am also writing a CUDA Fortran version of this program. When it is finished, I will compare the speedup of the CUDA version against the Intel serial and OpenMP CPU versions.

Thank you again!

Best regards,

Nightwish

Nightwish,

You are correct - the crash I saw was apparently unrelated to this. Sorry, please disregard my previous comment here!

+chris

Hi Nightwish,

Glad I could help. I will note your observation about runtime performance of dynamic memory allocation vs. the Intel compiler as a potential future RFE. Perhaps there is something we could do better on our end here.

Best regards,

+chris

Thank you for your attention! It doesn’t matter, thank you all the same!

Nightwish

Thank you for your kind reply!

I hope PGI Fortran will get even better.

Nightwish