My OpenMP code is not speeding up the processing; instead it adds time to the overall run. I suspect the extra time comes from OpenMP overhead, but I'd like to confirm whether my suspicion is right, or whether I did something wrong or can improve the code.
Would someone please point out what I'm doing wrong in my attempt to parallelise these nested loops?
Thanks!
!$omp parallel private(i, j, detax, detay, l, alpha_k, alpha) shared(x_c_den, dis_pair)
!$omp do
do j = 1, n_cp
   do i = 1, n_pairs(j)
      detax = x_c_den(1, pairs(i,j)) - x_c_den(1, j)
      detay = x_c_den(2, pairs(i,j)) - x_c_den(2, j)
      l = dis_pair(i,j)
      if (l < 2.0*h .and. l > 0.0) then
         if (abs(detay) < 1.0e-5) then
            if (detax > 0) then
               alpha_k = 0.0
            else if (detax < 0) then
               alpha_k = 180.0
            endif
         else if (abs(detax) < 1.0e-5) then
            if (detay > 0) then
               alpha_k = 90.0
            else if (detay < 0) then
               alpha_k = -90.0
            endif
         else if (detax > 0.0) then
            alpha_k = atan(detay/detax)*r_to_d
         else if (detax < 0.0 .and. detay > 0.0) then
            alpha_k = atan(detay/detax)*r_to_d + 180
         else if (detax < 0.0 .and. detay < 0.0) then
            alpha_k = atan(detay/detax)*r_to_d - 180
         endif
         alpha = acos(0.5*l/h)*r_to_d
         angle1(i, j) = alpha_k - alpha
         angle2(i, j) = alpha_k + alpha
         if (angle1(i, j)*angle2(i, j) < 0.0) then
            n_angle2_temp(j) = n_angle2_temp(j) + 1
            angle2_temp(n_angle2_temp(j), j) = angle2(i, j)
            angle2(i, j) = 0.0
         endif
      endif
   enddo
enddo
!$omp end do
!$omp end parallel
However, now the problem is that the values are not what I expect. Would you advise on this?
I'd look at the additional collapsed loops to see if there are any race conditions. Do you need to privatize any variables? Do you need to protect any updates to shared variables with a critical section?
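For illustration, here is a minimal, self-contained sketch of the pattern that would bite in your first loop if both loops were collapsed: an increment of a per-column counter followed by a store that uses the new value (like n_angle2_temp(j) and angle2_temp) becomes a race once several threads can work on the same j, and one way to make it safe is a critical section. The names counter and buffer are placeholders, not from your code:

program race_demo
   implicit none
   integer, parameter :: n_cp = 100, n_max = 50
   integer :: i, j
   integer :: counter(n_cp)
   real    :: buffer(n_max, n_cp)

   counter = 0
   buffer  = 0.0

   ! With collapse(2) several threads may work on the same j at once,
   ! so the increment of counter(j) and the store that uses the new value
   ! must be protected as one unit; !$omp atomic alone is not enough here.
   !$omp parallel do private(i, j) collapse(2)
   do j = 1, n_cp
      do i = 1, n_max
         !$omp critical
         counter(j) = counter(j) + 1
         buffer(counter(j), j) = real(i + j)
         !$omp end critical
      enddo
   enddo
   !$omp end parallel do

   print *, 'total stores:', sum(counter)
end program race_demo

Note that your original version, which distributes only the outer j loop, gives each thread its own set of j values, so that particular update is not a race there; it only becomes one if the loops are collapsed. Also keep in mind that a critical section serialises those updates, so it can eat into whatever speed-up you gain.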
Perhaps you can advise me on how you would modify my code if you were using it?
I suggest we start with a simpler example:
!$omp parallel do private(i, j) collapse(2)
do j = 1, n_cp
   do i = 1, max_p_pairs + max_a_temp
      if (angle2_t(i,j) <= 0.0 .or. abs(angle2_t(i,j)) < 1.0e-6) then
         if (angle1_t(i,j) < 0.0) then
            angle1_t(i,j) = angle1_t(i,j) + 360
            angle2_t(i,j) = angle2_t(i,j) + 360
         endif
      endif
   enddo
enddo
!$omp end parallel do
Is this right? While the values are correct, the speed-up isn't there.
Are you running on a multi-socket NUMA system? If so, one possibility is that the arrays are located on a single NUMA memory node, which slows down the code when they are accessed from the processors attached to another NUMA node. To fix this, make sure you initialize your arrays in parallel so that the memory is distributed across the memory nodes (i.e. the "first-touch" rule), as in the sketch below.
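A minimal sketch of a first-touch initialisation, reusing the array names and bounds from your second snippet (the zero initial values and the static schedule are assumptions for illustration; the key point is that the initialisation loop uses the same thread-to-column mapping as the compute loop that reads the arrays later):

! First-touch initialisation: each thread touches the columns it will
! later work on, so the pages end up on that thread's NUMA node.
!$omp parallel do private(i, j) schedule(static)
do j = 1, n_cp
   do i = 1, max_p_pairs + max_a_temp
      angle1_t(i, j) = 0.0
      angle2_t(i, j) = 0.0
   enddo
enddo
!$omp end parallel do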
Another consideration is thread-to-processor binding. Do you see improvements if you either set the environment variable "MP_BIND=true" or bind using the "numactl" utility?
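For example (the executable name is just a placeholder, and OMP_PROC_BIND is the portable OpenMP setting if your runtime does not use MP_BIND):

# Let the OpenMP runtime pin threads to cores (portable since OpenMP 3.1)
export OMP_PROC_BIND=true
./my_program

# Or restrict the run to one NUMA node with numactl
numactl --cpunodebind=0 --membind=0 ./my_program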