My OpenMP code is not speeding up the processing; instead it adds time to the overall run. I suspect the extra time comes from OpenMP overhead, but I'd like to confirm whether my suspicion is right, or whether I did something wrong or can improve the code.
Would someone please point out what I'm doing wrong in my attempt to parallelise these nested loops?
Thanks!
!$omp parallel private(i, j, detax, detay, l, alpha_k, alpha) shared(x_c_den, dis_pair)
!$omp do
do j = 1, n_cp
   do i = 1, n_pairs(j)
      detax = x_c_den(1, pairs(i,j)) - x_c_den(1, j)
      detay = x_c_den(2, pairs(i,j)) - x_c_den(2, j)
      l = dis_pair(i,j)
      if (l < 2.0*h .and. l > 0.0) then
         if (abs(detay) < 1.0e-5) then
            if (detax > 0) then
               alpha_k = 0.0
            else if (detax < 0) then
               alpha_k = 180.0
            endif
         else if (abs(detax) < 1.0e-5) then
            if (detay > 0) then
               alpha_k = 90.0
            else if (detay < 0) then
               alpha_k = -90.0
            endif
         else if (detax > 0.0) then
            alpha_k = atan(detay/detax)*r_to_d
         else if (detax < 0.0 .and. detay > 0.0) then
            alpha_k = atan(detay/detax)*r_to_d + 180
         else if (detax < 0.0 .and. detay < 0.0) then
            alpha_k = atan(detay/detax)*r_to_d - 180
         endif
         alpha = acos(0.5*l/h)*r_to_d
         angle1(i, j) = alpha_k - alpha
         angle2(i, j) = alpha_k + alpha
         if (angle1(i, j)*angle2(i, j) < 0.0) then
            n_angle2_temp(j) = n_angle2_temp(j) + 1
            angle2_temp(n_angle2_temp(j), j) = angle2(i, j)
            angle2(i, j) = 0.0
         endif
      endif
   enddo
enddo
!$omp end do
!$omp end parallel
However, now the problem is that the values are not what I expect. Would you advise on this?
I'd look at the additional collapsed loops to see if there are any race conditions. Do you need to privatize any variables? Do you need to protect any updates to shared variables with a critical section?
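For illustration, here is a minimal, self-contained sketch of the pattern that would bite in your first loop if both loops were collapsed: an increment of a per-column counter followed by a store that uses the new value (like n_angle2_temp(j) and angle2_temp) becomes a race once several threads can work on the same j, and one way to make it safe is a critical section. The names counter and buffer are placeholders, not from your code:

program race_demo
   implicit none
   integer, parameter :: n_cp = 100, n_max = 50
   integer :: i, j
   integer :: counter(n_cp)
   real    :: buffer(n_max, n_cp)

   counter = 0
   buffer  = 0.0

   ! With collapse(2) several threads may work on the same j at once,
   ! so the increment of counter(j) and the store that uses the new value
   ! must be protected as one unit; !$omp atomic alone is not enough here.
   !$omp parallel do private(i, j) collapse(2)
   do j = 1, n_cp
      do i = 1, n_max
         !$omp critical
         counter(j) = counter(j) + 1
         buffer(counter(j), j) = real(i + j)
         !$omp end critical
      enddo
   enddo
   !$omp end parallel do

   print *, 'total stores:', sum(counter)
end program race_demo

Note that your original version, which distributes only the outer j loop, gives each thread its own set of j values, so that particular update is not a race there; it only becomes one if the loops are collapsed. Also keep in mind that a critical section serialises those updates, so it can eat into whatever speed-up you gain.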
Perhaps you can advise me on how you would modify my code if you were using it?
I suggest we start with a simpler example:
!$omp parallel do private(i, j) collapse(2)
do j = 1, n_cp
   do i = 1, max_p_pairs + max_a_temp
      if (angle2_t(i,j) <= 0.0 .or. abs(angle2_t(i,j)) < 1.0e-6) then
         if (angle1_t(i,j) < 0.0) then
            angle1_t(i,j) = angle1_t(i,j) + 360
            angle2_t(i,j) = angle2_t(i,j) + 360
         endif
      endif
   enddo
enddo
!$omp end parallel do
Is this right? While the values are correct, the speed-up isn't there.
Are you running on a multi-socket NUMA system? If so, one possibility is that the arrays are located on a single NUMA memory node, which slows down the code when they are accessed from the processors attached to another NUMA node. To fix this, make sure you initialize your arrays in parallel so that the memory is distributed across the memory nodes (i.e. the "first-touch" rule), as in the sketch below.
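A minimal sketch of a first-touch initialisation, reusing the array names and bounds from your second snippet (the zero initial values and the static schedule are assumptions for illustration; the key point is that the initialisation loop uses the same thread-to-column mapping as the compute loop that reads the arrays later):

! First-touch initialisation: each thread touches the columns it will
! later work on, so the pages end up on that thread's NUMA node.
!$omp parallel do private(i, j) schedule(static)
do j = 1, n_cp
   do i = 1, max_p_pairs + max_a_temp
      angle1_t(i, j) = 0.0
      angle2_t(i, j) = 0.0
   enddo
enddo
!$omp end parallel do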
Another consideration is thread-to-processor binding. Do you see improvements if you either set the environment variable "MP_BIND=true" or bind using the "numactl" utility?
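For example (the executable name is just a placeholder, and OMP_PROC_BIND is the portable OpenMP setting if your runtime does not use MP_BIND):

# Let the OpenMP runtime pin threads to cores (portable since OpenMP 3.1)
export OMP_PROC_BIND=true
./my_program

# Or restrict the run to one NUMA node with numactl
numactl --cpunodebind=0 --membind=0 ./my_program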