OMP loops

I am trying to add OpenMP directives to use GPUs with my code, but so far it’s been unsuccessful.

Here is a small example of what I tried to do:

!$omp target teams distribute parallel do map(from:sxzi) map(from:d1)
do j=2,this%npart
   do i=1,j-1
      ij=this%pair(i,j)
      do is=1,3
         sxzold=this%sxz(:,:,:,idet)
         opi=sx15(:,3+is,:,i)
         di=opi(1,:)*this%sp(1,i)+opi(2,:)*this%sp(2,i)+opi(3,:)*this%sp(3,i)+opi(4,:)*this%sp(4,i)
         d1=di(i)
         detinv=cone/d1
         di=di*detinv
         do m=1,this%npart
            sxzi(:,:,m)=sxzold(:,:,m)-di(m)*sxzold(:,:,i)
            sxzi(:,i,m)=opi(:,m)-di(m)*opi(:,i)
         enddo
         sxzi(:,:,i)=sxzold(:,:,i)*detinv
         sxzi(:,i,i)=opi(:,i)*detinv
      enddo
   enddo
enddo

where this%npart is a small number (let’s say around 10). If I add the OpenMP directives, the execution time becomes much longer. Is that because the loops are not very big? What can I try to improve this? Is this the right way to specify the OpenMP options?

Hard to say exactly without a reproducing example, but the time is most likely due to data movement; the small problem size doesn’t help either.

A single streaming multiprocessor (SM) on a GPU can run up to 2048 threads, and depending on which GPU you have, there can be hundreds of SMs. Even with one thread per iteration, you’re only using a small fraction of the GPU.

I don’t know how you’re handling device data. Are you using higher-level “target data” directives for arrays like sxz, opi, sp, etc., CUDA Unified Memory, or letting the compiler implicitly copy these arrays in? If you’re using Unified Memory, the data gets copied the first time it’s accessed, which would be in the kernel itself, so that may be the time you’re seeing. If you’re using “target data” or implicit copies, the data movement occurs before the kernel.
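
If you’re not already doing so, a structured “target data” region is the usual way to take explicit control: it places the arrays on the device once and keeps them resident across kernels. A minimal, self-contained sketch (generic array names, not your derived type) would look something like this:

program data_region_demo
  implicit none
  integer, parameter :: n = 1024
  real(8) :: a(n), b(n), c(n)
  integer :: k

  a = 1.0d0
  b = 2.0d0

  ! Copy a and b to the device once; bring c back at the end of the region.
  !$omp target data map(to: a, b) map(from: c)

  ! First kernel: the data is already resident, so there are no per-kernel copies.
  !$omp target teams distribute parallel do
  do k = 1, n
     c(k) = a(k) + b(k)
  end do

  ! A second kernel reusing the same device data amortizes the transfer cost.
  !$omp target teams distribute parallel do
  do k = 1, n
     c(k) = c(k) * b(k)
  end do

  !$omp end target data

  print *, 'c(1) = ', c(1)
end program data_region_demo

Both kernels then run against data that is already on the device, and the only transfers happen at the boundaries of the data region.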

You might consider running the code through the Nsight Systems profiler, since this will give you a better idea of where the time is being spent.
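
For example, something like “nsys profile -o report ./your_app” (where your_app stands in for your executable) produces a timeline you can open in the Nsight Systems GUI, showing how much time goes into the kernel itself versus the host-to-device and device-to-host transfers.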

Now, your code doesn’t look parallelizable as written: there are race conditions that would lead to wrong answers if you run it in parallel. Specifically, the “sxzold” and “sxzi” arrays are shared, so different threads will be overwriting each other. Since “sxzold” is a temp array, you can put it in a “private” clause so each thread has its own copy, but you won’t be able to privatize “sxzi”.

“d1” is also a problem. It looks to be a local scalar used for an intermediate computation, so it should be private, but you have it in a “map” clause, which makes it shared across threads. Multiple threads will be assigning “di(i)” to it at the same time, each with a different element, so “d1” ends up holding whichever thread updated it last. When detinv is computed, all but one thread will be using the wrong value. Also, the value returned to the host will be non-deterministic.
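
To make the distinction concrete, here is a minimal, self-contained sketch (generic names, not your code) of per-iteration temporaries declared private instead of mapped:

program private_demo
  implicit none
  integer, parameter :: n = 512, m = 8
  real(8) :: x(m, n), y(n)
  real(8) :: tmp(m)     ! per-iteration scratch array (like sxzold/opi/di)
  real(8) :: dinv       ! per-iteration scalar (like d1/detinv)
  integer :: j

  call random_number(x)

  ! tmp and dinv are temporaries that each iteration recomputes, so they go
  ! in a private clause, not a map clause; only y carries results back.
  !$omp target teams distribute parallel do private(tmp, dinv) map(to: x) map(from: y)
  do j = 1, n
     tmp  = 1.0d0 + x(:, j)
     dinv = 1.0d0 / tmp(1)
     y(j) = sum(tmp) * dinv
  end do

  print *, 'y(1) = ', y(1)
end program private_demo

Applied to your loop, that would mean dropping map(from:d1) and making “d1” private, along with the other per-iteration temporaries such as detinv, di, opi, and sxzold; only arrays that genuinely carry results back, like sxzi, need to stay mapped, and as noted above sxzi also needs a refactor before the loop is safe to parallelize.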

What can I try to improve this? Is this the right way to specify the OpenMP options?

You first need to refactor the code so that it’s parallelizable, meaning that each iteration of the “j” loop needs to be independent. Then see if you can scale up the problem so it can efficiently utilize the GPU. Finally, manage the data so that you’re reusing it across multiple kernels, or in one large kernel, in order to amortize the transfer cost.
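
As a purely structural sketch of what a refactored version could look like (generic names and a stand-in computation, not your actual update), one common way to remove the write conflicts is to give each (j, is) iteration its own output slice, build the result in a private scratch array, and keep the data on the device for the whole region:

program restructure_sketch
  implicit none
  integer, parameter :: npart = 10, nd = 4
  real(8) :: sxz(nd, npart, npart)             ! read-only input
  real(8) :: res(nd, npart, npart, 3, 2:npart) ! one output slice per (is, j)
  real(8) :: work(nd, npart, npart)            ! per-iteration scratch
  integer :: i, j, is, m

  call random_number(sxz)

  ! Keep the input resident on the device and copy back only the results.
  !$omp target data map(to: sxz) map(from: res)

  !$omp target teams distribute parallel do private(work, is, i, m)
  do j = 2, npart
     do is = 1, 3
        work = 0.0d0
        do i = 1, j - 1
           do m = 1, npart
              ! stand-in computation; the real update would go here
              work(:, m, i) = sxz(:, m, i) - sxz(:, m, j) * real(is, 8)
           end do
        end do
        ! each (j, is) iteration writes only its own slice, so there is no race
        res(:, :, :, is, j) = work
     end do
  end do

  !$omp end target data

  print *, 'res(1,1,1,1,2) = ', res(1, 1, 1, 1, 2)
end program restructure_sketch

With npart around 10 this still exposes only a handful of parallel iterations, so scaling up the problem, or batching many of these updates into one kernel, is what would let the GPU pay off.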