OpenMP + OpenACC problem

Hi all!

I have a problem with a multi-GPU run: the code runs about 100 times slower than on a single GPU.

Single-GPU version.

Memory allocation:

!$acc data copyin(h1(0:n1), h2(0:n2), h3(0:n3), h1plus(-1:n1), h2plus(-1:n2), h3plus(-1:n3))&
!$acc copyin(u(-1:n3,-1:n2,-1:n1 + 1), v(-1:n3,-1:n2 + 1,-1:n1)) &
!$acc copyin(w(-1:n3 + 1,-1:n2,-1:n1), t(0:n3,-1:n2 + 1,-1:n1 + 1)) &
!$acc copyin(p(0:n3 - 1,0:n2 - 1,0:n1 - 1), eig(0:n3 - 1,0:n2 - 1,0:n1 - 1)) &
!$acc create(dp(0:n3 - 1,0:n2 - 1,0:n1 - 1), u_new(-1:n3,-1:n2,-1:n1 + 1), v_new(-1:n3,-1:n2 + 1,-1:n1), w_new(-1:n3 + 1,-1:n2,-1:n1), t_new(0:n3,-1:n2 + 1,-1:n1 + 1))

After that I use a kernels region to run several 3D loops:

!$acc kernels present(h1,h2,h3,h1plus,h2plus,h3plus,u,v,w,u_new,v_new,w_new,p,t)
!$acc loop collapse(3) gang
	do i = 1, n1 - 1
		do j = 0, n2 - 1
			do k = 0, n3 - 1
...
			end do
		end do
	end do
!$acc loop collapse(3) gang
	do i = 0, n1 - 1
		do j = 1, n2 - 1
			do k = 0, n3 - 1
...
			end do
		end do
	end do
!$acc loop collapse(3) gang
	do i = 0, n1 - 1
		do j = 0, n2 - 1
			do k = 1, n3 - 1
...
			end do
		end do
	end do
!$acc end kernels

It takes 3.0860900878906250E-003 seconds (about 3.1 ms) to run this code on a single Tesla V100.

Since there are 4 GPUs available on a single cluster node, I decided to take advantage of them.

Multi-GPU version.
I changed memory allocation:

!$omp parallel private(iam, devicenum) num_threads(ngpus)
  iam = omp_get_thread_num()
  devicenum = iam + 1
  call acc_set_device_num(devicenum,acc_device_nvidia)

!$acc enter data copyin(h1(0:n1), h2(0:n2), h3(0:n3), h1plus(-1:n1), h2plus(-1:n2), h3plus(-1:n3))&
!$acc copyin(u(-1:n3,-1:n2,-1:n1 + 1), v(-1:n3,-1:n2 + 1,-1:n1)) &
!$acc copyin(w(-1:n3 + 1,-1:n2,-1:n1), t(0:n3,-1:n2 + 1,-1:n1 + 1)) &
!$acc copyin(p(0:n3 - 1,0:n2 - 1,0:n1 - 1), eig(0:n3 - 1,0:n2 - 1,0:n1 - 1)) &
!$acc create(dp(0:n3 - 1,0:n2 - 1,0:n1 - 1), u_new(-1:n3,-1:n2,-1:n1 + 1), v_new(-1:n3,-1:n2 + 1,-1:n1), w_new(-1:n3 + 1,-1:n2,-1:n1), t_new(0:n3,-1:n2 + 1,-1:n1 + 1))
!$omp end parallel

I divided the work between GPUs along the last dimension. The start and stop indices are now stored in arrays:

!$omp parallel private (iam) num_threads(ngpus)
  iam = omp_get_thread_num()

!$acc kernels present(h1,h2,h3,h1plus,h2plus,h3plus,u,v,w,u_new,v_new,w_new,p,t) 
!$acc loop collapse(3) gang 
	do i = 1, n1 - 1
		do j = 0, n2 - 1
			do k = UVPz_start(iam), UVPz_end(iam)
...
			end do
		end do
	end do
!$acc loop collapse(3) gang 
	do i = 0, n1 - 1
		do j = 1, n2 - 1
			do k = UVPz_start(iam), UVPz_end(iam)
...
			end do
		end do
	end do
!$acc loop collapse(3) gang
	do i = 0, n1 - 1
		do j = 0, n2 - 1
			do k = Wz_start(iam), UVPz_end(iam)
...
			end do
		end do
	end do
!$acc end kernels

!$omp end parallel
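For reference, the start/stop arrays could be filled with a near-even block split of the k range; this is only a sketch of one possible decomposition (the actual splitting logic is not shown in the post):

```fortran
! Hypothetical sketch: block decomposition of k = 0 .. n3-1 across ngpus.
! UVPz_start/UVPz_end are the names from the post; the logic is assumed.
integer :: iam, block_size
block_size = (n3 + ngpus - 1) / ngpus               ! ceiling division
do iam = 0, ngpus - 1
   UVPz_start(iam) = iam * block_size
   UVPz_end(iam)   = min((iam + 1) * block_size - 1, n3 - 1)
end do
```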

This code now takes 0.3 seconds to run, roughly 100 times slower.

I don’t really understand what I am doing wrong. Can somebody help me with this?

P.S. I already sent my code to Mat here https://forums.developer.nvidia.com/t/openacc-on-gpu-help/135676/1

Also, I would like to know whether OpenACC uses peer-to-peer transfers, since I will need to use an FFT similar to the link above and I don’t have much time to create an MPI program.

Hi GR3EM,

I’m not seeing anything wrong with the new code snippets you posted, and it looks like the correct way to include OpenMP.

Have you done any performance analysis to see where the extra time is coming from?

I see that you have some internal timers around the “kernels” regions, but it would be good to run the code through a profiler such as PGPROF and compare the device kernel execution times, the data transfer times, as well as the CPU time.

Full details on using PGPROF can be found at: https://www.pgroup.com/resources/docs/19.1/x86/profiler-users-guide/index.htm

Note, I’d suggest using the pgprof options “--cpu-profiling-thread-mode separated --cpu-profiling-mode top-down”. By default, the profiler aggregates the OpenMP thread times.

If the aggregate kernel times are about the same, it may be an issue with OpenMP, such as overhead or a binding problem. Be sure to bind each OpenMP thread to a CPU core on the socket its GPU is attached to. To get this info, run “nvidia-smi topo -m”. For example, my 4-V100 system shows the following:

% nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NODE    SYS     SYS     0-19,40-59
GPU1    NODE     X      SYS     SYS     0-19,40-59
GPU2    SYS     SYS      X      NODE    20-39,60-79
GPU3    SYS     SYS     NODE     X      20-39,60-79

So here, I’d want to bind threads 0 and 1 to the first socket (cores 0-19) and threads 2 and 3 to the second (cores 20-39).

Also, I would like to know whether OpenACC uses peer-to-peer transfers, since I will need to use an FFT similar to the link above and I don’t have much time to create an MPI program.

Not directly, but if you use an MPI library, such as OpenMPI, built with CUDA-aware MPI support, then the device data will be transferred peer-to-peer.

See: https://www.open-mpi.org/faq/?category=runcuda

When transferring device data, put your MPI calls inside an OpenACC “host_data” region so that a device address for the device data is passed to the call rather than the host address.
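As a sketch (buffer names, count, and neighbor ranks here are placeholders, not from the original code), a halo exchange with CUDA-aware MPI might look like:

```fortran
! Sketch: host_data passes device addresses to a CUDA-aware MPI call.
! sendbuf, recvbuf, cnt, left, and right are hypothetical.
!$acc host_data use_device(sendbuf, recvbuf)
call MPI_Sendrecv(sendbuf, cnt, MPI_DOUBLE_PRECISION, right, 0, &
                  recvbuf, cnt, MPI_DOUBLE_PRECISION, left,  0, &
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
!$acc end host_data
```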

See Course #2 at: https://developer.nvidia.com/openacc-advanced-course for an in-depth training on using MPI+OpenACC.

-Mat

Hi Mat, thanks for the quick response.

Profiling multi GPU
Profiling single GPU

nvidia-smi shows me this:

%
	GPU0	GPU1	GPU2	GPU3	mlx5_0	mlx5_1	CPU Affinity
GPU0	 X 	PIX	NODE	NODE	NODE	NODE	0-15,32-47
GPU1	PIX	 X 	NODE	NODE	NODE	NODE	0-15,32-47
GPU2	NODE	NODE	 X 	PIX	PIX	NODE	0-15,32-47
GPU3	NODE	NODE	PIX	 X 	PIX	NODE	0-15,32-47
mlx5_0	NODE	NODE	PIX	PIX	 X 	NODE	
mlx5_1	NODE	NODE	NODE	NODE	NODE	 X

How can I bind threads now?

There is also another problem. All 3 kernels can run simultaneously, but when I add the async clause, nvvp shows that the kernels are not overlapping; in fact, the distance between kernel launches increased. Is there anything I can do about that?

I also attach a link to the profiler results if you want to view the huge gaps between kernel launches on different GPUs.

nvvp

How can I bind threads now?

Looks like you have a single-socket system, so binding won’t matter as much, but it won’t hurt.

There are several ways to bind CPU threads to cores.

If you are using the PGI non-LLVM back-end (the default in 18.10), then you can set the environment variable “export MP_BIND=true” and optionally “MP_BLIST=0,1,2,3” to give the order of the cores to bind.

If you are using the PGI LLVM back-end (default in 19.1, in 18.10 add the “-Mllvm” flag or set your path to “$PGI/linux86-64-llvm/18.10/bin”), use “export OMP_PROC_BIND=true” and optionally “OMP_PLACES={0},{1},{2},{3}” to set the order.

If installed on your system, you can instead use the utilities “numactl” or “taskset”:

numactl -C 0,1,2,3 <a.out>
taskset -c 0,1,2,3 <a.out>


Your nvvp profiles failed to open on my desktop system (probably due to a CUDA version mismatch). I’ll look at updating my desktop when I get a chance (I normally use the command-line profiler).

There is also another problem. All 3 kernels can run simultaneously, but when I add the async clause, nvvp shows that the kernels are not overlapping; in fact, the distance between kernel launches increased. Is there anything I can do about that?

Look for anything that might block the CPU thread, such as data coming back from the kernel via implicit data copies or reduction variables.

Also, if the kernels are very fast, then the launch overhead may be dominating the overall runtime.

Finally, each OpenMP thread will have its own async queues. Hence, you’ll only want to use async where it makes sense within each OMP thread, not to create asynchronous behavior between threads. Also, there’s an overhead cost to creating queues, so if a queue is only used once, it’s best not to use it.
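To illustrate that last point, async would be paired with a wait inside the same OpenMP thread; a minimal sketch (the queue number and loop body are placeholders):

```fortran
! Sketch: each OpenMP thread drives its own async queue on its own GPU.
!$acc kernels async(1) present(u, u_new)
! ... device work for this thread's slice ...
!$acc end kernels
! ... other independent host work could overlap here ...
!$acc wait(1)   ! this thread blocks until its queue 1 finishes
```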

-Mat

Hi!

I have written an MPI version, but I can’t run the executable.

I use the MPI that ships with the 18.10 Community Edition compiler. My compilation line is:

/nethome/shchery/pgi/linux86-64/18.10/mpi/openmpi/bin/mpif90 -O3 -mp -Minfo=acc -acc -ta=tesla:cc70 -mcmodel medium -o ../mpiGPU -module modules RB_Solver3D.F90

I don’t add mpif90 to my PATH because I need to use the mpirun from

/common/runmvs/bin/mpirun

It uses scheduling.

When I try to run the executable I get an error:

./mpiGPU: error while loading shared libraries: libnuma.so: cannot open shared object file: No such file or directory

I can see that libnuma is installed at

/usr/lib64

but somehow it simply doesn’t work… Can you help me with that?

Try symlinking the stub libnuma.so inside linux86-64/18.10/lib so that it points to the real libnuma.

E.g. ln -s /path/to/pgi/linux86-64/18.10/lib/libnuma.so /usr/lib64/libnuma.so

Hi!

If I try to do it I get an error that the file «/usr/lib64/libnuma.so» exists. But if I use -sf then I get an error: Permission denied. Can’t delete «/usr/lib64/libnuma.so».

Hi all!

Can anybody help me with my problem?

Compiling with -mp=nonuma doesn’t help.

If I add /usr/lib64 to my LD_LIBRARY_PATH, then ldd on my executable shows libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f85fb7ce000)

But it still doesn’t run…

Hi GR3EM,

Usually setting LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH fixes these types of issues.

Are you running on a remote node that doesn’t have libnuma.so installed?

-Mat