I was recently testing a few things for GPU optimization and trying to learn how to use Nsight Compute. The first simple kernel I tried was the STREAM triad kernel. I wanted to compare OpenACC and OpenMP (via the nvfortran compiler) for this task.
The OpenACC code was compiled with: nvfortran -acc [executable]
The OpenMP code was compiled with: nvfortran -mp=gpu [executable]
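For reference, adding -Minfo makes nvfortran print the offload schedule it generates for each loop, which is handy when comparing the two models. Something like the following (the file names are just placeholders, and I'm quoting the sub-options from memory):

nvfortran -acc -Minfo=accel triad.f90 -o triad_acc
nvfortran -mp=gpu -Minfo=mp triad.f90 -o triad_omp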
OpenACC code:
!$acc data copyin(b(1:n),c(1:n)) create(a(1:n))
!$acc parallel loop
do i = 1, n
  a(i) = b(i) + c(i) * alpha
end do
!$acc end data
OpenMP code:
!$omp target data map(to: b(1:n),c(1:n)) map(from: a(1:n))
!$omp target teams distribute parallel do
do i = 1, n
  a(i) = b(i) + c(i) * alpha
end do
!$omp end target data
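For completeness, a minimal driver that either directive block drops into would look something like the sketch below (the array size, the value of alpha, and the use of 8-byte reals are arbitrary choices on my part, not taken from the measurements; the OpenACC block is shown):

program triad_test
  implicit none
  integer, parameter :: n = 134217728      ! ~1 GiB per array with 8-byte reals (arbitrary choice)
  real(8), allocatable :: a(:), b(:), c(:)
  real(8) :: alpha
  integer :: i

  allocate(a(n), b(n), c(n))
  alpha = 3.0d0
  a = 0.0d0
  b = 1.0d0
  c = 2.0d0

  !$acc data copyin(b(1:n),c(1:n)) create(a(1:n))
  !$acc parallel loop
  do i = 1, n
    a(i) = b(i) + c(i) * alpha
  end do
  !$acc end data

  ! with create(a) the result stays on the device, so the host copy of a is still 0 here
  print *, 'host a(1) =', a(1)

  deallocate(a, b, c)
end program triad_test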
In Nsight Compute, the execution times were exactly the same. (I also tried C and C++, where OpenMP was actually a bit faster than OpenACC, on account of slightly higher memory throughput than the OpenACC version.) However, the OpenMP version showed much higher compute throughput than the OpenACC version, which I found odd.
In the figure above: green = OpenACC, blue = OpenMP.
Where are these extra FLOPs coming from? Also, this behavior seemed to coincide with an increase in FMAs in the OpenMP version. This happened whether or not I compiled the OpenACC code with -Mfma.
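As a sanity check on the floating-point side (assuming 8-byte reals), the triad's per-element work is fixed:

FLOPs per element: 1 multiply + 1 add = 2 (a single FMA)
bytes per element: 2 loads + 1 store at 8 bytes each = 24
arithmetic intensity ≈ 2 / 24 ≈ 0.08 FLOP/byte

so the useful floating-point work should be identical in both versions, and, as far as I understand it, Nsight Compute's compute (SM) throughput reports the utilization of the busiest SM pipeline, which integer/address arithmetic can drive up just as well as FP instructions.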
If any other information is needed, I would be happy to provide it.
Thanks!
Edit 1:
I just wanted to add that I also tested a CUDA version, which showed higher utilization of the L1 cache (in fact, both the OpenACC and OpenMP codes made no use of the L1 cache at all).
I did my best to replicate this using N=1GB and running on an A100.
In my case, the OpenACC kernel is faster than the OpenMP target version (10.35 ms vs 13.24 ms), but I see a similar SOL profile, though with a smaller memory percentage for OpenMP (80% for ACC, 62.5% for OMP). The compute SOL is about 35% for ACC and 77% for OMP. I suspect you're using a V100, or a different size for n, which can change the profile.
In my case, the primary difference is due to the schedule. OpenACC caps the grid size at 64K blocks, while OpenMP is using 8,388,608 blocks; both use 128 threads per block. In other words, OpenMP uses one thread per iteration, while OpenACC has each thread execute multiple loop iterations. This gives the OpenACC version better achieved occupancy (98.61% vs 83.97%) and makes it more bandwidth bound.
If I set num_teams to 65535 to match what's being done in OpenACC, the time reduces to 9.67 ms, so slightly better than the OpenACC version. The compute SOL goes down to 15%, the memory SOL goes up to 85%, and achieved occupancy rises to 97.65%. The denser compute per thread helps here.
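For anyone repeating the experiment, that cap can be applied with the num_teams clause directly on the combined construct; 65535 is the value used above, and thread_limit(128) just pins the 128-thread blocks that were already being chosen:

!$omp target teams distribute parallel do num_teams(65535) thread_limit(128)
do i = 1, n
  a(i) = b(i) + c(i) * alpha
end do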
It does appear we're doing a better job on instruction selection with OpenMP, in particular the address offset computation, which likely accounts for the better performance here. I'm guessing this is because our GPU team has been focused on OpenMP performance and hasn't yet put the new strategies back into the OpenACC code generation.
“I just wanted to add that I also tested a CUDA version, which showed higher utilization of the L1 cache (in fact, both the OpenACC and OpenMP codes made no use of the L1 cache at all).”
This is a streaming benchmark with no data reuse, so I wouldn't expect much, if any, L1 caching. I'm not sure what you're doing in the CUDA version. Maybe “alpha” is getting put in constant memory?
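For comparison, a minimal CUDA Fortran triad might look like the sketch below (my own sketch, with arbitrary sizes and values, not the poster's code). Scalar arguments passed by value, like alpha and n here, travel through the kernel parameter space, which as far as I know is backed by constant memory, so that would fit the guess above:

module triad_kernel
  use cudafor
contains
  attributes(global) subroutine triad(a, b, c, alpha, n)
    implicit none
    real(8) :: a(*), b(*), c(*)
    real(8), value :: alpha    ! passed by value -> lives in the kernel parameter (constant) space
    integer, value :: n
    integer :: i
    ! one thread per element, like the OpenMP schedule discussed above
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = b(i) + c(i) * alpha
  end subroutine triad
end module triad_kernel

program cuda_triad
  use cudafor
  use triad_kernel
  implicit none
  integer, parameter :: n = 134217728      ! same arbitrary size as before
  real(8), device, allocatable :: a_d(:), b_d(:), c_d(:)
  integer :: istat

  allocate(a_d(n), b_d(n), c_d(n))
  b_d = 1.0d0
  c_d = 2.0d0

  call triad<<<(n + 127)/128, 128>>>(a_d, b_d, c_d, 3.0d0, n)
  istat = cudaDeviceSynchronize()

  deallocate(a_d, b_d, c_d)
end program cuda_triad

(Compiled with nvfortran -cuda.)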