Yes, I understand that there are two issues.
Memory of the private array:
Again, the difference in memory usage is due to how you are scheduling the loops.
For OpenMP, here’s the mapping:
!$omp target teams distribute parallel do collapse(2) private(temp2)
“teams” → enter a team region (i.e. a CUDA kernel) where a team maps to a CUDA Block
“distribute” → the loops to distribute across the teams
“parallel” → enter a parallel region where the threads in this region map to CUDA threads
“do” → the loops to distribute across the threads
Adding a “private” clause here means that every CUDA thread will get its own copy of the array. Hence the memory usage is: Num_Blocks X Num_Threads_Per_Block X Size_of_Array
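To make that concrete, here’s a minimal sketch of such a loop nest (the subroutine, the arrays a and b, and the loop bounds are made up for illustration and are not your code; only temp2’s role matches what you described):

subroutine omp_version(a, b, ni, nj, nk)
  integer, intent(in) :: ni, nj, nk
  real, intent(in)  :: a(nk, ni, nj)
  real, intent(out) :: b(ni, nj)
  real :: temp2(nk)
  integer :: i, j
  ! Each (i,j) iteration runs on its own CUDA thread, and "private(temp2)"
  ! gives every thread its own copy, so the total private storage is roughly
  ! Num_Blocks x Num_Threads_Per_Block x size(temp2)
  !$omp target teams distribute parallel do collapse(2) private(temp2)
  do j = 1, nj
    do i = 1, ni
      temp2(:) = a(:, i, j)**2
      b(i, j) = sum(temp2)
    end do
  end do
end subroutine omp_version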
For OpenACC:
!$acc parallel loop gang collapse(2) private(temp2)
“parallel” → Enter a compute region (i.e. a CUDA kernel)
“loop” → Distribute loop iterations across the defined schedule (i.e. either gang, worker, vector, seq, or a combination)
“gang” → Use gangs for the loop distribution (gang maps to a CUDA Block)
Since you don’t specify “vector”, the compiler is free to apply it as needed. In your case, it’s auto-parallelizing the inner array syntax across the vectors, but not the outer loops.
Hence when adding a “private” clause on the gang loop, every CUDA Block will get its own copy of the array. The memory usage is: Num_Blocks X Size_of_Array, which is much smaller than in the OpenMP case.
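The corresponding OpenACC sketch, using the same made-up names as above, would be:

subroutine acc_version(a, b, ni, nj, nk)
  integer, intent(in) :: ni, nj, nk
  real, intent(in)  :: a(nk, ni, nj)
  real, intent(out) :: b(ni, nj)
  real :: temp2(nk)
  integer :: i, j
  ! "private" on the gang loop creates one temp2 per CUDA Block, so the
  ! storage is only Num_Blocks x size(temp2); the inner array syntax is
  ! auto-parallelized by the compiler across the threads of each block.
  !$acc parallel loop gang collapse(2) private(temp2)
  do j = 1, nj
    do i = 1, ni
      temp2(:) = a(:, i, j)**2
      b(i, j) = sum(temp2)
    end do
  end do
end subroutine acc_version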
To reduce the OpenMP memory usage and match OpenACC, you would need to remove the “parallel do” from the outer loop so the array is only privatized across the teams (CUDA Blocks).
Note if you change
!$omp target teams distribute parallel do collapse(2) private(temp2)
to
!$omp target teams loop collapse(2) private(temp2)
The compiler will apply a similar schedule to the one used in OpenACC, i.e. teams applied to the outer loops and parallel applied to the inner array syntax. “loop” gives the compiler more freedom in scheduling, as opposed to “distribute parallel do” where you, the programmer, decide.
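Applied to the same made-up sketch, that variant would look like:

subroutine omp_loop_version(a, b, ni, nj, nk)
  integer, intent(in) :: ni, nj, nk
  real, intent(in)  :: a(nk, ni, nj)
  real, intent(out) :: b(ni, nj)
  real :: temp2(nk)
  integer :: i, j
  ! With "teams loop" the compiler is free to put the collapsed outer loops
  ! on the teams (CUDA Blocks) and parallelize the inner array syntax across
  ! the threads, so temp2 is privatized per team, as in the OpenACC version.
  !$omp target teams loop collapse(2) private(temp2)
  do j = 1, nj
    do i = 1, ni
      temp2(:) = a(:, i, j)**2
      b(i, j) = sum(temp2)
    end do
  end do
end subroutine omp_loop_version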
I am aware of the alternative “target teams loop”. My understanding is that this is just a copy of the OpenACC clause for OpenMP, and it is only implemented in NVHPC but not part of the OpenMP standard.
The “loop” directive is part of the OpenMP 5.0 standard: loop Construct
While NVHPC is ahead in its implementation of “loop”, other compilers have added or are adding support as well. My contacts on other compiler teams see the benefits of “loop”, especially for offloading.
Performance:
The performance issue I mentioned in my second post has nothing to do with the large array reduction. I have tested the code by removing all the lines related to temp2, and still got the same results as described in post 2.
Ok, I was only testing with “temp2” included, and yes, I see that using an outer “gang vector” loop is faster here than using gang on the outer loops with vector on an inner loop. I profiled the code using Nsight Compute and see that the reduction operation does have some impact, but the larger impact is due to memory and the array striding. Note that an inner vector loop typically helps; it’s just that in this particular case, it does not.
With “temp2” included, the compiler is vectorizing the array syntax so only gang is applied to the outer loops. In this case, also applying vector to the inner “k” loop helps by 2x. The question to me then is, what’s the performance impact of using “gang vector” on the outer collapsed loops with the caveat that the memory usage increases due to the extra private copies of temp2? In this case, it appears to be about 20% slower.
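For reference, here’s a rough sketch of the two schedules being compared, again with made-up names and a simplified loop body rather than your actual code:

subroutine acc_schedules(a, b1, b2, ni, nj, nk)
  integer, intent(in) :: ni, nj, nk
  real, intent(in)  :: a(nk, ni, nj)
  real, intent(out) :: b1(ni, nj), b2(ni, nj)
  real :: temp2(nk)
  integer :: i, j, k

  ! Schedule 1: gang on the collapsed outer loops, vector on the inner k loop.
  ! temp2 is private per gang, so memory stays at Num_Blocks x size(temp2).
  !$acc parallel loop gang collapse(2) private(temp2)
  do j = 1, nj
    do i = 1, ni
      !$acc loop vector
      do k = 1, nk
        temp2(k) = a(k, i, j)**2
      end do
      b1(i, j) = sum(temp2)
    end do
  end do

  ! Schedule 2: gang vector on the collapsed outer loops.
  ! Each thread now gets its own temp2, so memory grows to
  ! Num_Blocks x Num_Threads_Per_Block x size(temp2).
  !$acc parallel loop gang vector collapse(2) private(temp2)
  do j = 1, nj
    do i = 1, ni
      do k = 1, nk
        temp2(k) = a(k, i, j)**2
      end do
      b2(i, j) = sum(temp2)
    end do
  end do
end subroutine acc_schedules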