I have an optimized piece of code under Windows, using
#pragma acc kernels copyin(A[0:size],B[0:size],D[0:size],E[0:size],F[0:size],G[0:size]) pcopyout(C[0:size])
I build with
pgcc -acc -tp=p7 -m64 -ta=tesla:nollvm,nordc -Minfo=accel -Mlargeaddressaware
When I run on different Windows systems and watch the process in Performance Monitor, I see a different number of threads being created for the process that runs the GPU code.
On a dual Xeon E5-2660 v2 (Windows 7 x64, Quadro M4000), it creates 4 threads per process running the GPU code.
On a dual Xeon E5520 (Windows 7 x64, Quadro 4000), it creates 3 threads per process.
On an i7-5960X (Windows 10 x64, Quadro M4000), it creates 7 threads per process.
On the i7 and E5520 systems, I can push the GPU to about 99% utilization with 4 processes (all running the same code). On the E5-2660 system, reaching that utilization takes 7 to 8 processes.
So I am trying to determine whether the runtime's threading decisions can be controlled in some way - through an OpenACC or CUDA setting - so that the code makes better use of the GPU on the dual-Xeon systems.