Runtime thread creation

I have an optimized piece of code under Windows, using

#pragma acc kernels copyin(A[0:size],B[0:size],D[0:size],E[0:size],F[0:size],G[0:size]) pcopyout(C[0:size])

I build with
pgcc -acc -tp=p7 -m64 -ta=tesla:nollvm,nordc -Minfo=accel -Mlargeaddressaware

When I run on different Windows systems and examine the process under Performance Monitor, I see a different number of threads being created for the process running the GPU code.

On a dual Xeon E5-2660 v2 (Windows 7 x64, Quadro M4000), I see 4 threads created per process running the GPU code.

On a dual Xeon E5520 (Windows 7 x64, Quadro 4000), I see 3 threads per process.

On an i7-5960X (Windows 10 x64, Quadro M4000), I see 7 threads per process.

On the i7 and E5520 systems, I can push the GPU to about 99% utilization with 4 processes (all running the same code). With the E5-2660, this requires 7-8 processes.

So I am trying to determine whether the runtime's thread decisions can be controlled in some way (an OpenACC or CUDA setting) to see if I can get the code to make better use of the GPU on the dual-Xeon systems.

Hi GaryT58,

This doesn't make much sense to me. We do create a helper thread to handle asynchronous data movement, but it doesn't look like you use async, and that would spawn only one thread, not a variable number.

Does the code exhibit the same behavior without OpenACC (i.e. remove “-acc -ta=tesla:nollvm,nordc”)?

Does your code contain any CPU threading via WinThreads or OpenMP?

  • Mat

Hi Mat,

The thread count was the number reported by Resource Monitor under the CPU tab.

I captured the following using Process Explorer:

Code on the i7 under Win10 without ACC enabled:

We see 3 total threads. I do not create these extras, so it's Windows doing that.

Code on the i7 with ACC enabled:

So we see the addition of 3 nvcuda threads and 1 more Windows thread.

Same ACC code running on the dual Xeon under Win7:

We see only 2 nvcuda threads, and none of the extra Windows ones. I am assuming the Windows thread differences are due to Win7 vs. Win10, so I think they can be ignored.

So I am curious about the difference in the number of nvcuda threads. I guess it could be a difference between the Win7 and Win10 NVIDIA drivers.

I was looking in this direction because I was trying to figure out why performance scales so differently between the i7 system and the dual-Xeon system (both have the same M4000 card at PCIe 3.0 x16).

On the i7 system I see:

1 instance: GPU 61-62% - 114 iterations/sec (i/s)
2 instances: GPU 83-84% - 83 i/s per process (166 i/s total)
3 instances: GPU 95-96% - 70 i/s per process (210 i/s total)
4 instances: GPU 99% - 55 i/s per process (220 i/s total)

On the dual Xeon:

4 instances: GPU 64-77% - 30 i/s per process (120 i/s total)
5 instances: GPU 78-91% - 34 i/s per process (155 i/s total)
6 instances: GPU 89-97% - 31 i/s per process (186 i/s total)
7 instances: GPU 97-99% - 25 i/s per process (175 i/s total)

So I seem to be limited to about 30 i/s per process on the Xeon, and I need 6-7 instances to max out the GPU, rather than 4 as on the i7. I will be benchmarking the memory performance differences, as well as other things, to see if I can figure it out.

When I saw the difference in thread counts, I wondered if something different was happening at run time. I now think those differences can be ignored, and I will focus on the platform itself.