CPU load affects CUDA performance, how to optimize?

Dear all,

I am using CUDA for simulation. Since I have 16 AMD cores and one K20 card per node (ORNL Titan), I want to make the best use of both CPU and GPU. So I divide my problem between CPU and GPU, then use OpenMP to create 16 threads: one thread for the CUDA work (memcpy, kernel invocation, ...) and the other 15 for CPU computation.

But I found it takes longer than the GPU-only version.

----1
I measured the time of the GPU-related work: the memcpy takes longer when the OpenMP threads are running (from 484 s to 752 s over 600 iterations). Why?

My decomposition (about 1:10) between CPU and GPU is balanced; the two parts take quite similar time.

----2
To verify this, I kept N cores busy in another process and ran the simulation GPU-only; this also increases my simulation time. Roughly, the more busy cores (up to 15, with 1 left for CUDA), the longer the simulation takes.

Could someone help me? Great thanks!

You may be running into CPU memory bandwidth limits. There are two considerations here:

  1. The aggregate bandwidth to main memory on an Interlagos processor is somewhere between 25 and 50 GB/s, I believe. A single thread doing cudaMemcpy operations to a GPU connected over PCIe Gen2 is no problem, as it needs only ~6 GB/s of bandwidth. But if 15 other threads are also competing for main memory bandwidth, that will likely reduce the bandwidth available to the cudaMemcpy operation, causing it to take longer.

  2. I believe there are 2 NUMA domains on a single Interlagos processor. If memory associated with one NUMA domain is accessed by a thread running in the other domain, that has implications for available bandwidth and latency.

What txbob said. If you don’t already use it, consider using numactl to fine-tune CPU affinity and memory affinity for your threads.

My initial reaction was “get a faster host system”, but seeing now that these are Titan nodes, that’s not a realistic choice, I guess. Unfortunately, Titan’s host platform was already outdated when the GPUs were added, so it is not surprising that host-side bottlenecks are more likely on Titan.

Back-of-napkin arithmetic:

50 GB/s / 16 threads = 3.125 GB/s per thread

This would approximately double the host<->device transfer times, compared to having 6 GB/s or more available to the thread servicing the PCIe Gen2 link.

And the situation gets worse if you are overlapping host->device and device->host transfers: PCIe is a bidirectional link and has approximately double the bandwidth in that mode (so, ~12 GB/s), whereas the main memory bandwidth (at, say, 50 GB/s) is already an aggregate number.

So I think it’s not unreasonable for a fully loaded Interlagos processor to behave the way you are witnessing.

Great thanks to @txbob and @njuffa! Thank you very much for your time!

Yes, I agree with your point about main memory bandwidth. But I don't think we should simply divide by the number of threads. In my second test (----2), I kept some CPU cores busy with nothing but while(1) loops in OpenMP threads, i.e., they do not need to access main memory. That is, only the CUDA-related thread needs to access main memory.

One more test:

Finally, I tried using pinned memory, and the situation improved.

So, based on these considerations, what I still want to understand is my second test (----2).

Great thanks!

I am not familiar with the CPU you are using. You also don’t state how much of a slowdown you are seeing.

I use Intel CPUs with dynamic clock boosting: the more threads I use, the slower the operating clock of the CPU. To run at the maximum supported clock frequency, I must limit the entire CPU to a single active thread. For applications that use considerable amounts of CPU processing in addition to GPU processing (certain Folding@Home tasks are of that nature, for example), the lowered CPU clock negatively affects application run time. On Intel CPUs with AVX2 support, this effect can be exacerbated further, since use of AVX2 may cause the CPU frequency to be lowered even more.

[Later:] Interlagos CPUs do seem to have a dynamic clock frequency adjustment feature based on an online article I found (http://www.xbitlabs.com/news/cpu/display/20110722140436_AMD_Plans_to_Begin_Shipments_of_16_Core_Opteron_Interlagos_Chips_in_August.html). Emphasis mine:

“AMD Opteron 6200-series central processing units (CPUs) code-named Interlagos will have twelve or sixteen cores based on Bulldozer micro-architecture[,] be drop-in compatible with existing G34 multi-socket server platforms and will bring a number of enhancements. In particular, the new microprocessors will sport a new memory controller with higher bandwidth, dynamic overclocking technology and some other improvements.”

Thank you very much @njuffa.

Ohhhh, I see. The dynamic overclocking technology must be the culprit. I will monitor the CPU clock.

Many thanks for your continued help.