K10 Host <--> Device bandwidth seems low


Using the bandwidth test in the SDK, our K10s are showing less than half the host-to/from-device transfer rates of the 670 in my desktop. The numbers vary because of other processes on the shared server, but I've never seen anything higher than about 3 GB/s, whereas the 670 clocks in at about 8 GB/s consistently (and I control what's running on that system).

The server has 32 hyper-threaded Intel® Xeon® E5-4650 cores @ 2.70 GHz, while the desktop has a 4-core hyper-threaded Intel® Core™ i7-3770K @ 3.50 GHz. Both have a PCIe 2.0 bus with (I think) x16 slots, and 1600 MHz RAM.

My question is: are these the kind of transfer rates we should expect from the K10, or do we have some sort of configuration problem on the server? BIOS upgrades were done yesterday (for an unrelated reason) with no effect.

Also, the server has two cards on two buses controlled by different CPU chips, and both show the same numbers.

Our computations are of course very dependent on host to device bandwidth.

If this is normal, fine.


I assume you are measuring PCIe transfer rates with pinned host memory? I am not familiar with this specific setup (I have never used a K10), but in general for NUMA server configurations to achieve optimal transfer speeds you would want to carefully control both CPU and memory affinity to avoid transfers to the “far” CPU or memory. Under Linux, one relevant tool is numactl.
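To make the pinned-vs-pageable distinction concrete, here is a minimal sketch along the lines of the SDK's bandwidthTest. It is an illustration, not the SDK tool itself; the 64 MiB buffer size and 20-iteration count are arbitrary choices, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;  /* 64 MiB test buffer (arbitrary) */
    const int    iters = 20;

    void *d_buf, *h_pageable, *h_pinned;
    cudaMalloc(&d_buf, bytes);
    h_pageable = malloc(bytes);                              /* ordinary pageable memory */
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);   /* page-locked (pinned) memory */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const void *src[2]  = { h_pageable, h_pinned };
    const char *name[2] = { "pageable", "pinned" };
    for (int s = 0; s < 2; ++s) {
        cudaEventRecord(start);
        for (int i = 0; i < iters; ++i)
            cudaMemcpy(d_buf, src[s], bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%-8s H->D: %.2f GB/s\n", name[s],
               (double)bytes * iters / (ms * 1e-3) / 1e9);
    }

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}
```

On a NUMA server, running something like this (or bandwidthTest itself) under `numactl --cpunodebind=N --membind=N`, with N chosen as the node closest to the GPU's PCIe root, should show whether far-node memory traffic is what's limiting the rate.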

Thanks njuffa

I have done experiments with CPU affinity, with only minor effects. I have managed to cajole the other users off this system long enough to run my tests.

But that is a good one to put on the list for when I convince the boss to take the system off the network and swap the K10 and the 670, so we can pinpoint whether it's the other tasks, the GPU, the system, or some combination that limits these transfers.

Also, for the record, both systems are running Scientific Linux 6.3 (a Red Hat derivative).


Red Hat provides some general NUMA tuning information in their performance tuning guide that you may find useful as background reading.

Thanks again njuffa, I haven’t seen that yet.

Almost everyone involved is on their way to a conference on the east coast, we’re on the west. I’ll report back if we find something that works.


Hi Joe,
I just joined this forum today, but did run into and solved a similar performance issue.
I'm using 4 K10 boards, and on a dual 8-core Intel PC I use a separate host thread for each of the 8 GPUs, doing transfers simultaneously. I experienced speed fluctuations, alternating randomly between 450 MB/s and 1.1 GB/s on each of the 8 threads. That stopped once I did the following, after noticing in the performance monitor that the fast speed occurred only when the 8 assigned CPUs were not consecutive (on screen).

I call SetProcessAffinityMask() with a carefully built mask; in this case, for 8 threads, the computed mask is 0x5555 to select the right, i.e. fast, set of CPUs. This always works, and the DMA rates are now consistently the 1 GB/s ones. I'm not sure, but I think this selects 8 cores on the same Intel chip, rather than some being on the second chip.