GTX970 vs GTX465 comparison on the same PC (CUDA 7.0)

I have two CUDA-capable cards, a GTX970 and a GTX465, both in PCI-E x16 2.0 slots, and I am getting strange results for the same code run on the different GPUs (selected via cudaSetDevice(devId)) on the same PC.

Geforce GTX970:

CUDA_Avg 377.785533 ms
==4840== Profiling application: cudaTests.exe 10 0
==4840== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 69.14%  3.8002ms        11  345.47us  273.11us  1.0596ms  [CUDA memcpy HtoD]
 17.60%  967.24us         1  967.24us  967.24us  967.24us  [CUDA memcpy DtoH]
 13.27%  729.14us        10  72.914us  72.191us  73.918us  vecAvg(unsigned char*, float*, unsigned int, float)

==4840== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 95.74%  365.19ms         1  365.19ms  365.19ms  365.19ms  cudaMallocHost
  2.69%  10.254ms        12  854.52us  348.57us  1.4642ms  cudaMemcpy
  0.61%  2.3342ms         3  778.05us  470.58us  1.3798ms  cudaGetDeviceProperties
  0.37%  1.4015ms       166  8.4420us       0ns  460.34us  cuDeviceGetAttribute
  0.33%  1.2676ms         2  633.77us  523.06us  744.49us  cudaMalloc
  0.12%  457.36us         3  152.45us  4.2670us  229.11us  cudaFree
  0.06%  242.76us        10  24.275us  23.038us  33.705us  cudaLaunch
  0.06%  216.31us         2  108.15us  101.54us  114.77us  cuDeviceGetName
  0.01%  23.891us         1  23.891us  23.891us  23.891us  cudaSetDevice
  0.00%  13.226us         2  6.6130us  5.9730us  7.2530us  cuDeviceTotalMem
  0.00%  11.947us        40     298ns       0ns     854ns  cudaSetupArgument
  0.00%  8.1050us        10     810ns     426ns  3.4130us  cudaConfigureCall
  0.00%  3.4130us         2  1.7060us  1.7060us  1.7070us  cudaDeviceGetAttribute
  0.00%  2.9860us         1  2.9860us  2.9860us  2.9860us  cudaGetDeviceCount
  0.00%  1.2810us         4     320ns       0ns     427ns  cuDeviceGet
  0.00%  1.2800us         2     640ns       0ns  1.2800us  cuDeviceGetCount

Geforce GTX465:

CUDA_Avg 170.713206 ms
==4464== Profiling application: cudaTests.exe 10 1
==4464== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 53.83%  3.0110ms        11  273.73us  209.44us  877.19us  [CUDA memcpy HtoD]
 32.83%  1.8364ms        10  183.64us  183.07us  184.68us  vecAvg(unsigned char*, float*, unsigned int, float)
 13.34%  745.96us         1  745.96us  745.96us  745.96us  [CUDA memcpy DtoH]

==4464== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 89.46%  156.68ms         1  156.68ms  156.68ms  156.68ms  cudaMallocHost
  6.37%  11.162ms        12  930.18us  337.05us  1.2953ms  cudaMemcpy
  1.61%  2.8167ms         3  938.89us  469.31us  1.2902ms  cudaGetDeviceProperties
  1.02%  1.7804ms         2  890.19us  666.41us  1.1140ms  cudaMalloc
  0.88%  1.5406ms       166  9.2800us       0ns  514.10us  cuDeviceGetAttribute
  0.20%  352.41us         3  117.47us  3.4130us  191.99us  cudaFree
  0.18%  308.03us         1  308.03us  308.03us  308.03us  cudaSetDevice
  0.14%  240.62us        10  24.062us  18.772us  45.650us  cudaLaunch
  0.12%  212.47us         2  106.23us  99.834us  112.63us  cuDeviceGetName
  0.01%  13.226us        40     330ns       0ns  1.2800us  cudaSetupArgument
  0.01%  12.373us         2  6.1860us  5.9730us  6.4000us  cuDeviceTotalMem
  0.01%  11.092us        10  1.1090us     426ns  5.5470us  cudaConfigureCall
  0.00%  4.6930us         1  4.6930us  4.6930us  4.6930us  cudaGetDeviceCount
  0.00%  3.4130us         2  1.7060us  1.7060us  1.7070us  cudaDeviceGetAttribute
  0.00%  1.7080us         4     427ns     427ns     427ns  cuDeviceGet
  0.00%     853ns         2     426ns       0ns     853ns  cuDeviceGetCount

As you can see, the GTX970 runs the kernel about 2.5x faster than the old GTX465 (72.9us vs. 183.6us average), but memory allocation (granted, the first allocation call usually takes long, since it absorbs CUDA context creation) and the HtoD and DtoH memory transfers are much slower on the GTX970. Does anyone have any idea why? Thank you in advance.

PS: Each memory copy is 1280\*960\*sizeof(float) bytes.
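For reference, here is a back-of-the-envelope sketch (not a measurement) of the per-copy size and the transfer time one would expect over this link, assuming sizeof(float) is 4 and that bandwidthTest reports MB/s as 10^6 bytes/s:

```python
# Back-of-the-envelope: size of one 1280x960 float copy and the
# transfer time implied by the measured pinned H2D bandwidth.
SIZEOF_FLOAT = 4                      # bytes (assumption)
bytes_per_copy = 1280 * 960 * SIZEOF_FLOAT
print(f"bytes per copy: {bytes_per_copy:,}")       # 4,915,200 bytes (~4.7 MiB)

h2d_mb_per_s = 6033.9                 # GTX 970 bandwidthTest H2D result
expected_ms = bytes_per_copy / (h2d_mb_per_s * 1e6) * 1e3
print(f"expected H2D time: {expected_ms:.3f} ms")  # roughly 0.81 ms per copy
```

That figure is in the same ballpark as the profiled DtoH copy on the GTX970 (967us), while the HtoD averages (345us/274us) come out lower, so the HtoD copies may not all be of this size.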

GPUs:

CUDA device [GeForce GTX 970] has
        13 Multi-Processors
        Compute 5.2
        MP cnt: 13
        SM: 13
        Concurrent Kernels:1
        AsyncEngineCount:2
CUDA device [GeForce GTX 465] has
        11 Multi-Processors
        Compute 2.0
        MP cnt: 11
        SM: 11
        Concurrent Kernels:1
        AsyncEngineCount:1

Are you running a controlled experiment, where you swap just the GPU in the system, and otherwise use exactly the same hardware and software? If not, I would strongly recommend running such a controlled experiment first, to avoid chasing a red herring.

In case these two GPUs are in the same system simultaneously, are both PCIe slots in fact configured as x16? Also for this scenario, what is the CPU in the system? Does it provide enough PCIe lanes to drive both GPUs at x16? This requires a CPU with at least 32 PCIe lanes, obviously.

Is this a multi-socket system by any chance? If so, make sure to use CPU and memory affinity settings such that each GPU “talks” to the “near” CPU and memory.

What is the D2H, H2D throughput reported for the two GPUs by the CUDA sample app bandwidthTest?

The system is based on a dual Intel Xeon E5620 (2.40 GHz) with 56 GB of triple-channel DDR3 on a Supermicro X8DAH motherboard.
I switch GPUs in software, running the app with a different parameter and selecting the GPU via

cudaSetDevice( id )

so each GPU does its work separately in time. All GPUs are installed in PCI-E x16 slots (double-checked); I also tried swapping the GPUs between slots, with the same results.

bandwidthTest.exe for GeForce GTX 970:

w:\>bandwidthTest.exe --device 0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 970
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6033.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6516.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     137663.1

Result = PASS

bandwidthTest.exe for GeForce GTX 465:

w:\>bandwidthTest.exe --device 1
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 1: GeForce GTX 465
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5496.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5957.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     87572.6

Result = PASS

Thank you.

In a multi-socket NUMA system, PCIe bandwidth may be affected by process/socket/GPU affinity. Unless you use NUMA control tools such as taskset or numactl to regularize behavior, the results can change from run to run or app to app, depending on which socket hosts the logical core the OS schedules your process on.
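On Linux, the same pinning can also be done programmatically; below is a minimal sketch using Python's os.sched_setaffinity. The mapping of logical CPUs to sockets assumed here (socket 0 = CPUs 0..3 for a 4-core E5620 with HT off) is hypothetical and must be verified with lscpu or numactl --hardware; on Windows, start /affinity or the SetProcessAffinityMask API serve the same purpose.

```python
import os

def socket_cpus(node, cores_per_socket):
    """Logical CPU IDs for one socket, ASSUMING contiguous numbering
    (socket 0 = CPUs 0..N-1); verify the real topology with lscpu."""
    start = node * cores_per_socket
    return set(range(start, start + cores_per_socket))

# Pin this process to socket 0 so it talks to the "near" GPU and memory
# (sched_setaffinity is Linux-only, hence the guard).
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, socket_cpus(0, cores_per_socket=4))
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))
```

Equivalently, from the shell, `numactl --cpunodebind=0 --membind=0 ./app` pins both CPU and memory allocation to node 0.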

The results from the bandwidthTest app look as expected, indicating PCIe gen2 x16 performance for both cards. As both txbob and I have pointed out, make sure to use proper CPU and memory affinity settings to get consistently good performance, since multi-socket systems exhibit NUMA behavior.
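As a quick sanity check on that claim (a sketch using the standard PCIe gen2 figures: 5 GT/s per lane with 8b/10b encoding, i.e. 500 MB/s usable per lane), the measured pinned-memory numbers for both cards fall at a plausible fraction of the gen2 x16 ceiling:

```python
# PCIe gen2: 5 GT/s per lane, 8b/10b encoding -> 500 MB/s usable per lane.
lanes = 16
theoretical_mb_s = 500 * lanes        # 8000 MB/s for a gen2 x16 link

measured = {"GTX 970 H2D": 6033.9, "GTX 970 D2H": 6516.4,
            "GTX 465 H2D": 5496.8, "GTX 465 D2H": 5957.2}
for name, mb_s in measured.items():
    # Roughly 65-85% of theoretical is typical once protocol overhead
    # is accounted for, so all four numbers look healthy.
    print(f"{name}: {mb_s:.1f} MB/s = "
          f"{mb_s / theoretical_mb_s:.0%} of gen2 x16 peak")
```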