I have two CUDA-capable cards, a GTX 970 and a GTX 465, both in PCI-E x16 2.0 slots, and I get strange results for the same code run on the two GPUs (preselected by cudaSetDevice(devId)) on the same PC.
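The test is structured roughly like this (a sketch reconstructed from the profile output below; the vecAvg body, the launch configuration and the buffer types are assumptions, only the call sequence matters here):

#include <cstdlib>
#include <cuda_runtime.h>

// Assumed kernel body: accumulate a weighted average of the input frame.
__global__ void vecAvg(unsigned char *in, float *out, unsigned int n, float w)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] += w * in[i];
}

int main(int argc, char **argv)
{
    const int runs  = atoi(argv[1]);    // 10 in the runs below
    const int devId = atoi(argv[2]);    // 0 = GTX 970, 1 = GTX 465
    const unsigned int n = 1280 * 960;  // see the PS below

    cudaSetDevice(devId);

    unsigned char *hIn; float *hOut;
    cudaMallocHost((void**)&hIn,  n * sizeof(unsigned char));  // pinned host memory
    cudaMallocHost((void**)&hOut, n * sizeof(float));

    unsigned char *dIn; float *dOut;
    cudaMalloc((void**)&dIn,  n * sizeof(unsigned char));
    cudaMalloc((void**)&dOut, n * sizeof(float));

    for (int r = 0; r < runs; ++r) {
        cudaMemcpy(dIn, hIn, n * sizeof(unsigned char), cudaMemcpyHostToDevice);
        vecAvg<<<(n + 255) / 256, 256>>>(dIn, dOut, n, 1.0f / runs);
    }
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return 0;
}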
GeForce GTX 970:
CUDA_Avg 377.785533 ms
==4840== Profiling application: cudaTests.exe 10 0
==4840== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.14% 3.8002ms 11 345.47us 273.11us 1.0596ms [CUDA memcpy HtoD]
17.60% 967.24us 1 967.24us 967.24us 967.24us [CUDA memcpy DtoH]
13.27% 729.14us 10 72.914us 72.191us 73.918us vecAvg(unsigned char*, float*, unsigned int, float)
==4840== API calls:
Time(%) Time Calls Avg Min Max Name
95.74% 365.19ms 1 365.19ms 365.19ms 365.19ms cudaMallocHost
2.69% 10.254ms 12 854.52us 348.57us 1.4642ms cudaMemcpy
0.61% 2.3342ms 3 778.05us 470.58us 1.3798ms cudaGetDeviceProperties
0.37% 1.4015ms 166 8.4420us 0ns 460.34us cuDeviceGetAttribute
0.33% 1.2676ms 2 633.77us 523.06us 744.49us cudaMalloc
0.12% 457.36us 3 152.45us 4.2670us 229.11us cudaFree
0.06% 242.76us 10 24.275us 23.038us 33.705us cudaLaunch
0.06% 216.31us 2 108.15us 101.54us 114.77us cuDeviceGetName
0.01% 23.891us 1 23.891us 23.891us 23.891us cudaSetDevice
0.00% 13.226us 2 6.6130us 5.9730us 7.2530us cuDeviceTotalMem
0.00% 11.947us 40 298ns 0ns 854ns cudaSetupArgument
0.00% 8.1050us 10 810ns 426ns 3.4130us cudaConfigureCall
0.00% 3.4130us 2 1.7060us 1.7060us 1.7070us cudaDeviceGetAttribute
0.00% 2.9860us 1 2.9860us 2.9860us 2.9860us cudaGetDeviceCount
0.00% 1.2810us 4 320ns 0ns 427ns cuDeviceGet
0.00% 1.2800us 2 640ns 0ns 1.2800us cuDeviceGetCount
GeForce GTX 465:
CUDA_Avg 170.713206 ms
==4464== Profiling application: cudaTests.exe 10 1
==4464== Profiling result:
Time(%) Time Calls Avg Min Max Name
53.83% 3.0110ms 11 273.73us 209.44us 877.19us [CUDA memcpy HtoD]
32.83% 1.8364ms 10 183.64us 183.07us 184.68us vecAvg(unsigned char*, float*, unsigned int, float)
13.34% 745.96us 1 745.96us 745.96us 745.96us [CUDA memcpy DtoH]
==4464== API calls:
Time(%) Time Calls Avg Min Max Name
89.46% 156.68ms 1 156.68ms 156.68ms 156.68ms cudaMallocHost
6.37% 11.162ms 12 930.18us 337.05us 1.2953ms cudaMemcpy
1.61% 2.8167ms 3 938.89us 469.31us 1.2902ms cudaGetDeviceProperties
1.02% 1.7804ms 2 890.19us 666.41us 1.1140ms cudaMalloc
0.88% 1.5406ms 166 9.2800us 0ns 514.10us cuDeviceGetAttribute
0.20% 352.41us 3 117.47us 3.4130us 191.99us cudaFree
0.18% 308.03us 1 308.03us 308.03us 308.03us cudaSetDevice
0.14% 240.62us 10 24.062us 18.772us 45.650us cudaLaunch
0.12% 212.47us 2 106.23us 99.834us 112.63us cuDeviceGetName
0.01% 13.226us 40 330ns 0ns 1.2800us cudaSetupArgument
0.01% 12.373us 2 6.1860us 5.9730us 6.4000us cuDeviceTotalMem
0.01% 11.092us 10 1.1090us 426ns 5.5470us cudaConfigureCall
0.00% 4.6930us 1 4.6930us 4.6930us 4.6930us cudaGetDeviceCount
0.00% 3.4130us 2 1.7060us 1.7060us 1.7070us cudaDeviceGetAttribute
0.00% 1.7080us 4 427ns 427ns 427ns cuDeviceGet
0.00% 853ns 2 426ns 0ns 853ns cuDeviceGetCount
As you can see, the GTX 970 runs the kernel about 2.5x faster than the old GTX 465 (72.9 µs vs 183.6 µs average), but memory allocation (OK, the first allocation usually takes long since it absorbs context initialization) and the HtoD/DtoH memory transfers are noticeably slower on the GTX 970. Does anyone have any idea why? Thank you in advance.
PS: Each memory copy is 1280*960*sizeof(float) bytes.
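If it helps to narrow it down, one way to time the transfers on each card in isolation (away from the kernel and the allocation cost) is with CUDA events over pinned memory. A minimal sketch, not the original test code; the 100-iteration loop and the device index are arbitrary, the buffer size is taken from the PS above:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1280 * 960 * sizeof(float);
    float *h, *d;
    cudaSetDevice(0);                      // 0 = GTX 970, 1 = GTX 465
    cudaMallocHost((void**)&h, bytes);     // pinned, as in the test
    cudaMalloc((void**)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)          // repeat to average out launch noise
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("HtoD: %.2f GB/s\n", 100.0 * bytes / (ms * 1e6));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}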
GPUs:
CUDA device [GeForce GTX 970] has
13 Multi-Processors
Compute 5.2
MP cnt: 13
SM: 13
Concurrent Kernels:1
AsyncEngineCount:2
CUDA device [GeForce GTX 465] has
11 Multi-Processors
Compute 2.0
MP cnt: 11
SM: 11
Concurrent Kernels:1
AsyncEngineCount:1
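The device info above is printed from cudaGetDeviceProperties, roughly like this (a sketch; the exact set of fields printed is an assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("CUDA device [%s] has\n %d Multi-Processors\n Compute %d.%d\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor);
        printf(" Concurrent Kernels:%d\n AsyncEngineCount:%d\n",
               prop.concurrentKernels, prop.asyncEngineCount);
    }
    return 0;
}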