I have used this function a lot on my laptop (GeForce 940MX) without any issues. After migrating the program to a computer with a TITAN X (Pascal), I see a performance degradation. While the actual kernel of course runs much faster, a lot of time seems to be consumed in the function cudaGetDeviceProperties. Here are the profiling outputs for the 940MX and the TITAN X running a small test script (performing 1000 5x5 float convolutions on 320x240x2 images). First, the 940MX:
==9289== Profiling result:
Time(%) Time Calls Avg Min Max Name
54.48% 213.80ms 2004 106.68us 1.0240us 318.68us [CUDA memcpy HtoD]
35.31% 138.55ms 1001 138.41us 4.9280us 141.50us void ForEachPixelNaive<float, int=2, Filter32fReplicateBorderSharedFunctor<float, float, int=2, int=2, float, FilterSharedBlockManager<float, int=2>>>(Image<float, int=2>, NppiSize, float)
10.20% 40.040ms 1002 39.959us 1.5680us 44.255us fill
0.00% 11.616us 3 3.8720us 2.4000us 4.6400us [CUDA memcpy DtoH]
0.00% 4.6720us 1 4.6720us 4.6720us 4.6720us void ForEachPixelNaive<float, int=1, Filter32fReplicateBorderSharedFunctor<float, float, int=1, int=1, float, FilterSharedBlockManager<float, int=1>>>(Image<float, int=1>, NppiSize, float)
==9289== API calls:
Time(%) Time Calls Avg Min Max Name
22.78% 505.49ms 3006 168.16us 6.9500us 507.39us cuMemFree
22.40% 497.09ms 3006 165.36us 6.3920us 574.40us cuMemAlloc
20.25% 449.39ms 1002 448.50us 10.254us 436.70ms cudaLaunch
15.91% 353.14ms 1002 352.43us 321.30us 700.55us cudaGetDeviceProperties
9.86% 218.81ms 1 218.81ms 218.81ms 218.81ms cuCtxDetach
5.10% 113.19ms 1 113.19ms 113.19ms 113.19ms cuCtxCreate
2.66% 59.082ms 2004 29.481us 8.2960us 275.36us cuMemcpyHtoD
0.62% 13.858ms 1002 13.830us 10.070us 50.107us cuLaunchKernel
0.10% 2.3026ms 2004 1.1490us 460ns 345.86us cudaGetDevice
0.05% 1.1575ms 3010 384ns 227ns 13.820us cuCtxGetDevice
0.05% 1.1356ms 2004 566ns 317ns 4.1880us cudaDeviceGetAttribute
0.04% 843.63us 2004 420ns 134ns 15.811us cudaGetDeviceCount
0.04% 821.32us 203 4.0450us 168ns 182.99us cuDeviceGetAttribute
0.03% 766.64us 1002 765ns 376ns 14.817us cudaConfigureCall
0.03% 670.23us 1002 668ns 471ns 3.9990us cuFuncSetBlockShape
0.03% 626.08us 3006 208ns 130ns 5.3460us cudaSetupArgument
0.01% 264.48us 1002 263ns 196ns 13.561us cudaGetLastError
0.01% 231.46us 2 115.73us 108.50us 122.96us cuDeviceTotalMem
0.01% 153.49us 1 153.49us 153.49us 153.49us cuModuleLoadDataEx
0.00% 106.74us 2 53.372us 41.294us 65.450us cuDeviceGetName
0.00% 70.387us 2 35.193us 27.081us 43.306us cuMemcpy2D
0.00% 55.764us 1 55.764us 55.764us 55.764us cuModuleUnload
0.00% 25.024us 1 25.024us 25.024us 25.024us cuMemcpyDtoH
0.00% 6.7230us 5 1.3440us 296ns 4.0430us cuDeviceGetCount
0.00% 6.6780us 8 834ns 231ns 2.7970us cuCtxPushCurrent
0.00% 4.9090us 5 981ns 240ns 1.7660us cuDeviceGet
0.00% 3.7380us 8 467ns 204ns 1.0060us cuCtxPopCurrent
0.00% 3.4850us 16 217ns 157ns 440ns cuDeviceComputeCapability
0.00% 1.1870us 2 593ns 440ns 747ns cuInit
0.00% 806ns 1 806ns 806ns 806ns cuModuleGetFunction
0.00% 729ns 2 364ns 255ns 474ns cuDriverGetVersion
The same code run on the TITAN X:
==26042== Profiling result:
Time(%) Time Calls Avg Min Max Name
50.39% 54.609ms 2004 27.250us 960ns 65.249us [CUDA memcpy HtoD]
47.22% 51.173ms 1001 51.122us 2.6890us 51.617us void ForEachPixelNaive<float, int=2, Filter32fReplicateBorderSharedFunctor<float, float, int=2, int=2, float, FilterSharedBlockManager<float, int=2>>>(Image<float, int=2>, NppiSize, float)
2.38% 2.5739ms 1002 2.5680us 1.4400us 2.7840us fill
0.00% 4.2560us 3 1.4180us 1.2800us 1.5680us [CUDA memcpy DtoH]
0.00% 2.8480us 1 2.8480us 2.8480us 2.8480us void ForEachPixelNaive<float, int=1, Filter32fReplicateBorderSharedFunctor<float, float, int=1, int=1, float, FilterSharedBlockManager<float, int=1>>>(Image<float, int=1>, NppiSize, float)
==26042== API calls:
Time(%) Time Calls Avg Min Max Name
45.03% 1.37474s 1002 1.3720ms 1.2603ms 9.5501ms cudaGetDeviceProperties
22.79% 695.85ms 1002 694.46us 16.596us 677.55ms cudaLaunch
12.72% 388.27ms 1 388.27ms 388.27ms 388.27ms cuCtxDetach
12.67% 386.72ms 1 386.72ms 386.72ms 386.72ms cuCtxCreate
2.44% 74.558ms 3006 24.802us 5.5920us 559.68us cuMemFree
2.38% 72.511ms 2004 36.183us 11.000us 164.77us cuMemcpyHtoD
0.98% 29.948ms 3006 9.9620us 2.9610us 1.2263ms cuMemAlloc
0.55% 16.643ms 1002 16.610us 15.356us 62.671us cuLaunchKernel
0.10% 3.1094ms 202 15.392us 218ns 695.59us cuDeviceGetAttribute
0.06% 1.9541ms 2004 975ns 455ns 451.56us cudaGetDevice
0.06% 1.7512ms 2 875.58us 857.32us 893.84us cuDeviceTotalMem
0.04% 1.1267ms 2004 562ns 281ns 15.781us cudaDeviceGetAttribute
0.03% 914.36us 3006 304ns 144ns 14.809us cudaSetupArgument
0.03% 905.53us 2004 451ns 130ns 11.326us cudaGetDeviceCount
0.03% 890.14us 3010 295ns 168ns 8.1980us cuCtxGetDevice
0.02% 758.98us 1002 757ns 572ns 8.9910us cudaConfigureCall
0.02% 735.69us 1002 734ns 446ns 10.691us cuFuncSetBlockShape
0.01% 437.48us 2 218.74us 145.12us 292.36us cuDeviceGetName
0.01% 342.20us 1002 341ns 278ns 4.8230us cudaGetLastError
0.01% 316.49us 1 316.49us 316.49us 316.49us cuModuleLoadDataEx
0.00% 146.12us 1 146.12us 146.12us 146.12us cuModuleUnload
0.00% 56.660us 2 28.330us 27.471us 29.189us cuMemcpy2D
0.00% 42.767us 1 42.767us 42.767us 42.767us cuMemcpyDtoH
0.00% 5.6830us 16 355ns 195ns 840ns cuDeviceComputeCapability
0.00% 5.0090us 8 626ns 208ns 2.4440us cuCtxPushCurrent
0.00% 4.1750us 4 1.0430us 426ns 2.3830us cuDeviceGetCount
0.00% 3.7850us 8 473ns 194ns 1.2910us cuCtxPopCurrent
0.00% 2.7780us 4 694ns 440ns 1.2050us cuDeviceGet
0.00% 2.2500us 2 1.1250us 672ns 1.5780us cuInit
0.00% 1.9440us 1 1.9440us 1.9440us 1.9440us cuModuleGetFunction
0.00% 1.6990us 2 849ns 488ns 1.2110us cuDriverGetVersion
As you can see, on the TITAN X the program spends about 1.37 s in cudaGetDeviceProperties (1002 calls at about 1.37 ms each), compared with about 353 ms (about 352 us per call) on the 940MX. Interestingly, the function is called this often only for the 2-channel convolution; the single-channel convolution does not call it nearly as often. Here are the configurations of the two systems:
940MX laptop, CUDA Version: 8.0.44
==============NVSMI LOG==============
Timestamp : Fri Jun 9 10:46:11 2017
Driver Version : 375.39
Attached GPUs : 1
GPU 0000:02:00.0
Product Name : GeForce 940MX
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-2b43952c-5da6-8fd7-2ee8-3abd491f7c0b
Minor Number : 0
VBIOS Version : 82.08.57.00.22
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x134D10DE
Bus Id : 0000:02:00.0
Sub System Id : 0x505017AA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 4x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 2002 MiB
Used : 348 MiB
Free : 1654 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 1 MiB
Free : 255 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : N/A
Decoder : N/A
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 37 C
GPU Shutdown Temp : 99 C
GPU Slowdown Temp : 94 C
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 405 MHz
Video : 405 MHz
Applications Clocks
Graphics : 1124 MHz
Memory : 1001 MHz
Default Applications Clocks
Graphics : 1122 MHz
Memory : 1001 MHz
Max Clocks
Graphics : 1241 MHz
SM : 1241 MHz
Memory : 1001 MHz
Video : 1216 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 8842
Type : C
Name : /home/wiedemac/python3/bin/python3
Used GPU Memory : 17 MiB
Process ID : 19297
Type : C
Name : /home/wiedemac/python3/bin/python
Used GPU Memory : 16 MiB
Process ID : 23338
Type : C
Name : /home/wiedemac/python3/bin/python
Used GPU Memory : 309 MiB
TITAN X machine, CUDA Version: 8.0.61
==============NVSMI LOG==============
Timestamp : Fri Jun 9 08:47:03 2017
Driver Version : 378.13
Attached GPUs : 1
GPU 0000:88:00.0
Product Name : TITAN X (Pascal)
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324416137595
GPU UUID : GPU-e4f13fab-5ba5-e5a4-436b-81bbd8579018
Minor Number : 3
VBIOS Version : 86.02.15.00.01
MultiGPU Board : No
Board ID : 0x8800
GPU Part Number : 900-1G611-2500-000
Inforom Version
Image Version : G001.0000.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x88
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0010DE
Bus Id : 0000:88:00.0
Sub System Id : 0x119A10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 23 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 12189 MiB
Used : 0 MiB
Free : 12189 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 24 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
Power Readings
Power Management : Supported
Power Draw : 16.17 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 40 MHz
Applications Clocks
Graphics : 1417 MHz
Memory : 5005 MHz
Default Applications Clocks
Graphics : 1417 MHz
Memory : 5005 MHz
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5005 MHz
Video : 1708 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
Is there any way to avoid this per-call overhead in cudaGetDeviceProperties?
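One workaround that might apply, if the calling code is under my control, is to query the device properties once per device and reuse the result for every launch. The following is only a minimal host-side sketch of that caching pattern, not the library's actual code: `DeviceInfo`, `DeviceInfoCache` and `countQueriesForLaunches` are hypothetical names, and the lambda stands in for the real `cudaGetDeviceProperties` call so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical subset of cudaDeviceProp that a launch helper might need.
struct DeviceInfo {
    int maxThreadsPerBlock;
    std::size_t sharedMemPerBlock;
};

// Caches one DeviceInfo per device ordinal so the expensive query runs
// at most once per device instead of once per kernel launch.
class DeviceInfoCache {
public:
    using Query = std::function<DeviceInfo(int)>;

    explicit DeviceInfoCache(Query query) : query_(std::move(query)) {}

    const DeviceInfo& get(int device) {
        assert(device >= 0);
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(device);
        if (it == cache_.end()) {
            // First request for this device: run the expensive query once.
            it = cache_.emplace(device, query_(device)).first;
        }
        return it->second;  // std::map keeps references to elements stable
    }

private:
    Query query_;
    std::mutex mutex_;
    std::map<int, DeviceInfo> cache_;
};

// Usage sketch: simulates `launches` kernel launches on device 0 and
// returns how many times the expensive query actually ran.
int countQueriesForLaunches(int launches) {
    int queries = 0;
    DeviceInfoCache cache([&queries](int /*device*/) {
        ++queries;  // assumption: real code would call cudaGetDeviceProperties here
        return DeviceInfo{1024, 48u * 1024u};  // placeholder values, not measured
    });
    for (int i = 0; i < launches; ++i) {
        const DeviceInfo& info = cache.get(0);
        (void)info;  // a real helper would derive block and grid sizes here
    }
    return queries;
}
```

With this pattern, the 1002 launches of the test script would trigger a single query. If only a few fields are needed, cudaDeviceGetAttribute might be another option, since it averages well under 1 us per call in both profiles above.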
Thank you