Performance problem with nppiFilter_32f_C2R

I have used this function a lot on my laptop (GeForce 940MX) without any issues. After migrating the program to a computer with a TITAN X (Pascal), I see a performance degradation. While the actual kernel runs much faster, as expected, a lot of time seems to be spent in cudaGetDeviceProperties. Below are the profiling outputs for a small test script that performs 1000 5x5 float convolutions on 320x240x2 images, first on the 940MX:

==9289== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 54.48%  213.80ms      2004  106.68us  1.0240us  318.68us  [CUDA memcpy HtoD]
 35.31%  138.55ms      1001  138.41us  4.9280us  141.50us  void ForEachPixelNaive<float, int=2, Filter32fReplicateBorderSharedFunctor<float, float, int=2, int=2, float, FilterSharedBlockManager<float, int=2>>>(Image<float, int=2>, NppiSize, float)
 10.20%  40.040ms      1002  39.959us  1.5680us  44.255us  fill
  0.00%  11.616us         3  3.8720us  2.4000us  4.6400us  [CUDA memcpy DtoH]
  0.00%  4.6720us         1  4.6720us  4.6720us  4.6720us  void ForEachPixelNaive<float, int=1, Filter32fReplicateBorderSharedFunctor<float, float, int=1, int=1, float, FilterSharedBlockManager<float, int=1>>>(Image<float, int=1>, NppiSize, float)

==9289== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 22.78%  505.49ms      3006  168.16us  6.9500us  507.39us  cuMemFree
 22.40%  497.09ms      3006  165.36us  6.3920us  574.40us  cuMemAlloc
 20.25%  449.39ms      1002  448.50us  10.254us  436.70ms  cudaLaunch
 15.91%  353.14ms      1002  352.43us  321.30us  700.55us  cudaGetDeviceProperties
  9.86%  218.81ms         1  218.81ms  218.81ms  218.81ms  cuCtxDetach
  5.10%  113.19ms         1  113.19ms  113.19ms  113.19ms  cuCtxCreate
  2.66%  59.082ms      2004  29.481us  8.2960us  275.36us  cuMemcpyHtoD
  0.62%  13.858ms      1002  13.830us  10.070us  50.107us  cuLaunchKernel
  0.10%  2.3026ms      2004  1.1490us     460ns  345.86us  cudaGetDevice
  0.05%  1.1575ms      3010     384ns     227ns  13.820us  cuCtxGetDevice
  0.05%  1.1356ms      2004     566ns     317ns  4.1880us  cudaDeviceGetAttribute
  0.04%  843.63us      2004     420ns     134ns  15.811us  cudaGetDeviceCount
  0.04%  821.32us       203  4.0450us     168ns  182.99us  cuDeviceGetAttribute
  0.03%  766.64us      1002     765ns     376ns  14.817us  cudaConfigureCall
  0.03%  670.23us      1002     668ns     471ns  3.9990us  cuFuncSetBlockShape
  0.03%  626.08us      3006     208ns     130ns  5.3460us  cudaSetupArgument
  0.01%  264.48us      1002     263ns     196ns  13.561us  cudaGetLastError
  0.01%  231.46us         2  115.73us  108.50us  122.96us  cuDeviceTotalMem
  0.01%  153.49us         1  153.49us  153.49us  153.49us  cuModuleLoadDataEx
  0.00%  106.74us         2  53.372us  41.294us  65.450us  cuDeviceGetName
  0.00%  70.387us         2  35.193us  27.081us  43.306us  cuMemcpy2D
  0.00%  55.764us         1  55.764us  55.764us  55.764us  cuModuleUnload
  0.00%  25.024us         1  25.024us  25.024us  25.024us  cuMemcpyDtoH
  0.00%  6.7230us         5  1.3440us     296ns  4.0430us  cuDeviceGetCount
  0.00%  6.6780us         8     834ns     231ns  2.7970us  cuCtxPushCurrent
  0.00%  4.9090us         5     981ns     240ns  1.7660us  cuDeviceGet
  0.00%  3.7380us         8     467ns     204ns  1.0060us  cuCtxPopCurrent
  0.00%  3.4850us        16     217ns     157ns     440ns  cuDeviceComputeCapability
  0.00%  1.1870us         2     593ns     440ns     747ns  cuInit
  0.00%     806ns         1     806ns     806ns     806ns  cuModuleGetFunction
  0.00%     729ns         2     364ns     255ns     474ns  cuDriverGetVersion

The same code run on the TITAN X:

==26042== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 50.39%  54.609ms      2004  27.250us     960ns  65.249us  [CUDA memcpy HtoD]
 47.22%  51.173ms      1001  51.122us  2.6890us  51.617us  void ForEachPixelNaive<float, int=2, Filter32fReplicateBorderSharedFunctor<float, float, int=2, int=2, float, FilterSharedBlockManager<float, int=2>>>(Image<float, int=2>, NppiSize, float)
  2.38%  2.5739ms      1002  2.5680us  1.4400us  2.7840us  fill
  0.00%  4.2560us         3  1.4180us  1.2800us  1.5680us  [CUDA memcpy DtoH]
  0.00%  2.8480us         1  2.8480us  2.8480us  2.8480us  void ForEachPixelNaive<float, int=1, Filter32fReplicateBorderSharedFunctor<float, float, int=1, int=1, float, FilterSharedBlockManager<float, int=1>>>(Image<float, int=1>, NppiSize, float)

==26042== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 45.03%  1.37474s      1002  1.3720ms  1.2603ms  9.5501ms  cudaGetDeviceProperties
 22.79%  695.85ms      1002  694.46us  16.596us  677.55ms  cudaLaunch
 12.72%  388.27ms         1  388.27ms  388.27ms  388.27ms  cuCtxDetach
 12.67%  386.72ms         1  386.72ms  386.72ms  386.72ms  cuCtxCreate
  2.44%  74.558ms      3006  24.802us  5.5920us  559.68us  cuMemFree
  2.38%  72.511ms      2004  36.183us  11.000us  164.77us  cuMemcpyHtoD
  0.98%  29.948ms      3006  9.9620us  2.9610us  1.2263ms  cuMemAlloc
  0.55%  16.643ms      1002  16.610us  15.356us  62.671us  cuLaunchKernel
  0.10%  3.1094ms       202  15.392us     218ns  695.59us  cuDeviceGetAttribute
  0.06%  1.9541ms      2004     975ns     455ns  451.56us  cudaGetDevice
  0.06%  1.7512ms         2  875.58us  857.32us  893.84us  cuDeviceTotalMem
  0.04%  1.1267ms      2004     562ns     281ns  15.781us  cudaDeviceGetAttribute
  0.03%  914.36us      3006     304ns     144ns  14.809us  cudaSetupArgument
  0.03%  905.53us      2004     451ns     130ns  11.326us  cudaGetDeviceCount
  0.03%  890.14us      3010     295ns     168ns  8.1980us  cuCtxGetDevice
  0.02%  758.98us      1002     757ns     572ns  8.9910us  cudaConfigureCall
  0.02%  735.69us      1002     734ns     446ns  10.691us  cuFuncSetBlockShape
  0.01%  437.48us         2  218.74us  145.12us  292.36us  cuDeviceGetName
  0.01%  342.20us      1002     341ns     278ns  4.8230us  cudaGetLastError
  0.01%  316.49us         1  316.49us  316.49us  316.49us  cuModuleLoadDataEx
  0.00%  146.12us         1  146.12us  146.12us  146.12us  cuModuleUnload
  0.00%  56.660us         2  28.330us  27.471us  29.189us  cuMemcpy2D
  0.00%  42.767us         1  42.767us  42.767us  42.767us  cuMemcpyDtoH
  0.00%  5.6830us        16     355ns     195ns     840ns  cuDeviceComputeCapability
  0.00%  5.0090us         8     626ns     208ns  2.4440us  cuCtxPushCurrent
  0.00%  4.1750us         4  1.0430us     426ns  2.3830us  cuDeviceGetCount
  0.00%  3.7850us         8     473ns     194ns  1.2910us  cuCtxPopCurrent
  0.00%  2.7780us         4     694ns     440ns  1.2050us  cuDeviceGet
  0.00%  2.2500us         2  1.1250us     672ns  1.5780us  cuInit
  0.00%  1.9440us         1  1.9440us  1.9440us  1.9440us  cuModuleGetFunction
  0.00%  1.6990us         2     849ns     488ns  1.2110us  cuDriverGetVersion
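
To put the two API tables side by side, the per-call cost of cudaGetDeviceProperties can be compared with the average run time of the 2-channel kernel. The short Python sketch below only redoes arithmetic on figures copied from the profiler output above; the dictionary keys and the helper name `props_overhead` are illustrative and not part of any tool.

```python
# Figures copied from the nvprof tables above, converted to microseconds.
# The key names are illustrative only.
profiles = {
    "GeForce 940MX": {
        "props_total_us": 353.14e3,   # cudaGetDeviceProperties, total time
        "props_calls": 1002,
        "kernel_avg_us": 138.41,      # 2-channel ForEachPixelNaive, average
    },
    "TITAN X (Pascal)": {
        "props_total_us": 1.37474e6,  # cudaGetDeviceProperties, total time
        "props_calls": 1002,
        "kernel_avg_us": 51.122,      # 2-channel ForEachPixelNaive, average
    },
}


def props_overhead(p):
    """Return (average us per cudaGetDeviceProperties call,
    that cost divided by the average kernel time)."""
    per_call = p["props_total_us"] / p["props_calls"]
    return per_call, per_call / p["kernel_avg_us"]


for name, p in profiles.items():
    per_call, ratio = props_overhead(p)
    print(f"{name}: {per_call:.1f} us per call, {ratio:.1f}x the kernel time")
```

On the 940MX the query takes about 2.5 times as long as the kernel; on the TITAN X it takes about 27 times as long, because the kernel got faster while the query got slower.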

As you can see, on the TITAN X the program spends over a second (1.37 s across 1002 calls) in cudaGetDeviceProperties, compared with 353 ms on the 940MX. Interestingly, this only seems to happen with the 2-channel convolution function; the single-channel convolution function does not call it nearly as often. Here are the configurations of the two systems, first the 940MX and then the TITAN X:

CUDA Version: 8.0.44

==============NVSMI LOG==============

Timestamp                           : Fri Jun  9 10:46:11 2017
Driver Version                      : 375.39

Attached GPUs                       : 1
GPU 0000:02:00.0
    Product Name                    : GeForce 940MX
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-2b43952c-5da6-8fd7-2ee8-3abd491f7c0b
    Minor Number                    : 0
    VBIOS Version                   : 82.08.57.00.22
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x134D10DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x505017AA
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 4x
                Current             : 4x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 2002 MiB
        Used                        : 348 MiB
        Free                        : 1654 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 1 MiB
        Free                        : 255 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : N/A
        Decoder                     : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 37 C
        GPU Shutdown Temp           : 99 C
        GPU Slowdown Temp           : 94 C
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : 135 MHz
        SM                          : 135 MHz
        Memory                      : 405 MHz
        Video                       : 405 MHz
    Applications Clocks
        Graphics                    : 1124 MHz
        Memory                      : 1001 MHz
    Default Applications Clocks
        Graphics                    : 1122 MHz
        Memory                      : 1001 MHz
    Max Clocks
        Graphics                    : 1241 MHz
        SM                          : 1241 MHz
        Memory                      : 1001 MHz
        Video                       : 1216 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 8842
            Type                    : C
            Name                    : /home/wiedemac/python3/bin/python3
            Used GPU Memory         : 17 MiB
        Process ID                  : 19297
            Type                    : C
            Name                    : /home/wiedemac/python3/bin/python
            Used GPU Memory         : 16 MiB
        Process ID                  : 23338
            Type                    : C
            Name                    : /home/wiedemac/python3/bin/python
            Used GPU Memory         : 309 MiB

CUDA Version: 8.0.61

==============NVSMI LOG==============

Timestamp                           : Fri Jun  9 08:47:03 2017
Driver Version                      : 378.13

Attached GPUs                       : 1
GPU 0000:88:00.0
    Product Name                    : TITAN X (Pascal)
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0324416137595
    GPU UUID                        : GPU-e4f13fab-5ba5-e5a4-436b-81bbd8579018
    Minor Number                    : 3
    VBIOS Version                   : 86.02.15.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x8800
    GPU Part Number                 : 900-1G611-2500-000
    Inforom Version
        Image Version               : G001.0000.01.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x88
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0010DE
        Bus Id                      : 0000:88:00.0
        Sub System Id               : 0x119A10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 23 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 12189 MiB
        Used                        : 0 MiB
        Free                        : 12189 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 24 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 16.17 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 40 MHz
    Applications Clocks
        Graphics                    : 1417 MHz
        Memory                      : 5005 MHz
    Default Applications Clocks
        Graphics                    : 1417 MHz
        Memory                      : 5005 MHz
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 5005 MHz
        Video                       : 1708 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

Is there any way around this?

Thank you

I second your experience:
ForEachPixelNaive, as the name implies, is rather inefficient.
I have written a custom kernel to do the job (based on the samples), which is much faster.
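
For anyone who goes the custom-kernel route, a slow but simple CPU reference is handy for checking the GPU output. The sketch below is plain Python, not the kernel mentioned above and not NPP's implementation. It assumes an interleaved 2-channel float image without row padding, applies the weights as a correlation (no kernel flip), and replicates edge pixels at the border. The function name `filter_c2_replicate` is made up for this example.

```python
def filter_c2_replicate(src, width, height, kernel, ksize):
    """Filter an interleaved 2-channel float image (row-major, no padding).

    src    : flat list of length width * height * 2
    kernel : flat list of ksize * ksize weights, applied as a correlation
    Pixels outside the image are replaced by the nearest edge pixel.
    """
    channels = 2
    half = ksize // 2
    dst = [0.0] * (width * height * channels)
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                acc = 0.0
                for ky in range(ksize):
                    # Clamp the source row to the image (replicate border).
                    sy = min(max(y + ky - half, 0), height - 1)
                    for kx in range(ksize):
                        # Clamp the source column the same way.
                        sx = min(max(x + kx - half, 0), width - 1)
                        weight = kernel[ky * ksize + kx]
                        acc += weight * src[(sy * width + sx) * channels + c]
                dst[(y * width + x) * channels + c] = acc
    return dst


# A 5x5 box filter on a tiny 4x3 image whose two channels hold 1.0 and 2.0.
w, h = 4, 3
image = [1.0, 2.0] * (w * h)
box = [1.0 / 25.0] * 25
out = filter_c2_replicate(image, w, h, box, 5)
```

With a normalized box filter and constant channels, every output value should match its input up to rounding, which makes a quick sanity check. In a GPU kernel the two outer loops would typically be replaced by the thread's (x, y) index.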