Possible bug in nppiWarpPerspectiveBack_8u_C1R and nppiWarpPerspectiveBack_32f_C1R?

I’m seeing odd behavior. On my primary dev box, a Linux machine with a Titan X GPU, nppiWarpPerspectiveBack_X_C1R behaves as expected. On a Windows 10 machine with a Titan V, it behaves differently: nppiWarpPerspectiveBack returns an all-zero buffer without returning an error code. On that same machine, nppiWarpAffine_8u_C1R and nppiWarpAffine_32f_C1R behave as expected. So it seems I’m seeing machine/GPU-specific behavior only with nppiWarpPerspectiveBack. Can someone confirm whether this is a known issue?

template <typename PIXEL_T, typename NPP_FWARP, typename NPP_FSET>
void callNPP(const PIXEL_T* pSrc,
             const std::vector<size_t> srcSize,
             NppiRect &srcRoi,
             PIXEL_T* pDst,
             NppiRect &dstRoi,
             NppiSize &dstRoiSize,
             const double *tformCoeffs,
             const double *fillVal,
             const std::string &interpolation,
             size_t numelSrc,
             const std::vector<size_t> dstSize,
             NPP_FWARP nppWarpFuncPtr,
             NPP_FSET nppSetFuncPtr)
{
    NppiSize sizeSrc = { static_cast<int>(srcSize[0]), static_cast<int>(srcSize[1]) };

    int dstStep = static_cast<int>(dstSize[0]) * sizeof(PIXEL_T);
    int srcStep = static_cast<int>(srcSize[0]) * sizeof(PIXEL_T);

    size_t inPlanes = (numelSrc == 0) ?
            0 : numelSrc / srcSize[0] / srcSize[1];

    mwSize dstSizePerPlane = dstSize[0]*dstSize[1];

    int interpMethod = imagesgpu::getInterpEnumFromString(interpolation);

    double T[3][3];
    NppStatus statusCode;

    for (mwSize k = 0; k < inPlanes; k++)
    {
        // Initialize plane with fill value. Dst pixels that map out of bounds
        // in src will not be touched and therefore keep this initialized value.
        statusCode = (*nppSetFuncPtr)(static_cast<PIXEL_T>(fillVal[k]), pDst + k*dstSizePerPlane, dstStep, dstRoiSize);

        if (statusCode != NPP_SUCCESS)
        {
            // ...
        }

        statusCode = (*nppWarpFuncPtr)(
            (pSrc + k*srcSize[0]*srcSize[1]),
            (pDst + k*dstSize[0]*dstSize[1]),
            // ...
            );

        if (statusCode != NPP_SUCCESS)
        {
            // ...
        }
    }
}

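For comparison, here is a small CPU-only sketch of the behavior the comment in the loop describes: the destination starts at the fill value, and only pixels whose back-mapped source coordinate lands inside the source image are overwritten. This is not NPP code. The function name warpPerspectiveBackRef is hypothetical, and the sketch assumes that T maps destination coordinates to source coordinates, that images are row-major, and that sampling is nearest-neighbor. Whether that matches the coefficient convention of nppiWarpPerspectiveBack should be checked against the NPP documentation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of NPP).
// Assumptions: T maps destination (x, y) to source coordinates,
// images are row-major, sampling is nearest-neighbor.
template <typename PixelT>
std::vector<PixelT> warpPerspectiveBackRef(const std::vector<PixelT>& src,
                                           int srcW, int srcH,
                                           int dstW, int dstH,
                                           const double T[3][3],
                                           PixelT fillVal)
{
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);

    // Every destination pixel starts out as the fill value.
    std::vector<PixelT> dst(static_cast<std::size_t>(dstW) * dstH, fillVal);

    for (int y = 0; y < dstH; ++y) {
        for (int x = 0; x < dstW; ++x) {
            const double w = T[2][0] * x + T[2][1] * y + T[2][2];
            if (w == 0.0) continue;  // point maps to infinity, keep fill value

            const double sx = (T[0][0] * x + T[0][1] * y + T[0][2]) / w;
            const double sy = (T[1][0] * x + T[1][1] * y + T[1][2]) / w;

            // Round to the nearest source pixel, staying in double so that
            // huge or NaN coordinates are rejected before any integer cast.
            const double rx = std::floor(sx + 0.5);
            const double ry = std::floor(sy + 0.5);
            if (!(rx >= 0 && rx < srcW && ry >= 0 && ry < srcH)) continue;

            const std::size_t si = static_cast<std::size_t>(ry) * srcW
                                 + static_cast<std::size_t>(rx);
            dst[static_cast<std::size_t>(y) * dstW + x] = src[si];
        }
    }
    return dst;
}
```

Running this on the same input and coefficients as the GPU call would show which destination pixels are expected to differ from the fill value. That can help separate "the transform maps everything out of bounds" from "the warp kernel did not run".
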
CUDADevice with properties:

                  Name: 'TITAN V'
                 Index: 1
     ComputeCapability: '7.0'
        SupportsDouble: 1
         DriverVersion: 10.1000
        ToolkitVersion: 10.1000
    MaxThreadsPerBlock: 1024
      MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
           MaxGridSize: [2.1475e+09 65535 65535]
             SIMDWidth: 32
           TotalMemory: 1.2747e+10
       AvailableMemory: 1.1943e+10
   MultiprocessorCount: 80
          ClockRateKHz: 1455000
           ComputeMode: 'Default'
  GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
      CanMapHostMemory: 1
       DeviceSupported: 1
        DeviceSelected: 1

Two suggestions:

  1. When posting code here, use the code formatting that is available. In the edit box, there is a button at the top right that looks like this: </> If you select your code, then press that button, you will get code formatting, which makes it easier to read.

  2. If you think you have found a bug, it’s usually a good idea to post a complete reproducer code. In my experience, it makes your post more likely to be responded to, and if you file a bug, you will definitely be asked for one. I wouldn’t expect any difference in behavior for the same code operating on the same data running on Titan X/Linux vs. Tesla P100/Windows, and I doubt it is a known issue. In addition to the NPP error checking, you should also use proper CUDA error checking for any CUDA calls you make, and it’s a good idea to run the failing case under cuda-memcheck. Also make sure your Tesla P100 is in a proper server configuration, where the server has been certified by the OEM for use with the Tesla P100. Tesla cards generally require flow-through cooling and can be flaky or error-prone if not in a proper server setup.
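
As a rough illustration of the error-checking suggestion, one common pattern is to wrap every status-returning call in a macro that records the failing expression and its location. The names checkStatus and CHECK_STATUS below are hypothetical, not part of CUDA or NPP. The helper is deliberately generic so the sketch builds without the toolkit headers. In a real reproducer the second argument would be cudaSuccess for runtime calls and NPP_SUCCESS for NPP calls.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws if `status` differs from `okValue`,
// reporting the failing expression and where it was called.
template <typename StatusT>
void checkStatus(StatusT status, StatusT okValue,
                 const char* expr, const char* file, int line)
{
    assert(expr != nullptr && file != nullptr);
    if (status != okValue) {
        std::ostringstream msg;
        msg << expr << " returned " << static_cast<long long>(status)
            << " at " << file << ":" << line;
        throw std::runtime_error(msg.str());
    }
}

// Hypothetical convenience macro that captures expression text and location.
#define CHECK_STATUS(call, okValue) \
    checkStatus((call), (okValue), #call, __FILE__, __LINE__)

// Intended usage in a reproducer (requires the CUDA toolkit and NPP headers):
//   CHECK_STATUS(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice), cudaSuccess);
//   CHECK_STATUS((*nppWarpFuncPtr)(/* ... */), NPP_SUCCESS);
//   CHECK_STATUS(cudaDeviceSynchronize(), cudaSuccess);
```

Checking cudaDeviceSynchronize() after the NPP call is there because kernel failures can surface asynchronously, on a later runtime call instead of the one that launched the work. The failing case can then be run as `cuda-memcheck ./repro`, where `repro` is a placeholder name for the reproducer executable.
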