NPP functions fail with multiple devices

In the following code, the second call to nppiSet_32s_C1R always fails with NPP_CUDA_KERNEL_EXECUTION_ERROR.


int iDeviceCount = 0;
checkCudaErrors(cudaGetDeviceCount(&iDeviceCount));

const int iWidth = 1000;
const int iHeight = 1;
NppiSize stBufferROI = { iWidth, iHeight };

Npp32s *pBuffer = 0;
int iBufferStep = 0;

for (int iDevice = 0; iDevice < iDeviceCount; iDevice++) {
    // Switch the calling host thread to the next GPU.
    checkCudaErrors(cudaSetDevice(iDevice));
    pBuffer = nppiMalloc_32s_C1(iWidth, iHeight, &iBufferStep);
    // Fails with NPP_CUDA_KERNEL_EXECUTION_ERROR on the second iteration.
    NPP_CHECK_NPP(nppiSet_32s_C1R(0, pBuffer, iBufferStep, stBufferROI));
    nppiFree(pBuffer);
}

It also always fails whenever iBufferStep % 64 == 0 and iWidth % 16 != 0.
Run on each GPU separately, the code works fine.
If the second iteration of the loop runs in a new thread, everything works fine too.
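
For reference, the new-thread variant can be sketched roughly as below. This is only a minimal sketch, not a confirmed fix: runEachDeviceInNewThread is a hypothetical helper name (it is not part of CUDA or NPP), and the per-device work is passed in as a callable so the sketch compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not part of CUDA or NPP): calls body(iDevice) for each
// device index on a freshly created host thread, one device at a time, and
// returns the per-device results in device order.
inline std::vector<int> runEachDeviceInNewThread(
    int iDeviceCount, const std::function<int(int)> &body) {
    const std::size_t count =
        iDeviceCount > 0 ? static_cast<std::size_t>(iDeviceCount) : 0;
    std::vector<int> results(count, 0);

    for (int iDevice = 0; iDevice < iDeviceCount; iDevice++) {
        // A new host thread per iteration, so the work for this device runs
        // on a thread that has not previously been bound to another device.
        std::thread worker([&results, &body, iDevice] {
            results[static_cast<std::size_t>(iDevice)] = body(iDevice);
        });
        // Join before the next iteration: devices are still used one by one.
        worker.join();
    }
    return results;
}
```

In the real program the callable would contain the loop body from the code above (cudaSetDevice(iDevice), nppiMalloc_32s_C1, nppiSet_32s_C1R, nppiFree) and return the NppStatus cast to int, so a non-zero entry in the result marks the device where the call failed.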

I have 2 GPUs: RTX 2070 and GTX 1660 Ti.
Tested on driver versions 456.43 and 462.31, CUDA versions 10.1 and 11.1.

I suggest filing a bug. The instructions are linked in a sticky post at the top of this sub-forum.