In the following code, the second call to nppiSet_32s_C1R (the one made in the second loop iteration, after switching to the second GPU) always fails with NPP_CUDA_KERNEL_EXECUTION_ERROR:
int iDeviceCount = 0;
checkCudaErrors(cudaGetDeviceCount(&iDeviceCount));
const int iWidth = 1000;
const int iHeight = 1;
NppiSize stBufferROI = { iWidth, iHeight };
Npp32s *pBuffer = 0;
int iBufferStep = 0;
for (int iDevice = 0; iDevice < iDeviceCount; iDevice++) {
    checkCudaErrors(cudaSetDevice(iDevice));
    pBuffer = nppiMalloc_32s_C1(iWidth, iHeight, &iBufferStep);
    NPP_CHECK_NPP(nppiSet_32s_C1R(0, pBuffer, iBufferStep, stBufferROI));
    nppiFree(pBuffer);
}
It also always fails when iBufferStep % 64 == 0 and iWidth % 16 != 0.
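For anyone trying to reproduce this with other sizes, the condition above can be written as a small predicate and logged next to each call. This is only a sketch: the helper name matchesFailingPattern is made up for illustration, and iBufferStep is assumed to be the step in bytes returned by nppiMalloc_32s_C1.

```cpp
// Hypothetical helper (not part of NPP): returns true when a buffer matches
// the failing pattern described in the post, i.e. the step in bytes is a
// multiple of 64 while the width in pixels is not a multiple of 16.
bool matchesFailingPattern(int iBufferStep, int iWidth) {
    return (iBufferStep % 64 == 0) && (iWidth % 16 != 0);
}
```

With iWidth = 1000 as in the snippet, 1000 % 16 is 8, so the width half of the condition holds; whether the step half holds depends on the value nppiMalloc_32s_C1 actually returns.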
When the code is run on either GPU alone, it works fine.
It also works fine if the second iteration of the loop runs in a new thread.
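The new-thread variant can be sketched roughly as follows. The helper forEachDeviceOnFreshThread is a made-up name, not a CUDA or NPP function, and this is not necessarily the exact code that was tested. It relies only on the fact that cudaSetDevice applies to the calling host thread; why a fresh thread avoids the error is not established here.

```cpp
#include <functional>
#include <thread>

// Hypothetical helper (not part of CUDA or NPP): calls work(iDevice) for each
// device index, each time on a freshly created host thread.
void forEachDeviceOnFreshThread(int iDeviceCount,
                                const std::function<void(int)> &work) {
    for (int iDevice = 0; iDevice < iDeviceCount; iDevice++) {
        // One new std::thread per iteration, joined before the next one
        // starts, so the devices are still processed one after another.
        std::thread worker([&work, iDevice] { work(iDevice); });
        worker.join();
    }
}

// Assumed usage with the loop body from the snippet above:
//   forEachDeviceOnFreshThread(iDeviceCount, [&](int iDevice) {
//       checkCudaErrors(cudaSetDevice(iDevice));
//       pBuffer = nppiMalloc_32s_C1(iWidth, iHeight, &iBufferStep);
//       NPP_CHECK_NPP(nppiSet_32s_C1R(0, pBuffer, iBufferStep, stBufferROI));
//       nppiFree(pBuffer);
//   });
```

Joining inside the loop keeps the behaviour sequential, so the only difference from the original loop is which host thread issues the CUDA and NPP calls.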
I have 2 GPUs: RTX 2070 and GTX 1660 Ti.
Tested on driver versions 456.43 and 462.31, CUDA versions 10.1 and 11.1.