Does calling nppiCopy_16s_C1R_Ctx from side threads cause a memory leak?

Hello,

I am trying to use nppiCopy_16s_C1R_Ctx() on Windows 10,
and I see a memory leak when I call it from side threads.

Pseudocode:

Main()
    for (int i=0; i < 1000; i++) {
      cudaMallocPitch()       // Allocate src buffer
      cudaMallocPitch()       // Allocate dest buffer

      cudaStreamCreate()      // for _Ctx call [A]
      Setup_NppStream         // Setup NppStreamContext with stream [A]

      createThreadNppiCopy()  // Create a thread which calls nppiCopy_16s_C1R_Ctx()

      cudaStreamDestroy()
      cudaFree()              // Deallocate src buffer
      cudaFree()              // Deallocate dest buffer
    }

createThreadNppiCopy()
    CreateThread() ==> threadNppiCopy()
    WaitForSingleObject()     // Wait for completion of threadNppiCopy()
    CloseHandle()             // Close the handle created by CreateThread()

threadNppiCopy()
    callNppiCopy()

callNppiCopy()
    nppiCopy_16s_C1R_Ctx()    // Use stream [A]
    cudaStreamSynchronize()   // Wait for stream [A]

If I call callNppiCopy() directly from the main thread instead of going through createThreadNppiCopy(), I don’t see a memory leak.
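
To make the two variants I am comparing easier to see, here is a portable sketch of the control flow only. It is not my actual code: std::thread stands in for CreateThread()/WaitForSingleObject()/CloseHandle(), and callNppiCopyStub() is a hypothetical placeholder for callNppiCopy() that just counts calls, so it compiles without CUDA or NPP and does not reproduce the leak by itself.

```cpp
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical placeholder for callNppiCopy(). In the real program this is
// where nppiCopy_16s_C1R_Ctx() and cudaStreamSynchronize() would be called.
void callNppiCopyStub(int* callCount) {
    ++*callCount;
}

// Runs `work` on a freshly created side thread and blocks until it finishes.
// Portable stand-in for CreateThread() + WaitForSingleObject() + CloseHandle().
void runOnSideThread(const std::function<void()>& work) {
    std::thread worker(work);
    worker.join();  // the thread has fully finished before the caller continues
}

// Mirrors the loop in Main(): one copy call per iteration, issued either
// from a new side thread or directly from the calling thread.
// Returns the number of copy calls that were made.
int runIterations(int iterations, bool useSideThread) {
    assert(iterations >= 0);
    int callCount = 0;
    for (int i = 0; i < iterations; ++i) {
        // Per-iteration buffer and stream setup would go here.
        if (useSideThread) {
            runOnSideThread([&callCount] { callNppiCopyStub(&callCount); });
        } else {
            callNppiCopyStub(&callCount);
        }
        // Per-iteration stream and buffer teardown would go here.
    }
    return callCount;
}
```

In this sketch, runIterations(1000, true) corresponds to the case where I see the leak and runIterations(1000, false) to the case where I do not. The only difference between them is which thread issues the call.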

Does NPP support being called from side threads?

My Environment:

  • OS: Windows 10 Pro 1903 (ja)
  • CUDA: 10.1 update2
  • NPP: 10.2.0
  • NVIDIA Graphics Driver: 431.70
  • GPU: Quadro RTX 4000 (TCC mode)
  • Compiler: VS2013

Thanks,
naoy4w