nppiCFAToRGB_8u_C1C3R_Ctx Blocking HtoD Copy

As can be seen below, the nppiCFAToRGB_8u_C1C3R_Ctx function (purple blobs) blocks the following stream’s MemcpyHtoD. How can I maximize the overlap between those two operations? Also, shouldn’t the kernel execution engine and the copy engine work independently of each other by default? I am using CUDA 11.1 and created the streams as follows:

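// nppStreamCtx is an NppStreamContext declared elsewhere.
cudaStream_t stream_;
cudaError_t err = cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking);
nppStreamCtx.hStream = stream_;
// Record the stream's creation flags in the NPP context.
err = cudaStreamGetFlags(nppStreamCtx.hStream, &nppStreamCtx.nStreamFlags);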