Problem when using the NPP median filter with an application-managed CUDA stream

Similar to the implementation in the OpenCV CUDA module, I am running median filters on 5 CUDA streams.

I have an array of scratch buffers and an array of NppStreamContext:
d_buffer = new Npp8u * [streamSize_];
nppStreamCtxs = new NppStreamContext[streamSize_];

For each stream I call
nppiFilterMedianGetBufferSize_8u_C1R_Ctx(oSizeROI, oMaskSize, &bufferSize, nppStreamCtxs[i]);
to get the scratch buffer size, and then
cudaStreamSynchronize(nppStreamCtxs[i].hStream);
cudaMalloc((void**)&(d_buffer[i]), bufferSize);
to allocate the buffer.
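The per-stream arrays above imply that every in-flight filter call needs its own scratch buffer and context, since two concurrent calls sharing one scratch buffer would overwrite each other's data. As a minimal sketch, assuming images are handed to the streams in round-robin order (the post does not say how the stream index is chosen), a hypothetical helper `slotForJob` could pick the index into `d_buffer` and `nppStreamCtxs`:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of NPP or OpenCV.
// Maps the n-th submitted image to one of `slotCount` per-stream slots
// (scratch buffer + stream context) in round-robin order. This is an
// assumption about the scheduling, not something stated in the post.
inline std::size_t slotForJob(std::size_t jobIndex, std::size_t slotCount)
{
    assert(slotCount > 0);
    return jobIndex % slotCount;
}
```

With 5 streams, jobs 0 to 4 get slots 0 to 4 and job 5 reuses slot 0, so it must be queued on the same stream as job 0 to keep accesses to that scratch buffer ordered.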

I add the border and then call
nppiFilterMedian_8u_C1R_Ctx(srcRoi.ptr(), static_cast<int>(srcRoi.step),
dst.ptr(), static_cast<int>(dst.step),
oSizeROI, oMaskSize, oAnchor, d_buffer[streamIdx], nppStreamCtxs[streamIdx]);
to run the filter asynchronously.
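When experimenting with different stream context setups, a small CPU reference can help confirm that the filtered output stays correct independently of NPP. The following is a minimal sketch, not NPP's implementation: `medianReference` is a hypothetical helper, and it assumes an odd-sized mask, an anchor at the mask centre, and a tightly packed row-major 8-bit image that already includes the border.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not part of NPP or OpenCV.
// `src` is a padded, tightly packed, row-major 8-bit image of srcW x srcH.
// The result has (srcW - maskW + 1) x (srcH - maskH + 1) pixels: one for
// each position where the whole mask fits inside the padded source.
std::vector<std::uint8_t> medianReference(const std::vector<std::uint8_t>& src,
                                          int srcW, int srcH,
                                          int maskW, int maskH)
{
    assert(maskW > 0 && maskH > 0 && maskW <= srcW && maskH <= srcH);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);

    const int dstW = srcW - maskW + 1;
    const int dstH = srcH - maskH + 1;
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dstW) * dstH);
    std::vector<std::uint8_t> window(static_cast<std::size_t>(maskW) * maskH);

    for (int y = 0; y < dstH; ++y) {
        for (int x = 0; x < dstW; ++x) {
            // Gather the maskW x maskH neighbourhood with top-left corner (x, y).
            std::size_t n = 0;
            for (int dy = 0; dy < maskH; ++dy)
                for (int dx = 0; dx < maskW; ++dx)
                    window[n++] = src[static_cast<std::size_t>(y + dy) * srcW + (x + dx)];

            // Partially sort so the middle element is the median (odd mask assumed).
            auto mid = window.begin() + window.size() / 2;
            std::nth_element(window.begin(), mid, window.end());
            dst[static_cast<std::size_t>(y) * dstW + x] = *mid;
        }
    }
    return dst;
}
```

To compare against the GPU result, the destination would be copied back to the host after synchronizing the stream, and each row compared while taking the device pitch (`dst.step`) into account.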

The result is correct, but when I profile each stream with Nsight Compute, the kernel
void FilterMedianKernelSortingNetworkShared::RunKernel<float, (int)1, (int)1, (int)7>(Pixel<T1, T2> *, int,
NppiSize, NppiSize, const Pixel<T1, T2> *, int, int)
always triggers an implicit cudaDeviceSynchronize.

I tried various ways of setting up the NppStreamContext, and none of them worked. I am currently using:

cudaStream_t cStream = cv::cuda::StreamAccessor::getStream(streams[i]);
cudaStream_t nppStream = nppGetStream();
if (cStream != nppStream) {
    nppSetStream(cStream);
    nppGetStreamContext(&nppStreamCtxs[i]);
}

For the box filter (nppiFilterBox_8u_C1R_Ctx), setting up the context as follows works, but the same approach does not work for the median filter:
NppStreamContext nppStreamCtx;
nppSafeCall(nppGetStreamContext(&nppStreamCtx));
nppStreamCtx.hStream = StreamAccessor::getStream(_stream);

CUDA version: 11.7
OpenCV version: 4.9.0
GPU: A6000
Please help.