NPP with multiple streams

I’m working on an image processing project and I decided to use streams.
I created a stream and then I used nppSetStream.

I called nppiThreshold_LTValGTVal_32f_C1R, but two streams are used when the function executes.

Here is a code example:

#include <npp.h>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

int main(void) {
	
	int srcWidth = 1344;
	int srcHeight = 1344;
	int paddStride = 0;
	float* srcArrayDevice;
	float* srcArrayDevice2;
	unsigned char* dstArrayDevice;

	int status = cudaMalloc((void**)&srcArrayDevice, srcWidth * srcHeight * 4);
	status = cudaMalloc((void**)&srcArrayDevice2, srcWidth * srcHeight * 4);
	status = cudaMalloc((void**)&dstArrayDevice, srcWidth * srcHeight );

	cudaStream_t testStream;
	cudaStreamCreateWithFlags(&testStream, cudaStreamNonBlocking);
	nppSetStream(testStream);

	NppiSize roiSize = { srcWidth,srcHeight };
	//status = cudaMemcpyAsync(srcArrayDevice, &srcArrayHost, srcWidth*srcHeight*4, cudaMemcpyHostToDevice, testStream);

	int yRect = 100;
	int xRect = 60;
	float thrL = 50;
	float thrH = 1500;
	NppiSize sz = { 200, 400 };

	for (int i = 0; i < 10; i++) {
		int status3 = nppiThreshold_LTValGTVal_32f_C1R(srcArrayDevice + (srcWidth*yRect + xRect)
			, srcWidth * 4
			, srcArrayDevice2 + (srcWidth*yRect + xRect)
			, srcWidth * 4
			, sz
			, thrL
			, thrL
			, thrH
			, thrH);
	}

	int length = (srcWidth + paddStride)*srcHeight;
	int status6 = nppiScale_32f8u_C1R(srcArrayDevice, srcWidth * 4, dstArrayDevice + paddStride, srcWidth + paddStride, roiSize, 0, 65535);

	//int status7 = cudaMemcpyAsync(dstPinPtr, dstTest, length, cudaMemcpyDeviceToHost, testStream);
	cudaFree(srcArrayDevice);
	cudaFree(srcArrayDevice2);
	cudaFree(dstArrayDevice);
	cudaStreamDestroy(testStream);
	cudaProfilerStop();
	return 0;
}

This is what I got from the NVIDIA Visual Profiler: https://drive.google.com/file/d/0Bwm_zbskOasVWmpldFhBZ0VwSTg/view?usp=sharing

Why are there 2 streams?
I noticed that this behaviour depends on the size of the image. If width and height are set to 1500, the result is this:

https://drive.google.com/file/d/0Bwm_zbskOasVc3Qxdk9uZ3hTelU/view?usp=sharing

Is this a bug or am I missing something?

This is an answer I received on another site; I'm posting it here.

It appears that nppiThreshold_LTValGTVal_32f_C1R creates its own internal stream for executing one of the kernels it uses. The other is launched either into the default stream, or the stream you specified with nppSetStream.
I think this is really a documentation oversight/user expectation problem. nppSetStream is doing what it says, but nowhere is it stated that the library is limited to using one stream. The documentation should probably be more explicit about how many streams the library uses internally, and how nppSetStream interacts with them. If this is a problem for your application, I suggest you file a bug report with NVIDIA.
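
If the second stream is a concern, one defensive option is to check which stream NPP reports and then wait on the whole device rather than on a single stream. The sketch below is only an illustration under assumptions: thresholdAndWait is a hypothetical helper (not part of NPP), it uses the same non-context NPP API as the question, and it assumes that synchronizing only the user stream might not cover work the library launches on an internal stream. I have not verified that this is actually necessary for this function.

```cuda
#include <cstdio>
#include <npp.h>
#include <cuda_runtime.h>

// Hypothetical helper (not an NPP function): thresholds a tightly packed
// 32-bit float image on `stream` and returns only after all device work is done.
bool thresholdAndWait(const float* src, float* dst, int widthPx, int heightPx,
                      float low, float high, cudaStream_t stream)
{
    // Route subsequent NPP calls to the caller's stream.
    nppSetStream(stream);

    // Sanity check: NPP should report the stream that was just set.
    if (nppGetStream() != stream) {
        std::fprintf(stderr, "NPP is not using the requested stream\n");
        return false;
    }

    NppiSize roi = { widthPx, heightPx };
    int stepBytes = widthPx * (int)sizeof(float);  // row pitch in bytes

    // Values below `low` become `low`, values above `high` become `high`.
    NppStatus st = nppiThreshold_LTValGTVal_32f_C1R(src, stepBytes, dst, stepBytes,
                                                    roi, low, low, high, high);
    if (st != NPP_SUCCESS) {
        std::fprintf(stderr, "NPP threshold failed with status %d\n", (int)st);
        return false;
    }

    // Assumption: part of the work may run on a stream the library created
    // internally, so cudaStreamSynchronize(stream) alone might not wait for it.
    // cudaDeviceSynchronize() waits for every stream on the device.
    return cudaDeviceSynchronize() == cudaSuccess;
}
```

With the buffers from the question this could be called as thresholdAndWait(srcArrayDevice, srcArrayDevice2, srcWidth, srcHeight, thrL, thrH, testStream). The trade-off is that a device-wide wait also stalls unrelated streams, so it gives up some of the overlap that streams are meant to provide.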