Npp with multiple Streams

sanzo · August 9, 2016, 10:13am

I’m working on an image processing project and I decided to use streams.
I created a stream and then I used nppSetStream.

I called nppiThreshold_LTValGTVal_32f_C1R, but two streams are used when the function executes.

Here is a code example:

#include <npp.h>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

int main(void) {
	
	int srcWidth = 1344;
	int srcHeight = 1344;
	int paddStride = 0;
	float* srcArrayDevice;
	float* srcArrayDevice2;
	unsigned char* dstArrayDevice;

	int status = cudaMalloc((void**)&srcArrayDevice, srcWidth * srcHeight * 4);
	status = cudaMalloc((void**)&srcArrayDevice2, srcWidth * srcHeight * 4);
	status = cudaMalloc((void**)&dstArrayDevice, srcWidth * srcHeight );

	cudaStream_t testStream;
	cudaStreamCreateWithFlags(&testStream, cudaStreamNonBlocking);
	nppSetStream(testStream);

	NppiSize roiSize = { srcWidth,srcHeight };
	//status = cudaMemcpyAsync(srcArrayDevice, &srcArrayHost, srcWidth*srcHeight*4, cudaMemcpyHostToDevice, testStream);

	int yRect = 100;
	int xRect = 60;
	float thrL = 50;
	float thrH = 1500;
	NppiSize sz = { 200, 400 };

	for (int i = 0; i < 10; i++) {
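		// Threshold a 200x400 ROI whose top-left pixel is at (xRect, yRect).
		int status3 = nppiThreshold_LTValGTVal_32f_C1R(srcArrayDevice + (srcWidth*yRect + xRect)
			, srcWidth * 4	// source step in bytes
			, srcArrayDevice2 + (srcWidth*yRect + xRect)
			, srcWidth * 4	// destination step in bytes
			, sz
			, thrL	// values below this threshold...
			, thrL	// ...are replaced with this value
			, thrH	// values above this threshold...
			, thrH);	// ...are replaced with this value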
	}

	int length = (srcWidth + paddStride)*srcHeight;
	int status6 = nppiScale_32f8u_C1R(srcArrayDevice, srcWidth * 4, dstArrayDevice + paddStride, srcWidth + paddStride, roiSize, 0, 65535);

	//int status7 = cudaMemcpyAsync(dstPinPtr, dstTest, length, cudaMemcpyDeviceToHost, testStream);
	cudaFree(srcArrayDevice);
	cudaFree(srcArrayDevice2);
	cudaFree(dstArrayDevice);
	cudaStreamDestroy(testStream);
	cudaProfilerStop();
	return 0;
}
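For anyone adapting the example, the ROI addressing used in the loop can be checked on the host before calling NPP. This is only a sketch: RoiView and makeRoiView are hypothetical names, not part of NPP, and it assumes unpadded rows (step = width * sizeof(float)), as in the cudaMalloc calls above.

```cpp
#include <cstddef>

// Hypothetical helper type (not part of NPP): where a ROI starts inside an
// unpadded float image and how far apart its rows are.
struct RoiView {
    std::size_t elementOffset; // offset of the ROI's first pixel, in floats
    int stepBytes;             // distance between consecutive rows, in bytes
};

// Fills `view` and returns true when the ROI lies fully inside the image;
// returns false (leaving `view` untouched) otherwise.
bool makeRoiView(int imgWidth, int imgHeight,
                 int x, int y, int roiWidth, int roiHeight,
                 RoiView& view) {
    if (x < 0 || y < 0 || roiWidth <= 0 || roiHeight <= 0) return false;
    if (x + roiWidth > imgWidth || y + roiHeight > imgHeight) return false;
    // Same arithmetic as srcArrayDevice + (srcWidth*yRect + xRect) in the post.
    view.elementOffset = static_cast<std::size_t>(imgWidth) * y + x;
    // Same value as srcWidth * 4 in the post, written with sizeof(float).
    view.stepBytes = imgWidth * static_cast<int>(sizeof(float));
    return true;
}
```

With the values from the post (1344x1344 image, ROI of 200x400 at x=60, y=100) the ROI fits inside the image, so the pointer offsets passed to the threshold call stay within the allocations.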

This is what I got from the NVIDIA Visual Profiler: https://drive.google.com/file/d/0Bwm_zbskOasVWmpldFhBZ0VwSTg/view?usp=sharing

Why are there 2 streams?
I noticed that this behaviour depends on the size of the image. If width and height are set to 1500, the result is this:

https://drive.google.com/file/d/0Bwm_zbskOasVc3Qxdk9uZ3hTelU/view?usp=sharing

Is this a bug or am I missing something?

sanzo · August 31, 2016, 9:03am

This is an answer I received on another site; I'm posting it here.

It appears that nppiThreshold_LTValGTVal_32f_C1R creates its own internal stream for executing one of the kernels it uses. The other is launched either into the default stream, or the stream you specified with nppSetStream.
I think this is really a documentation oversight/user expectation problem. nppSetStream is doing what it says, but nowhere is it stated that the library is limited to using one stream. It probably should be more explicit in the documentation about how many streams the library uses internally, and how nppSetStream interacts with the library. If this is a problem for your application, I suggest you raise a bug report with NVIDIA.
