NPP & stream problems?

Hello All,

I have a couple of questions about how the NVIDIA NPP library works. Maybe I missed something, but it may also be an issue with the library itself. So let’s start.

My PC configuration is as follows:
OS: Windows 7 x64 Enterprise
Visual Studio 2015 Community Edition with Update 2
CUDA SDK: 8.0.61 (but I also tested with the newest 9.0 and got the same results)

GPU: GTX 980

Introduction:
In the application that I’m developing we have different modules that may work independently. Each of them can use the CPU/GPU at any time for processing. If I do NOT use streams, everything works fine, but when I start using streams a couple of weird things happen… So I would like to clarify how it should work.

1) First problem: the stream creation/destruction order matters, but it shouldn’t.
In the code below, a flag in the destructor changes the order in which the streams created in the constructor are destroyed. If the destruction order is the same as the construction order, the NPP function fails in the next iteration for no apparent reason. If I simply reverse the order of stream destruction, the NPP function seems to work correctly; at least it doesn’t report any error. If you want to follow this problem, please compile the code with the inverseOrder flag set to true and then to false and compare the results. Is there any reasonable explanation for this behavior? What happens if many threads in my application create their own streams and destroy them in different orders?
Someone could say that switching streams while using NPP is not thread safe, but that will be my next question.

#include "stdafx.h"

#include <cuda.h>
#include <cuda_runtime.h>
#include <npps.h>
#include <npp.h>

#include <stdexcept>

#define CheckCudaError(err) ErrorHandler(err, __FUNCTION__, __FILE__, __LINE__)

void ErrorHandler(int errorCode, const char* fn, const char* file, int line) {
	if (errorCode != 0) {
		char errorMessage[4096];
		sprintf_s(errorMessage, "Error=%d in file=%s, line=%d, function=%s", errorCode, file, line, fn);
		throw std::runtime_error(errorMessage);
	}
}

void ErrorHandler(cudaError_t errorCode, const char* fn, const char* file, int line) {
	ErrorHandler((int)errorCode, fn, file, line);
}
void ErrorHandler(NppStatus errorCode, const char* fn, const char* file, int line) {
	ErrorHandler((int)errorCode, fn, file, line);
}

class SomeObject {
private:
	static const int count = 3;

protected:
	int _a, _b;
	cudaStream_t _streams[count];
	void* _pDev;

public:
	SomeObject(int a, int b)
	:_a(a), _b(b) {
		CheckCudaError(cudaMalloc(&_pDev, a*b*sizeof(Npp32f)));
		for(int i=0; i < count; ++i)
			CheckCudaError(cudaStreamCreate(&_streams[i]));		
	}

	~SomeObject() {
		// Toggle this flag to reverse the order in which the streams are destroyed.
		const bool inverseOrder = false;
		if (inverseOrder) {
			for (int i = count - 1; i >= 0; --i)
				CheckCudaError(cudaStreamDestroy(_streams[i]));
		}		
		else {
			for (int i = 0; i < count; ++i)
				CheckCudaError(cudaStreamDestroy(_streams[i]));
		}
		CheckCudaError(cudaFree(_pDev));
	}

	void DoSomething(int i) {
		// nppSetStream() changes NPP's single, global "current stream" for all
		// subsequent NPP calls in the process.
		nppSetStream(_streams[0]);
		CheckCudaError(nppsSet_32f(0.f, (Npp32f*)_pDev, _a*_b));
		CheckCudaError(cudaDeviceSynchronize());
	}
};

int main()
{
	const int width = 10000;
	const int height = 10000;

	try {
		for (int i = 0; i < 1000; ++i) {
			SomeObject o(width, height);
			o.DoSomething(i);
		}
	}
	catch (std::runtime_error& e) {
		printf("Caught Exception = %s.", e.what());
	}

    return 0;
}

2) Switching streams when using NPP causes an error EVEN when the stream change is synchronized with cudaDeviceSynchronize().
The code is similar to the previous one. Let’s assume that we want to do something with NPP in different streams. According to the documentation, an NPP function can call another NPP function underneath, so changing the stream without “stream or device synchronization” can cause trouble. In my example I think I did everything correctly, but the NPP function still reports an error. Is something wrong with my code, or with the NPP library?

#include "stdafx.h"

#include <cuda.h>
#include <cuda_runtime.h>
#include <npps.h>
#include <npp.h>

#include <stdexcept>

#define CheckCudaError(err) ErrorHandler(err, __FUNCTION__, __FILE__, __LINE__)

void ErrorHandler(int errorCode, const char* fn, const char* file, int line) {
	if (errorCode != 0) {
		char errorMessage[4096];
		sprintf_s(errorMessage, "Error=%d in file=%s, line=%d, function=%s", errorCode, file, line, fn);
		throw std::runtime_error(errorMessage);
	}
}

void ErrorHandler(cudaError_t errorCode, const char* fn, const char* file, int line) {
	ErrorHandler((int)errorCode, fn, file, line);
}
void ErrorHandler(NppStatus errorCode, const char* fn, const char* file, int line) {
	ErrorHandler((int)errorCode, fn, file, line);
}

class SomeObject {
private:
	static const int count = 3;

protected:
	int _a, _b;
	cudaStream_t _streams[count];
	void* _pDev;

public:
	SomeObject(int a, int b)
	:_a(a), _b(b) {
		CheckCudaError(cudaMalloc(&_pDev, a*b*sizeof(Npp32f)));
		for(int i=0; i < count; ++i)
			CheckCudaError(cudaStreamCreate(&_streams[i]));		
	}

	~SomeObject() {
		for (int i = count - 1; i >= 0; --i)
			CheckCudaError(cudaStreamDestroy(_streams[i]));

		CheckCudaError(cudaFree(_pDev));
	}

	void DoSomething(int i) {
		// Round-robin over the streams, so every call switches NPP's global stream.
		i = i % count;

		nppSetStream(_streams[i]);
		CheckCudaError(nppsSet_32f(0.f, (Npp32f*)_pDev, _a*_b));
		// Device-wide synchronization before the next call switches the stream again.
		CheckCudaError(cudaDeviceSynchronize());
	}
};

int main()
{
	const int width = 10000;
	const int height = 10000;

	try {
		SomeObject o(width, height);
		for (int i = 0; i < 1000; ++i) {	
			o.DoSomething(i);
		}
	}
	catch (std::runtime_error& e) {
		printf("Caught Exception = %s.", e.what());
	}

    return 0;
}

3) Changing the stream used by the NPP library has to be synchronized by “stream or device synchronization”.
I think this question is targeted more at the NVIDIA developers…
So, I understand that NPP uses a global state to keep track of the current stream. But my question is: what is the logic behind that? If there are “stream queues” in the driver, why can’t the stream information be associated with the function call and resolved at execution time? That would allow parallelization on the GPU and give developers more flexibility to use the powerful NPP library.

As the documentation says, synchronizing between changing the stream and the NPP function call will kill performance, so again, I don’t understand the logic behind it.

In this case my question is: how can I use the NPP library from multiple threads that don’t know about each other? Using serialized NPP calls (on stream 0) is an option, but it is not efficient.
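To be clear about what I mean by “serialized”: the only safe pattern I can see right now is to guard every stream switch + NPP call with one global lock and synchronize the whole device around it. Below is a minimal sketch of that workaround (g_nppMutex and SetToZeroSerialized are my own names, error checking omitted for brevity):

#include <mutex>
#include <cuda_runtime.h>
#include <npps.h>

std::mutex g_nppMutex;  // hypothetical global lock, not part of NPP

void SetToZeroSerialized(Npp32f* pDev, int length, cudaStream_t stream)
{
	std::lock_guard<std::mutex> lock(g_nppMutex);
	cudaDeviceSynchronize();          // make sure no NPP work from another stream is in flight
	nppSetStream(stream);             // global state change, now (hopefully) safe
	nppsSet_32f(0.f, pDev, length);   // the actual NPP call
	cudaDeviceSynchronize();          // finish before releasing the lock / switching again
}

This is safe as far as I can tell, but it means only one of my modules can use NPP at a time, which defeats the purpose of having streams in the first place.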

I also read about the “default stream per thread” option. It may be a solution for multithreading.
Please correct me if I am wrong.

Timeline: ------------------------------------------------------------------------------->
Thread A: npps1A_fn()________ npps2A_fn()____________________________________________
Thread B: npps1B_fn()________ npps2B_fn()______________________________________________

If I use “default stream per thread”, will the execution look like the diagram above? I mean, will npps1A_fn() be executed simultaneously with npps1B_fn()?
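If I understand it right, the idea would look roughly like the sketch below: compile with nvcc --default-stream per-thread, never call nppSetStream(), and hope that each host thread’s NPP calls land in that thread’s own default stream. I’m not sure NPP really launches its kernels on the per-thread default stream in this mode, so please treat this as an assumption (ThreadWork is just a placeholder for my real processing, error checking omitted):

#include <thread>
#include <cuda_runtime.h>
#include <npps.h>

void ThreadWork(Npp32f* pDev, int length)
{
	nppsSet_32f(0.f, pDev, length);              // should land in this thread's default stream
	cudaStreamSynchronize(cudaStreamPerThread);  // wait only for this thread's own stream
}

int main()
{
	const int length = 10000 * 10000;
	Npp32f *pA = nullptr, *pB = nullptr;
	cudaMalloc(&pA, length * sizeof(Npp32f));
	cudaMalloc(&pB, length * sizeof(Npp32f));

	std::thread a(ThreadWork, pA, length);  // npps1A_fn() from the diagram
	std::thread b(ThreadWork, pB, length);  // npps1B_fn() from the diagram
	a.join();
	b.join();

	cudaFree(pA);
	cudaFree(pB);
	return 0;
}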

With respect to the current implementation, is it possible to do something like this:

  1. I have a couple of input signals that are independent of each other.
  2. Based on these signals, I can process the data using different algorithms. The algorithms are not complicated and can be realized as a combination of npps/nppi functions.
  3. Let’s assume that I have to use two different algorithms to generate the output signals.
    Is there any way to parallelize the set of npps/nppi commands that I need to produce one output signal with the set of npps/nppi commands that I need for the second signal?

My way of thinking is to put those two sets of commands into two different non-blocking streams and synchronize at the end, as sketched below. But with the current implementation that is not possible. Am I right?
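To make point 3 concrete, this is roughly what I would like to write (ProcessTwoSignals is my own name, and the two nppsSet_32f() calls stand in for my real npps/nppi chains). As far as I understand the current nppSetStream() design, the cudaDeviceSynchronize() between the two blocks is mandatory, and that serializes the two chains instead of letting them overlap:

#include <cuda_runtime.h>
#include <npps.h>

void ProcessTwoSignals(Npp32f* pOutA, Npp32f* pOutB, int length)
{
	cudaStream_t sA, sB;
	cudaStreamCreateWithFlags(&sA, cudaStreamNonBlocking);
	cudaStreamCreateWithFlags(&sB, cudaStreamNonBlocking);

	nppSetStream(sA);
	nppsSet_32f(1.f, pOutA, length);   // placeholder for the first npps/nppi chain
	cudaDeviceSynchronize();           // required before switching NPP's stream -> kills the overlap

	nppSetStream(sB);
	nppsSet_32f(2.f, pOutB, length);   // placeholder for the second chain

	// Synchronize both streams at the end.
	cudaStreamSynchronize(sA);
	cudaStreamSynchronize(sB);

	cudaStreamDestroy(sA);
	cudaStreamDestroy(sB);
}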

Thank you in advance!
Bartek

Did you solve the issue?