Manage creating/using/destroying multiple CUDA streams and device buffers

So I have a class like this:

class InferenceObject
{
public:
	// CUDA stream created with cudaStreamNonBlocking
	cudaStream_t stream;
	// Pointers to GPU memory allocated with cudaMalloc()
	void* bindings[3];
	// Set to true once enqueue() has been started with this stream and these bindings
	bool started = false;
	// Creates the stream and allocates bindings; Network holds the ICudaEngine with the dims
	void create(Network* network);
	// Deallocates memory and destroys the CUDA stream
	void clear();
	// Returns true if (started == true && cudaStreamQuery(stream) == cudaSuccess)
	bool cudaExecutionCompleted();

	InferenceObject();
	InferenceObject(Network* network);
	~InferenceObject();
};

I process frames from a video stream and create a cudaStream_t and void* bindings for each processed frame, so I can track when execution for each separate frame has completed.
I store all objects of the InferenceObject class in a std::map.
But:
When I try to use std::map&lt;int, InferenceObject&gt;, I get CUDA error 17: invalid device pointer.
When I try to use std::map&lt;int, InferenceObject*&gt;, I get no errors, but the network sees nothing.
Usage in detector with std::map<int, InferenceObject*>:

int Detector::startObjectsDetection(const cv::Mat& image)
{
	// Generate new object id
	int objectsDetectionId = generateObjectsDetectionId();
	// Create object
	InferenceObject* inferObject = new InferenceObject();
	// Create stream and allocate buffers
	inferObject->create(fNetwork);
	// Convert data
	auto blob = convertImageToBlob(image, fNetwork->fInputDims.d[1], fNetwork->fInputDims.d[2]);
	// Copy HtD
	CHECK(cudaMemcpy(inferObject->bindings[0], (void*)(blob.data()), fNetwork->inputBufferSize() * sizeof(float), cudaMemcpyHostToDevice));
	// Start inference
	fContext->enqueue(1, inferObject->bindings, inferObject->stream, NULL);
	// Set flag
	inferObject->started = true;
	// Add into map
	fCurrentObjectsDetectionsMap.insert(std::make_pair(objectsDetectionId, inferObject));
	// Return generated id
	return objectsDetectionId;
}
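One thing worth noting in startObjectsDetection() above: cudaMemcpy() is synchronous and runs on the default stream, while enqueue() is queued asynchronously on the object's own non-blocking stream. A hedged sketch of how the copy could be issued on the same stream instead, reusing the names from the snippet above (for a truly asynchronous copy the host blob would also need to be in pinned memory):

```cuda
// Hypothetical rework of the HtoD step: queue the copy on the object's own
// stream, so it is ordered before the inference enqueued on that stream.
CHECK(cudaMemcpyAsync(inferObject->bindings[0],
                      (void*)(blob.data()),
                      fNetwork->inputBufferSize() * sizeof(float),
                      cudaMemcpyHostToDevice,
                      inferObject->stream));
// enqueue() on the same stream now runs strictly after the copy completes.
fContext->enqueue(1, inferObject->bindings, inferObject->stream, NULL);
```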

and

std::vector<DetectedObject> Detector::retrieveDetectedObjects(int objectsDetectionId, float detectionThreshold, float intersectionOverUnionThreshold)
{
	// Vector of detected objects
	std::vector<DetectedObject> detectedObjects;
	// Find exact InferenceObject
	auto objectsDetection = fCurrentObjectsDetectionsMap.find(objectsDetectionId);
	// If found one
	if (objectsDetection != fCurrentObjectsDetectionsMap.end())
	{
		// Copy output data buffers DtH
		CHECK(cudaMemcpy(fOutBuffers[0], objectsDetection->second->bindings[1], fNetwork->outputBufferSize(0) * sizeof(float), cudaMemcpyDeviceToHost));
		CHECK(cudaMemcpy(fOutBuffers[1], objectsDetection->second->bindings[2], fNetwork->outputBufferSize(1) * sizeof(float), cudaMemcpyDeviceToHost));
		// Process buffers
		processNetworkOutput(fOutBuffers, detectionThreshold, detectedObjects);
		// Set flag
		objectsDetection->second->started = false;
		std::cout << "\t\tBefore filtering: " << detectedObjects.size() << "\n";
		// Filter objects by IOU
		filterDetectedObjects(detectedObjects, detectionThreshold, intersectionOverUnionThreshold);
		std::cout << "\t\tAfter filtering: " << detectedObjects.size() << "\n";

		objectsDetection->second->clear();
		// Remove InferenceObject from map
		fCurrentObjectsDetectionsMap.erase(objectsDetection);
	}
	// Return detected objects
	return detectedObjects;
}
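Likewise, in retrieveDetectedObjects() the DtoH cudaMemcpy() calls on the default stream do not wait for work queued on the object's non-blocking stream, so they can read the output buffers before inference has actually finished. A sketch of waiting on the stream first, with the names from the snippet above:

```cuda
// Make sure the inference enqueued on this object's stream has finished
// before reading its output bindings back to the host.
CHECK(cudaStreamSynchronize(objectsDetection->second->stream));
CHECK(cudaMemcpy(fOutBuffers[0], objectsDetection->second->bindings[1],
                 fNetwork->outputBufferSize(0) * sizeof(float),
                 cudaMemcpyDeviceToHost));
```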

When I used only one CUDA stream per detector, everything worked just fine.
But when I tried to move the streams into a separate object, I got all these problems.
Any ideas what this is caused by?
Thanks in advance.

Hi,
Please refer to the link below in case it helps:

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-700/tensorrt-best-practices/index.html#streaming

Also, please share sample code along with a verbose error log that reproduces the issue, so we can help better.
Thanks!

How can I get that? I found no information about it, except overriding nvinfer1::ILogger::log() for use with createInferBuilder and createParser in the samples.
Currently the only output I get is this string in the terminal: "CUDA error 17: invalid device pointer", when I’m not using pointers. There is no other output from CUDA or TensorRT whatsoever.
If you can share how to get verbose output from my code, I’ll do it.

Also, I now think it is because STL containers (including std::map) make their own copy of the inserted object, but that shouldn’t happen when using pointers, right? And I’m still getting wrong output (without errors, but still) even when I’m using pointers.
Or maybe it has something to do with the use of std::shared_ptr in all the C++ samples?

No, it didn’t, unfortunately. I found one interesting thing I didn’t know, but it isn’t related to this subject.

Please set the log severity to INFO to get detailed logs.
Thanks!
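For reference, "setting the severity" in TensorRT usually means implementing nvinfer1::ILogger yourself and passing it to createInferBuilder()/createInferRuntime(); the log() override decides which messages get printed. A minimal sketch against the TensorRT 7-era API, with kINFO as the threshold:

```cuda
#include <iostream>
#include "NvInfer.h"

// Minimal logger: prints every message at or above the chosen severity.
class Logger : public nvinfer1::ILogger
{
public:
    explicit Logger(Severity threshold = Severity::kINFO)
        : mThreshold(threshold) {}

    void log(Severity severity, const char* msg) override
    {
        // Lower enum values are more severe, so <= keeps INFO and worse.
        if (severity <= mThreshold)
            std::cout << msg << std::endl;
    }

private:
    Severity mThreshold;
};

// Usage: Logger gLogger(nvinfer1::ILogger::Severity::kINFO);
//        auto builder = nvinfer1::createInferBuilder(gLogger);
```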

Again: how? Which log exactly? Where does this logger have to be? How do I use it?
If I knew how to do this, I wouldn’t be asking.

Can you please share your code, so that we can reproduce the issue?
Thanks!

I’ll make a minimal example.

It seems that in this particular case I was getting error 17 because of Core.cpp:569:

objectsDetection->second->clear();

This clear() was called twice: once manually, and a second time in the destructor.
But even after removing the manual call, the data in the bindings buffers is still not correct.
And I have no idea why that can be.

I found out that after the first enqueue(), every subsequent call returns one and the same result no matter what input it gets, which is strange.
When I replace it with execute(), I get all results correct with the same buffers on device and host.
This means it is unlikely to be because of the buffers.
So what can it be then? Incorrectly used CUDA streams, or something else?
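The enqueue()-vs-execute() observation fits a stream-synchronization problem: execute() is synchronous, while enqueue() only queues work on the stream and returns immediately. A sketch of what should make enqueue() behave like execute(), assuming the names from the earlier snippets:

```cuda
// execute() blocks until inference is done; enqueue() returns at once.
// Waiting on the stream after enqueue() gives execute()-like behavior:
fContext->enqueue(1, inferObject->bindings, inferObject->stream, NULL);
CHECK(cudaStreamSynchronize(inferObject->stream));
// At this point the output bindings are safe to read, as with execute().
```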

This variant of the program was ditched, and thus a solution was never found.
I think it is probably related to missing synchronization between the cudaMemcpy calls on the default CUDA stream and the asynchronous executions on the other created streams.

Hi @redoutracer,
Our team is looking into this.
We will keep you posted on the response.
Thank you for your patience and understanding.