CUDA operation using TensorRT results

We want to subtract the denoised images (the output of DnCNN) from the original images and save only the subtracted images that meet our criteria.
We can get the denoised images, but currently our code copies the whole result back to the CPU and does the processing there (subtract + check the condition), so we are stuck with a bottleneck.
We are thinking of doing the CUDA operations on the GPU and reducing the number of output images to solve it, but we have no idea how…

Is it possible to run CUDA operations directly on the output images of TensorRT?

Provide details on the platforms you are using:
OS : Windows 10
GPU : RTX 2080
Nvidia driver version : 441.12
CUDA 10.0
CUDNN 7.6.1
TensorRT version


Yes, I think it is possible to run CUDA operations on the output images of TensorRT.
Before the output is copied back to the host, TensorRT's output resides only in device memory.
You can pass that output buffer to your own CUDA operations for further processing.
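To illustrate the "reduce the output on the GPU before copying" idea, here is a plain C++ sketch of the selection logic (on the device the same check would run in a kernel; the function names, the threshold of 20, and the pixel-count criterion are hypothetical stand-ins, not the actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical criterion: keep an image if enough pixels differ strongly
// between the original and the denoised result.
bool meetsCriteria(const float* in, const float* out, std::size_t pixels,
                   float threshold, std::size_t minHits)
{
    std::size_t hits = 0;
    for (std::size_t i = 0; i < pixels; ++i)
        if (std::fabs(out[i] - in[i]) > threshold)
            ++hits;
    return hits >= minHits;
}

// Returns the indices of the images in a batch worth copying back to host.
std::vector<int> selectImages(const float* ins, const float* outs,
                              int batch, std::size_t pixels,
                              float threshold, std::size_t minHits)
{
    std::vector<int> keep;
    for (int b = 0; b < batch; ++b) {
        const float* in  = ins  + b * pixels;
        const float* out = outs + b * pixels;
        if (meetsCriteria(in, out, pixels, threshold, minHits))
            keep.push_back(b);
    }
    return keep;
}
```

Copying only the selected images instead of the full batch is what removes the host-side bottleneck.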

Please refer to the sample links below in case they help:


Thanks to your advice, we made some progress, but we're still having a problem…

We're using a modified version of the TensorRT sample called "sampleOnnxMNIST". We changed sampleOnnxMNIST.cpp somewhat and converted it to a .cu file so we could add our own CUDA operation.

We partially succeeded: the first batch of images was inferred and we got some subtracted images. However, an error occurs in the next loop iteration when buffers.copyInputToDeviceAsync(stream) is called (the line marked // error... in the block below).

Note that the function called image_process is one of our changes.

void image_process(float *ins, float *outs, float *dst, int &num_batch, int &rows, int &cols, cudaStream_t stream) {
	my_kernel <<<num_batch, rows, cols, stream>>> (ins, outs, dst, rows, cols);
}

__global__ void my_kernel(float *ins, float *outs, float *dst, int &rows, int &cols) {
	const int2 thread_2D_pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
		blockIdx.y * blockDim.y + threadIdx.y);

	int thread_1D_pos = thread_2D_pos.y * cols + thread_2D_pos.x;

	dst[thread_1D_pos] = abs(outs[thread_1D_pos] - ins[thread_1D_pos]);

	if (dst[thread_1D_pos] > 20) dst[thread_1D_pos] = 255.0;
}
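For reference, the per-pixel operation the kernel is meant to compute can be written on the host like this (a plain C++ sketch of the same math, not the actual device code; the function name is ours):

```cpp
#include <cmath>
#include <cstddef>

// Host-side reference of the kernel's per-pixel math:
// dst = |out - in|, then saturate to 255 where the difference exceeds 20.
void subtract_and_threshold(const float* ins, const float* outs, float* dst,
                            std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = std::fabs(outs[i] - ins[i]);
        if (dst[i] > 20.0f)
            dst[i] = 255.0f;
    }
}
```

Comparing the kernel's output against such a reference on a small batch is a quick way to catch indexing mistakes.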

bool SampleOnnxMNIST::infer()
{
	// Create RAII buffer manager object
	samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

	auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
	if (!context)
		return false;
	const int digit = 1;

	// Load Input data from HD to memory
	if (!loadInput_opencv())
		return false;
	// Create CUDA stream for the execution of this inference.
	cudaStream_t stream;

	//for (int i = 0; i < batch_num ; i++) {
	for (int i = 0; i < batch_num + 1; i++) {

		// Read the input data into the managed buffers
		assert(mParams.inputTensorNames.size() == 1);
		if (!processInput_opencv(buffers, mParams.inputTensorNames[0], i))
			return false;

		// Memcpy from host input buffers to device input buffers
		buffers.copyInputToDeviceAsync(stream); // error...

		const auto t_start = std::chrono::high_resolution_clock::now();

		if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
			return false;

		const auto t_end = std::chrono::high_resolution_clock::now();
		const float ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();

		//for (int i = 0; i < 8; i++) outLst.push_back(cv::Mat::zeros(Size(128, 128), CV_32FC1));

		//buffers.simpleTest(outLst, mParams.outputTensorNames[0]);
		auto bufsize = buffers.size(mParams.outputTensorNames[0]);
		float *ins = buffers.getData(mParams.inputTensorNames[0]);
		float *outs = buffers.getData(mParams.outputTensorNames[0]);
		float *dst;
		cudaMalloc(&dst, mParams.batchSize*mInputDims.d[1]*mInputDims.d[2]*sizeof(float));
		image_process(ins, outs, dst, mParams.batchSize, mInputDims.d[1], mInputDims.d[2], stream);

		gLogInfo << std::endl;
		gLogInfo << " Average tact time per Batch is " << ms << " ms." << std::endl;
		gLogInfo << " Average tact time per Image is " << ms / mParams.batchSize << " ms." << std::endl;

		//void simpleTest(vector<cv::Mat> &imgset, const std::string& tensorName)
		//for (int i = 0; i < 8; i++) {
		//	outLst[i].convertTo(outLst[i], CV_8UC1);
		//	char buffer[100];
		//	sprintf_s(buffer, "%d.bmp", count);
		//	count++;
		//	imwrite(buffer, outLst[i]);
		//	//waitKey(100);
		//bool outputCorrect = verifyOutput(buffers, mParams.outputTensorNames[0], digit);
		//cout << 1 << endl;
		bool outputCorrect = verifyOutput_array(dst, buffers, mParams.outputTensorNames[0]);
		//cout << 2 << endl;
	}
	// Wait for the work in the stream to complete

	// Release stream

	// Check and print the output of the inference
	// There should be just one output tensor
	assert(mParams.outputTensorNames.size() == 1);

	//// Asynchronously copy data from device output buffers to host output buffers

	//// Wait for the work in the stream to complete

	//// Release stream

	//// Check and print the output of the inference
	//// There should be just one output tensor
	//assert(mParams.outputTensorNames.size() == 1);
	////bool outputCorrect = verifyOutput(buffers, mParams.outputTensorNames[0], digit);
	//bool outputCorrect = verifyOutput(buffers, mParams.outputTensorNames[0]);

	return true;
}


It seems that the error comes from the cudaMemcpyAsync call (marked // error) in the block below.


void memcpyBuffers(const bool copyInput, const bool deviceToHost, const bool async, const cudaStream_t& stream = 0)
{
    for (int i = 0; i < mEngine->getNbBindings(); i++)
    {
        void* dstPtr
            = deviceToHost ? mManagedBuffers[i]->hostBuffer.data() : mManagedBuffers[i]->deviceBuffer.data();
        const void* srcPtr
            = deviceToHost ? mManagedBuffers[i]->deviceBuffer.data() : mManagedBuffers[i]->hostBuffer.data();
        const size_t byteSize = mManagedBuffers[i]->hostBuffer.nbBytes();
        const cudaMemcpyKind memcpyType = deviceToHost ? cudaMemcpyDeviceToHost : cudaMemcpyHostToDevice;
        if ((copyInput && mEngine->bindingIsInput(i)) || (!copyInput && !mEngine->bindingIsInput(i)))
        {
            if (async)
                CHECK(cudaMemcpyAsync(dstPtr, srcPtr, byteSize, memcpyType, stream)); // error
            else
                CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType));
        }
    }
}


Could you please share the exact error message along with the stack trace so we can help better?
Also, if possible, please share a repro script.



We can see only this error message in cmd:

Cuda failure: 4

Also, here is the call stack window in Visual Studio.

[inline frame] sampleONNXMNIST_cu.exe!samplesCommon::BufferManager::memcpyBuffers(const bool) line 484
location: c:\tensorrt-\samples\common\buffers.h(484)
[inline frame] sampleONNXMNIST_cu.exe!samplesCommon::BufferManager::copyInputToDeviceAsync(CUstream_st * const &) line 421
location: c:\tensorrt-\samples\common\buffers.h(421)
sampleONNXMNIST_cu.exe!SampleOnnxMNIST::infer() line 304
location: c:\tensorrt-\samples\sampleonnxmnist_cu\sampleonnxmnist_cu\
sampleONNXMNIST_cu.exe!main(int argc, char * * argv) line 836
location: c:\tensorrt-\samples\sampleonnxmnist_cu\sampleonnxmnist_cu\

buffers.h(484) is in the code block of my previous reply.

It seems you have commented out the synchronize API call in your code.
After async calls you either have to synchronize, or, if you have dependent work you want to queue up on the GPU, you can use CUDA events.


We tried synchronizing after all async calls and also changing the async calls to sync calls, but it still doesn't work… :-(


You can try using cuda-memcheck to debug this issue:


We solved the problem. A wrong t_size in the kernel was the problem, I guess…
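For anyone hitting the same thing: the total thread count implied by the launch configuration has to cover every pixel, and the third `<<<…>>>` argument is the dynamic shared-memory size in bytes, not a dimension. A plain C++ sketch of the sizing arithmetic (names like `t_size` follow this thread; this is host-side math, not the actual kernel):

```cpp
#include <cstddef>

// A covering 1D launch geometry for a batch of rows x cols images:
// enough blocks of blockSize threads so that every pixel gets a thread.
struct LaunchGeometry {
    std::size_t blocks;
    std::size_t threadsPerBlock;
};

LaunchGeometry coveringLaunch(std::size_t batch, std::size_t rows,
                              std::size_t cols, std::size_t blockSize)
{
    const std::size_t t_size = batch * rows * cols;                  // total pixels
    const std::size_t blocks = (t_size + blockSize - 1) / blockSize; // round up
    return {blocks, blockSize};
}
```

With such a geometry the kernel would index as `blockIdx.x * blockDim.x + threadIdx.x` and guard with `if (idx >= t_size) return;` so the last block's extra threads do not write out of bounds.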

Thank you for the response. We’ll post when we need more help :)

And in our case, memcheck did not work: it ran for hours and nothing happened, so I killed the process.