Random CUDA 719 errors during inference

Our application loads multiple models, both Caffe and ONNX (PyTorch). Recently we added an ONNX model based on ResNeXt-101. Adding this model causes the application to fail with CUDA error 719 somewhere between 10 minutes and 1 hour after the start of the run.

The application uses TensorRT, CUDA 10.1 and cuDNN. The issue appears on Windows only; Linux does not have it.

It has been observed on multiple GPUs (Turing and Pascal) and with multiple driver versions.

The application uses a single thread to launch all the models. Each model is launched asynchronously, and all models share the same CUDA stream.

How can I determine what the issue is? This model alone runs fine in trtexec, but it seems to behave unexpectedly when coupled with one of our other models.



CUDA error 719 is CUDA_ERROR_LAUNCH_FAILED: an exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out-of-bounds shared memory. This leaves the process in an inconsistent state, and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.

Please refer to the link below for more details (enum CUresult -> CUDA_ERROR_LAUNCH_FAILED = 719):

One thing to try is adding a synchronization between the running models; that will tell you whether this is a synchronization issue.
I guess this can also occur if a kernel takes too long to run and the display shares the same GPU.

Also, could you please run cuda-memcheck (https://docs.nvidia.com/cuda/cuda-memcheck/index.html) to try to find the issue in this case?
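The suggestion to synchronize between models could be sketched roughly as follows. This is only an illustration, not code from the application: `Stage` and `findFailingStage` are hypothetical names, and the `sync` callable is assumed to wrap something like cudaStreamSynchronize on the shared stream, returning 0 on success. Because kernel faults are reported asynchronously, waiting after each model narrows the error down to the model whose work was in flight.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One pipeline stage: a name for reporting plus a callable that enqueues the
// model's work asynchronously (hypothetical wrapper around the real launch).
struct Stage {
    std::string name;
    std::function<void()> launch;
};

// Launches each stage, then blocks on `sync` before launching the next one.
// `sync` returns 0 on success and a non-zero error code otherwise.
// Returns the name of the first stage whose sync reported an error together
// with that error code, or an empty name and 0 if every stage completed.
std::pair<std::string, int> findFailingStage(const std::vector<Stage>& stages,
                                             const std::function<int()>& sync) {
    for (const Stage& stage : stages) {
        stage.launch();
        const int err = sync();
        if (err != 0) {
            return {stage.name, err};
        }
    }
    return {std::string(), 0};
}
```

In the real application, `sync` might be a lambda such as `[&] { return static_cast<int>(cudaStreamSynchronize(stream)); }`. If the error disappears once every model is synchronized, that points toward an ordering problem rather than a fault inside a single model.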


Thanks for your fast reply.
I was able to reproduce the issue with a minimal example based on trtexec. Just before the inference loop I start a thread that performs small, asynchronous, non-contiguous memory transfers from host to device.
When this thread runs in parallel with my model for many iterations, I get error 719. For some reason, this only happens with ResNeXt-101-based models in ONNX format. I tested MobileNet and VGG, and the issue does not reproduce with them.

void copyThread(AllOptions options)
{
	cudaStream_t stream;
	CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

	int bufferSize = 5 * 1024 * 1024;
	char* deviceBuffer;
	CHECK(cudaMalloc(&deviceBuffer, bufferSize));

	// Pageable host buffer. Its size was not shown in the posted snippet;
	// 10 * 1024 bytes is assumed here so the reads below stay in range.
	std::vector<char> cpuMem(10 * 1024);
	while (1)
	{
		// Do small async copies to non-contiguous device offsets
		for (int i = 0; i < 10; ++i)
		{
			CHECK(cudaMemcpyAsync(deviceBuffer + i * 2048, &cpuMem.at(i * 1024), 1024, cudaMemcpyHostToDevice, stream));
		}
	}
}

I will try to reproduce it with TensorRT 7 and CUDA 10.2. However, upgrading would require calibrating our models again, which is not possible in our current time frame. I would be glad to know if there is any way to work around this issue.



Could you please share the repro script and model files so we can help better?


Link to a model that reproduces the issue

Modified source code of trtexec.cpp
I built it with Visual Studio 2015, linked against CUDA 10.1.243 and cuDNN for CUDA 10.1.

Run it with the following arguments:
trtexec.exe --onnx=last_model.onnx --iterations=100000

GTX 1080Ti with driver version 441.22

I ran the same test based on trtexec from TensorRT 7 with CUDA 10.2, and the issue did not reproduce.


Is the issue resolved with TRT 7?

Yes. TRT 7 does not have the issue.