CUDA failure when ffmpeg decodes H264 and gstreamer encodes mp4 file

I am using a TX2 to run inference on an input video (H264) and write the output to an mp4 file.

The pipeline is like:

3rd-party library decodes H264 with ffmpeg → TX2 runs inference on the decoded image → the image is encoded and written to an mp4 file (using jetson-utils videoOutput with gstreamer).

But I always get a CUDA error in the gstEncoder::Render() function when outputting the image. The error is like below.
[cuda] unspecified launch failure (error 719) (hex 0x2CF)

Previously I used jetson-utils videoSource (gstreamer) to decode the H264 and videoOutput to encode the mp4 file, and it had no problems.

The 3rd-party library uses ffmpeg to decode H264 in another thread, and I am not sure if the issue is related to multiple threads running on one GPU device. Can I call the CUDA runtime API from multiple different threads?

Thanks
Harry

Hi @harry_xiaye, are you using jetson-inference for the inference portion too? If so, does the pipeline run OK with no video output, or with videoOutput('display://0')? I am wondering if there is a problem earlier in the pipeline.

If you aren’t using jetson-inference for inference, what kind of image are you feeding to gstEncoder::Render()? Is the memory being allocated on the GPU?

@dusty_nv, thanks for your reply.

I am not using jetson-inference. The image I am feeding to gstEncoder::Render() is RGB3 data. I use videoOutput from jetson-utils to write the image to the mp4 file; it calls output->Render(image, Width, Height) to feed the image to gstEncoder::Render(). The image is a buffer of uchar3, and this memory is not allocated on the GPU.

With the same inference and output (encode to mp4) code, if I use jetson-utils videoSource for the input, I have no issues at all.

Ah OK, gotcha - the memory would need to be allocated on the GPU. Try using the cudaAllocMapped() function - this allocates memory that is shared between the CPU and GPU (since Jetson shares the same physical memory between the CPU and GPU, it can use zero-copy memory).

If your image is in another buffer, you can do a simple memcpy() into the buffer you allocate with cudaAllocMapped(), since the cudaAllocMapped() pointer is accessible from both the CPU and GPU. Then pass that pointer from cudaAllocMapped() to gstEncoder::Render().
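A minimal sketch of that approach, assuming a 1280x720 uchar3 frame where `cpuImage` is the CPU-side buffer from the ffmpeg decoder (the dimensions, function name, and header paths are illustrative - adjust for your build):

```cpp
#include <cstring>              // memcpy()
#include "cudaMappedMemory.h"   // cudaAllocMapped() from jetson-utils
#include "videoOutput.h"        // videoOutput from jetson-utils

// Illustrative frame dimensions
const int width  = 1280;
const int height = 720;

bool renderFrame( videoOutput* output, const uchar3* cpuImage )
{
    // Allocate zero-copy memory visible to both the CPU and GPU.
    // (In practice, allocate this once at startup and reuse it every frame
    // rather than allocating per-frame.)
    uchar3* sharedImage = NULL;

    if( !cudaAllocMapped((void**)&sharedImage, width * height * sizeof(uchar3)) )
        return false;

    // Copy the decoded frame into the shared buffer on the CPU side
    memcpy(sharedImage, cpuImage, width * height * sizeof(uchar3));

    // Pass the GPU-accessible pointer to the encoder
    return output->Render(sharedImage, width, height);
}
```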

OK, I can try this method.

So when I used jetson-utils videoSource to decode the H264, the image memory was already allocated on the GPU, and that is why I have no issues when using videoSource as input, right?

Right, yes - the memory that videoSource returns was already allocated with cudaAllocMapped().
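For comparison, this is roughly what the all-jetson-utils loop looks like - Capture() hands back a pointer that is already zero-copy mapped, so it can go straight to Render() with no extra allocation or memcpy (the URIs below are placeholders):

```cpp
#include "videoSource.h"   // videoSource from jetson-utils
#include "videoOutput.h"   // videoOutput from jetson-utils

int main()
{
    // Placeholder URIs - substitute your actual input/output streams
    videoSource* input  = videoSource::Create("file://input.h264");
    videoOutput* output = videoOutput::Create("file://output.mp4");

    if( !input || !output )
        return 1;

    while( true )
    {
        uchar3* image = NULL;   // zero-copy buffer owned by videoSource

        if( !input->Capture(&image, 1000) )   // 1000ms timeout
            break;

        // image was allocated with cudaAllocMapped() inside videoSource,
        // so it is directly usable by the GPU-side encoder
        if( !output->Render(image, input->GetWidth(), input->GetHeight()) )
            break;
    }

    SAFE_DELETE(input);
    SAFE_DELETE(output);
    return 0;
}
```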

@dusty_nv , your solution works very well!

BTW, I should use cudaFreeHost() to free the memory allocated by cudaAllocMapped(), right?