[MMAPI R28.2/R28.1] deinitPlane() of NvVideoEncoder -- Memory Leak ?

waynezhu · June 11, 2018, 1:22am

Hi Mynaemi,

I have found where the memory leakage in our library is from. Will give you a new library once fix be approved.
Pls pre-allocate dmabuf for capture, then use a queue to pass to/frame encode mode, a simple code is in comment 18.
pls use different thread for qbuffer and dqbuffer in output and capture plane, just like B01 do.

THanks
wayne zhu

mynaemi · June 11, 2018, 2:12am

Hi waynezhu,

I arreciate for your effort !!!
I’ll wait for the new library.

Is the new library same as #5 of TOPIC 1036111 ?
[url]https://devtalk.nvidia.com/default/topic/1036111/jetson-tx2/-mmapi-decode-more-than-one-h264-file-get-problem-when-trying-to-modify-00_video_decode/[/url]

Sorry, I cannot understand this in your reply 3).
“just like B01 do.”

Best Regards

waynezhu · June 11, 2018, 2:48am

Hi,
Is the new library same as #5 of TOPIC 1036111 ?

Same library, but different fix. Do you have decode function in your project?

Sorry, I cannot understand this in your reply 3).
“just like B01 do.”

In your code, qbuffer and dqbuffer is in same thread. You need put qbuffer and dqbuffer in different thread.

Thanks
wayne zhu

mynaemi · June 11, 2018, 11:29am

Hi wayne zhu,

We need decode function because I’m developing both a streaming server and a streaming client.
(Now it have one-way communication, a Jetson TX2 is a server, another Jetson TX2 is a client.
But in the next plan, it will be bidirectional, both encoder and decoder in a Jetson TX2.)

I’m facing with another issue about NvVideoDecoder of R28.2.
https://devtalk.nvidia.com/default/topic/1032037/jetson-tx2/-mmapi-r28-1-r28-1-to-reduce-dpb-delay-of-nvvideodecoder/post/5263736/#5263736

Which does this “thread” mean ?
(1) C++ (POSIX) thread
(2) TOPIC in this forum

In the comment #17, you said:
Number of qbuffer and dqbuffer is not exactly same, so for some fd, NvBufferDestroy(fd) is not called.

In the comment #18, I wrote:
I get new buffers using setupPlane(), call qBuffer() for acquiring an image/frame to the encoder, and reuse empty buffers from dqBuffer().
Do I have to use NvBufferDestroy(fd) and createNvBuffer() instead of dqBuffer() and reuse the empty buffers of output plane ?

I made a fair copy of your #17 code, in #18, it’s not mine.

I know that it is different between number of output_plane.qBuffer() and number of capture_plane.dqBuffer().
(SPS/PPS is added)
If you mean “number of output_plane.qBuffer() is not same as number of output_plane.dqBuffer()”,
I doubt it and can not understand it.

Could you please explain it in simple words ?

Best Regards

waynezhu · June 13, 2018, 3:22am

Which does this “thread” mean ?
(1) C++ (POSIX) thread
(2) TOPIC in this forum
>> (1)
In the comment #17, you said:
Number of qbuffer and dqbuffer is not exactly same, so for some fd, NvBufferDestroy(fd) is not called.

In the comment #18, I wrote:
I get new buffers using setupPlane(), call qBuffer() for acquiring an image/frame to the encoder, and reuse empty buffers from dqBuffer().
Do I have to use NvBufferDestroy(fd) and createNvBuffer() instead of dqBuffer() and reuse the empty buffers of output plane ?
>> IN setupPlane, there is a parameter V4L2_MEMORY_DMABUF/MMAP, if it is DMABUF, you have to use createNvBuffer to create a dmabuf, then pass fd to encode, if it is MMAP, a memory will be allocated in our library, you can use directly.

I made a fair copy of your #17 code, in #18, it’s not mine.

I know that it is different between number of output_plane.qBuffer() and number of capture_plane.dqBuffer().
(SPS/PPS is added)
If you mean “number of output_plane.qBuffer() is not same as number of output_plane.dqBuffer()”,
I doubt it and can not understand it.
>> we can also output two or more slices for a frame.

Could you please explain it in simple words ?

Can I have your email?
So I can send you new library.
We have fix it now.

THanks
wayne zhu
libtegrav4l2.7z (49.4 KB)

waynezhu · June 13, 2018, 3:28am

BTW, I can still see memory usage increase with your APP, but no nvmap memory leakage now.

I will debug more, you can also check whether it is CPU buffer leakage in APP.

CPU buffer: use new/malloc/calloc to allocate.
dma buffer: useNvBufferCreateEX to allocate.

Thanks
wayne zhu

mynaemi · June 13, 2018, 4:36am

Hi wayne zhu,

I appreciate you for your quick and polite response.

In our system, DMABUF is used from NvVideoConverter to NvVideoEncoder,
because it does not need memcpy() between buffers.
There are 3 threads for transfer image from NvVideoConverter to NvVideoEncoder.
(1) Converter CAPTURE plane dqThread
(2) Encoder OUTPUT plane dqThread
(3) Encoder main thread for OUTPUT plane qBuffer()

Between (1) and (3), (2) and (3), it uses 2 std::queue.(#1 and #2, below)
(1)–(#1:DMABUF)–>(3)
(2)–(#2:NvBuffer)–>(3)

Is it better that NvBufferDestroy(fd) is called in (2) Encoder OUTPUT dqThread (callback) ?
Is it better that createNvBuffer() is called in (3) Encoder main thread ?

I understood how to use NvBufferDestroy(fd).

fd = v4l2_buf.m.planes[0].m.fd;
NvBufferDestroy(fd);

But it is a little complicated how to use createNvBuffer().

NV::IImageNativeBuffer *iNativeBuffer =
	interface_cast<NV::IImageNativeBuffer>(iFrame->getImage());
if (!iNativeBuffer)
	ORIGINATE_ERROR("IImageNativeBuffer not supported by Image.");
	fd = iNativeBuffer->createNvBuffer(STREAM_SIZE,
		NvBufferColorFormat_YUV420,
		(DO_CPU_PROCESS) ? NvBufferLayout_Pitch: NvBufferLayout_BlockLinear);

Which document do I have to refer for NV::IImageNativeBuffer, iFrame->getImage() ?
And the relation between NvBuffer of NvVideoConverter capture plane and iFrame->getImage() ?

In our callback function for encoder.output_plane_dqBuffer,
it calls converter.capture_plane.qBuffer() with shared_buffer->index.

bool VideoEncoder::encoder_output_plane_dq_callback(struct v4l2_buffer *v4l2_buf,
		NvBuffer * buffer, NvBuffer * shared_buffer, void *arg) {
	VideoEncoder *venc = (VideoEncoder *)arg;
	struct v4l2_buffer ret_qbuf;
	struct v4l2_plane planes[MAX_PLANES];

	if (!v4l2_buf) {
		venc->abort();
		return false;
	}
	
	memset(&ret_qbuf, 0, sizeof(ret_qbuf));
	memset(&planes, 0, sizeof(planes));
	ret_qbuf.index = shared_buffer->index;		// DMABUF
	ret_qbuf.m.planes = planes;
	// Return buffer to conv
	if (venc->prevConv->capture_plane_qBuffer(ret_qbuf, NULL) < 0) {
		venc->abort();
		return false;
	}
	// GOT EOS from encoder. Stop dqthread.
	if (shared_buffer->planes[0].bytesused == 0) return false;

	venc->pushOutNvBuffer(buffer);	// Return Empty NvBuffer into std::queue #2
	return true;
}

In our callback function for converter.capture_plane_dqBuffer,
it uses std::queue with buffer.

bool VideoConverter::conv0_capture_dqbuf_thread_callback(
		struct v4l2_buffer *v4l2_buf, NvBuffer * buffer,
		NvBuffer * shared_buffer, void *arg) {
	VideoConverter *vcon = (VideoConverter *) arg;
	NvBuffer *conv1_buffer;
	struct v4l2_buffer conv1_qbuf;
	struct v4l2_plane planes[MAX_PLANES];
	qData_t *outBuf;  // Element for std::queue #1

	vcon->conv0_mtx.unlock();
	if (!v4l2_buf) {
		vcon->abort();
		return false;
	}

	if (v4l2_buf->m.planes[0].bytesused == 0) return false;
	// Get a buffer of queue #1 from encoder
	while ((outBuf = (qData_t*) vcon->outBufferPool.getEmptyBuffer(5)) == NULL) {
		if (vcon->eos) return false;
	}
	outBuf->data = (uint8_t*) buffer;
	outBuf->timestamp_us = timeval2usec64(&v4l2_buf->timestamp);
	// To Encoder via std::queue #1
	vcon->outBufferPool.pushFilledBuffer((void*) outBuf);

	return true;
}

In the encoder main thread, image data is from std::queue to output_plane.qBuffer().

int VideoEncoder::inBuf_to_enc() {
	int ret;
	NvBuffer *buffer = nullptr;
	NvBuffer *sbuffer = nullptr;
	qData_t *inBuf;  // Element for std::queue #1
	struct v4l2_buffer v4l2_buf;
	struct v4l2_plane planes[MAX_PLANES];
	memset(&v4l2_buf, 0, sizeof(v4l2_buf));
	memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));

	// Get Filled Buffer from std::queue #1
	if ((inBuf = (qData_t*)inBufferPool->getFilledBuffer(30)) == NULL) {
		if(!mEnc->isInError() && !myExitFlag && !eos) return 0;
		return -1;
	}
	sbuffer = (NvBuffer *)inBuf->data;
	v4l2_buf.timestamp.tv_sec = inBuf->timestamp_us / 1000000;
	v4l2_buf.timestamp.tv_usec = inBuf->timestamp_us % 1000000;
	v4l2_buf.sequence = seq;
	v4l2_buf.flags |= V4L2_BUF_FLAG_TIMESTAMP_COPY;

	buffer = getOutNvBuffer(50);	// Get Empty NvBuffer from std::queue #2
	if(buffer){
		v4l2_buf.index = buffer->index;
		v4l2_buf.m.planes = planes;
		if(pBroker) controlEnc();	// Change BitRate
		ret = mEnc->output_plane.qBuffer(v4l2_buf, sbuffer);
		if (ret < 0) return -1;
		seq++;
	}
	inBufferPool->pushEmptyBuffer(inBuf);	// Return Empty Buffer via std::queue #1

	return 1;
}

Thank you for your proposal.
Would you please send it to me via e-mail.
Cannot you refer my account information ?

Best Regards

mynaemi · June 13, 2018, 4:53am

Hi wayne zhu,

What is “YOUR APP” ?
If it is that I had sent in the comment #10, it is a modified sample code with MMAPI.

CHANGES:
(1) H.264 profile and level
(2) Loop execution to !ArgusSamples::execute()
(3) Change resolution outside of execute()
(4) Add logging output

Best Regards

waynezhu · June 13, 2018, 5:36am

Hi wayne zhu,

I appreciate you for your quick and polite response.

IN setupPlane, there is a parameter V4L2_MEMORY_DMABUF/MMAP,
if it is DMABUF, you have to use createNvBuffer to create a dmabuf, then pass fd to encode,
if it is MMAP, a memory will be allocated in our library, you can use directly.

we can also output two or more slices for a frame.

In our system, DMABUF is used from NvVideoConverter to NvVideoEncoder,
because it does not need memcpy() between buffers.
There are 3 threads for transfer image from NvVideoConverter to NvVideoEncoder.
(1) Converter CAPTURE plane dqThread
(2) Encoder OUTPUT plane dqThread
(3) Encoder main thread for OUTPUT plane qBuffer()

Between (1) and (3), (2) and (3), it uses 2 std::queue.(#1 and #2, below)
(1)–(#1:DMABUF)–>(3)
(2)–(#2:NvBuffer)–>(3)

Is it better that NvBufferDestroy(fd) is called in (2) Encoder OUTPUT dqThread (callback) ?
Is it better that createNvBuffer() is called in (3) Encoder main thread ?

I understood how to use NvBufferDestroy(fd).
fd = v4l2_buf.m.planes[0].m.fd;
NvBufferDestroy(fd);
If you use MMAP in SetupPlane, you can use reqbufs(0) to release.
NvBufferDestroy can also destroy this buffer, but I don’t suggest you to use. NvBufferDestroy can be used for dmabuf from NvBufferCreateEx or createNvBuffer.

But it is a little complicated how to use createNvBuffer().
NV::IImageNativeBuffer *iNativeBuffer =
	interface_cast<NV::IImageNativeBuffer>(iFrame->getImage());
if (!iNativeBuffer)
	ORIGINATE_ERROR("IImageNativeBuffer not supported by Image.");
	fd = iNativeBuffer->createNvBuffer(STREAM_SIZE,
		NvBufferColorFormat_YUV420,
		(DO_CPU_PROCESS) ? NvBufferLayout_Pitch: NvBufferLayout_BlockLinear);
Which document do I have to refer for NV::IImageNativeBuffer, iFrame->getImage() ?
And the relation between NvBuffer of NvVideoConverter capture plane and iFrame->getImage() ?

IImageNativeBuffer and NvBufferCreateEx’s function is same, just use NvBufferCreateEx instead of IImageNativeBuffer->createNvBuffer.
Do you mean NvBuffer structure? It is defined to keep dmabuf(allocated in SetupPlane)'s related info. getImage is used to get dmabuf’s fd, shall be same as NvBufferCreateEx.

In our callback function for encoder.output_plane_dqBuffer,
it calls converter.capture_plane.qBuffer() with shared_buffer->index.
bool VideoEncoder::encoder_output_plane_dq_callback(struct v4l2_buffer *v4l2_buf,
		NvBuffer * buffer, NvBuffer * shared_buffer, void *arg) {
	VideoEncoder *venc = (VideoEncoder *)arg;
	struct v4l2_buffer ret_qbuf;
	struct v4l2_plane planes[MAX_PLANES];

	if (!v4l2_buf) {
		venc->abort();
		return false;
	}
	
	memset(&ret_qbuf, 0, sizeof(ret_qbuf));
	memset(&planes, 0, sizeof(planes));
	ret_qbuf.index = shared_buffer->index;		// DMABUF
	ret_qbuf.m.planes = planes;
	// Return buffer to conv
	if (venc->prevConv->capture_plane_qBuffer(ret_qbuf, NULL) < 0) {
		venc->abort();
		return false;
	}
	// GOT EOS from encoder. Stop dqthread.
	if (shared_buffer->planes[0].bytesused == 0) return false;

	venc->pushOutNvBuffer(buffer);	// Return Empty NvBuffer into std::queue #2
	return true;
}
In our callback function for converter.capture_plane_dqBuffer,
it uses std::queue with buffer.
bool VideoConverter::conv0_capture_dqbuf_thread_callback(
		struct v4l2_buffer *v4l2_buf, NvBuffer * buffer,
		NvBuffer * shared_buffer, void *arg) {
	VideoConverter *vcon = (VideoConverter *) arg;
	NvBuffer *conv1_buffer;
	struct v4l2_buffer conv1_qbuf;
	struct v4l2_plane planes[MAX_PLANES];
	qData_t *outBuf;  // Element for std::queue #1

	vcon->conv0_mtx.unlock();
	if (!v4l2_buf) {
		vcon->abort();
		return false;
	}

	if (v4l2_buf->m.planes[0].bytesused == 0) return false;
	// Get a buffer of queue #1 from encoder
	while ((outBuf = (qData_t*) vcon->outBufferPool.getEmptyBuffer(5)) == NULL) {
		if (vcon->eos) return false;
	}
	outBuf->data = (uint8_t*) buffer;
	outBuf->timestamp_us = timeval2usec64(&v4l2_buf->timestamp);
	// To Encoder via std::queue #1
	vcon->outBufferPool.pushFilledBuffer((void*) outBuf);

	return true;
}
In the encoder main thread, image data is from std::queue to output_plane.qBuffer().
int VideoEncoder::inBuf_to_enc() {
	int ret;
	NvBuffer *buffer = nullptr;
	NvBuffer *sbuffer = nullptr;
	qData_t *inBuf;  // Element for std::queue #1
	struct v4l2_buffer v4l2_buf;
	struct v4l2_plane planes[MAX_PLANES];
	memset(&v4l2_buf, 0, sizeof(v4l2_buf));
	memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));

	// Get Filled Buffer from std::queue #1
	if ((inBuf = (qData_t*)inBufferPool->getFilledBuffer(30)) == NULL) {
		if(!mEnc->isInError() && !myExitFlag && !eos) return 0;
		return -1;
	}
	sbuffer = (NvBuffer *)inBuf->data;
	v4l2_buf.timestamp.tv_sec = inBuf->timestamp_us / 1000000;
	v4l2_buf.timestamp.tv_usec = inBuf->timestamp_us % 1000000;
	v4l2_buf.sequence = seq;
	v4l2_buf.flags |= V4L2_BUF_FLAG_TIMESTAMP_COPY;

	buffer = getOutNvBuffer(50);	// Get Empty NvBuffer from std::queue #2
	if(buffer){
		v4l2_buf.index = buffer->index;
		v4l2_buf.m.planes = planes;
		if(pBroker) controlEnc();	// Change BitRate
		ret = mEnc->output_plane.qBuffer(v4l2_buf, sbuffer);
		if (ret < 0) return -1;
		seq++;
	}
	inBufferPool->pushEmptyBuffer(inBuf);	// Return Empty Buffer via std::queue #1

	return 1;
}
Can I have your email?
So I can send you new library.
We have fix it now.

Thank you for your proposal.
Would you please send it to me via e-mail.
Cannot you refer my account information ?

Best Regards

Tessier · October 12, 2018, 3:03pm

Hi wayne zhu,

I have similar issue after a lot of encoder restarts with a TX2 on 28.1.

The data flow is the following:
V4L2 input (YUV422) → GPU memcpy to 1 or more converter output plane(s) with programmable frame rate decimation → CUDA processing on converter capture plane → encoder output plane → encoder capture plane

Plane setup:

conv->output_plane.setupPlane(V4L2_MEMORY_MMAP, BUFF_NUM, true, false);
    conv->capture_plane.setupPlane(V4L2_MEMORY_MMAP, BUFF_NUM, true, false);
    enc->output_plane.setupPlane(V4L2_MEMORY_DMABUF, BUFF_NUM, false, false);
    enc->capture_plane.setupPlane(V4L2_MEMORY_MMAP, BUFF_NUM, true, false);

After stop, the total number of buffer queues and dequeues, and number of queued buffers (in the same order as the setup):
Total queued buffers: 315 324 315 321
Total dequeued buffers: 305 315 314 316
Queued buffers: 10 9 1 5
This is somewhat strange, as I call conv->waitForIdle() before these prints, which should (according to the documentation): “Waits until all buffers queued on the output plane are converted and dequeued from the capture plane.”
Anyway, I tried to dequeue all buffers buffers on all planes using a code similar to this:

struct v4l2_buffer v4l2buf_i;
struct v4l2_plane v4l2_planes_i[MAX_PLANES];
NvBuffer *nvbuffer_i;
int fd = -1;
conv->output_plane.dqBuffer(v4l2buf_i, &nvbuffer_i, NULL, 10);
fd = v4l2buf_i.m.planes[0].m.fd;
NvBufferDestroy(fd);

On the converter output plane I get some prints:

nvbuf_utils: dmabuf_fd 1769239141 mapped entry NOT found
nvbuf_utils: Can not get HW buffer from FD... Exiting...

But at least the dequeue is successfull and afterwards the number of queued buffers is zero.
If I try to dequeueing any of the other planes, the thread stops at the dequeue and stays there forever.
I guess I get the error print because the converter output plane is using V4L2_MEMORY_MMAP buffers, right?

If I understand you correctly, the main reason of the memory leakage is that buffers queued on any plane are not deleted. I hoped that dequeueing an deleting is an option, but it does not seem to work. Is my idea completely wrong?
If I have to allocate buffers manually using NvBufferCreateEx, on which planes should I do that? The above print shows that all of the planes have queued buffers which are not deleted when the stream is stopped, right?

Best regards

DaneLLL · October 25, 2018, 8:14am

Hi Tessier,
Do you still suffer memory leak on r28.1? Or you have figured out a solution?

Tessier · October 25, 2018, 8:43am

Hi DaneLLL,

I still have the issue with no workaround.
So if you have any advice based on my original post, please let me know.

Thanks.

DaneLLL · October 26, 2018, 1:24am

Hi Tessier,
Please share steps to reproduce the issue in running 01_video_encode. If it requires to apply a patch to 01_video_encode, please share it also.

Tessier · October 26, 2018, 6:19am

Hi DaneLLL,

I will try to reproduce the issue with an MMAPI sample and let you know the result.

Tessier · November 23, 2018, 3:48pm

Hi DaneLLL,

I tried to reproduce my issue on a Jetson TX2 with a code based on 12_camera_v4l2_cuda_video_encode.zip form another topic:
[url]NVIDIA Multimedia APIs with UYVY sensor - Jetson TX1 - NVIDIA Developer Forums

Compared to the original code I made some modifications:

It reads video frames from file
The original code does not work on TX2 28.1, so I modified it (added an encoder output plane callback)
I do start-stop in an endless loop

After ~270 start-stop cycles, the program crashes. Unfortunately it is not the same issue as I have on our real product. Downloads:

Project (it contains a raw video, so it is large): [url]http://home.mit.bme.hu/~szanto/tegra/13_file_conv_enc.zip[/url]
Log: [url]http://home.mit.bme.hu/~szanto/tegra/13_file_conv_enc.txt[/url]

Thanks.

DaneLLL · November 26, 2018, 5:25am

Hi Tessier,
Can you try r28.2.1? Also for encoding video source, you should run 01_video_encode. Does the issue happen with 01_video_encode?

What is the issue you have on real products?

Tessier · November 26, 2018, 8:29am

Hi DaneLLL,

The basic symptom on our real product is the same as in the code I provided:

Allocate converter & encoder
Start and stop the stream in a loop
After several start/stops Multimedia API crashes

The difference is the way it crashes:

On our real product I get the same error what you can see in post 1
In the code I provided the error print is different but the effect is the same

As I wrote, to reproduce the issue I used a code provided by NVIDIA in the linked forum. It is the same as our real product in a manner that it uses VIC to convert 4:2:2 to 4:2:0 and then encodes. 01_video_encode only uses the encoder, so it is not as close to the real use case as 12_camera_v4l2_cuda_video_encode.zip

Could you please give try the code I uploaded?
I will try it on 28.2.1, but on the real product we cannot do a version upgrade right now.
Thanks.

DaneLLL · November 26, 2018, 9:08am

Hi Tessier,
Can you please try to delete and re-create NvVideoEncoder? Starting and stopping the stream in a loop is not a SQA test case we verify in BSP release, so it might be with instability.

Please also contact NVIDIA salesperson so that we can check and prioritize this issue. Thanks.

Tessier · November 26, 2018, 9:47am

Hi DaneLLL,

The reason I do not delete the encoder is that at the beginning of the development phase I found that deleting it sometimes hangs, “delete enc;” never returns. This is why I moved to allocate both the VIC and encoder statically.

I modified 01_video_encode to start/stop in a loop, it also crashes after ~274 frames, just as the VIC+encoder sample, if encoder allocation is static.
I tried non-static allocation, it did not fail yet, so I will try that again on our real product.
You can find the modified 01_vide_encode here:
[url]http://home.mit.bme.hu/~szanto/tegra/01_video_encode_mod.zip[/url]
To run it: ./video_encode bunny_1280_UYVY_420.bin 1280 720 H265 encoded.h265 -br 4000000
There is a define at the beginning of the C file
#define USE_STATIC_ENC 1
you can turn on or off the encoder static allocation if you want to give it a try.

Tessier · November 26, 2018, 12:21pm

Hi DaneLLL,

Additional info: it also crashes with non-static hardware allocation, it just takes much longer. After some time the process gets killed, most probably because it runs out of memory - as I see memory usage continuously grows in tegrastats.