GTX980 NVENC : Subframe readback support

I’m using NVENC (API 6.0) on a GTX980 (Windows 10 + latest drivers). I came across an encoder capability named NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK and found that it is not supported. I assume this capability indicates async readback of encoded slice data in slice mode.

Any idea whether this capability is supported on any existing or upcoming graphics cards?
Or will the feature be enabled by future driver updates?

NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK is available on many cards, on NVIDIA GeForce GTX 980 Ti via NVENCAPI_VERSION 8.0 in particular.

You can find some details on the feature applicability here pages 23, 39.

Should NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK work in async mode?

As far as I remember, you need to poll anyway; the event will not notify you about partial data availability.

Thank you. If I set enableEncodeAsync to 1, will the encoder write an encoded slice to the output bitstream buffer even if the frame is not completely finished? Does NVENC support reading encoded slice data while the frame is still being encoded? I tried many times but failed.

Does NV_ENC_CAPS_SUPPORT_SUBFRAME_READBACK mean slice-level readback? Does setting enableSubFrameWrite to 1 mean that when a slice of a frame is encoded, it is immediately written to the output buffer, so it can be read immediately even if other slices are still being encoded?

In https://on-demand.gputechconf.com/gtc/2014/presentations/S4654-detailed-overview-nvenc-encoder-api.pdf
It says to poll and read data until NV_ENC_LOCK_BITSTREAM::hwEncodeStatus = 2. What does that mean?
I find that if I set enableEncodeAsync to 1, NV_ENC_LOCK_BITSTREAM::numSlices is always 0.

As far as I remember, partial “subframe” completion is essentially the availability of a complete NAL unit that does not yet make up a full frame. So, yes, it’s probably an encoded slice. Again, to the best of my knowledge, it worked like this: a NAL is added and the running frame size is updated as well. Later, when the next NAL is available, it is appended incrementally to the buffer. You can read the buffer again, since further updates just add to the end without changing data already written. Eventually the entire frame is completed.

Thank you, I will try it again.

Do you still remember whether it should use async or sync mode? Should I read the subframes in another thread?

To the best of my knowledge you can do both, but the async event notification will only notify you on 100% frame completion. Alongside that, you can poll and see the frame being populated NAL by NAL. In sync mode you just poll in a similar way without having an event.

Then can we control how many subframes a picture is encoded into? Is it set by the slice mode, or is it unrelated to slice mode? I see that the NVIDIA samples use multiple input/output buffers. Should I change that to only one?

I think anything that creates separate VCL NAL units potentially enables subframe readback. Enabling slice mode is the obvious option. I would expect enabling intra-refresh to result in similar behavior too.

With “traditional” one NAL per frame style, however, it is unlikely that subframe readback is helpful since (if) no incomplete NALs are reported.
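Since subframe readback is about whole NAL units showing up one at a time, it can help to count the NAL units present in whatever part of the buffer has been written so far. Below is a minimal sketch, assuming the encoder output is an Annex B byte stream (NAL units separated by 00 00 01 or 00 00 00 01 start codes). The function name findStartCodes is hypothetical and not part of the NVENC API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the byte offset of every Annex B start code (00 00 01, optionally
// preceded by one extra 00) found in the first `size` bytes of `data`.
// Assumption: the buffer holds an Annex B byte stream.
std::vector<size_t> findStartCodes(const uint8_t* data, size_t size) {
    std::vector<size_t> offsets;
    size_t i = 0;
    while (i + 3 <= size) {
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1) {
            // Treat a preceding zero byte as part of a 4-byte start code.
            size_t start = (i > 0 && data[i - 1] == 0) ? i - 1 : i;
            offsets.push_back(start);
            i += 3;  // skip past the start code just found
        } else {
            ++i;
        }
    }
    return offsets;
}
```

Calling this on the valid bytes seen at each poll gives the number of NAL units delivered so far. With one slice per frame the count would stay at whatever a full frame produces, which matches the point that subframe readback is unlikely to help in that case.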

In the NVIDIA NVENC samples, a variable m_nOutputDelay is used. What does it mean? The encoder writes encoded data to buffer m_iToSend % m_nEncoderBuffer:
picParams.outputBitstream = m_vBitstreamOutputBuffer[m_iToSend % m_nEncoderBuffer];

But when it reads the data, it does this:
unsigned i = 0;
int iEnd = bOutputDelay ? m_iToSend - m_nOutputDelay : m_iToSend;
for (; m_iGot < iEnd; m_iGot++)
{
	WaitForCompletionEvent(m_iGot % m_nEncoderBuffer);
	NV_ENC_LOCK_BITSTREAM lockBitstreamData = { NV_ENC_LOCK_BITSTREAM_VER };
	lockBitstreamData.outputBitstream = vOutputBuffer[m_iGot % m_nEncoderBuffer];
	lockBitstreamData.doNotWait = true;
	NVENC_API_CALL(m_nvenc.nvEncLockBitstream(m_hEncoder, &lockBitstreamData));

	uint8_t *pData = (uint8_t *)lockBitstreamData.bitstreamBufferPtr;
	if (vPacket.size() < i + 1)
	{
		vPacket.push_back(std::vector<uint8_t>());
	}
	vPacket[i].clear();
	vPacket[i].insert(vPacket[i].end(), &pData[0], &pData[lockBitstreamData.bitstreamSizeInBytes]);
	i++;

	NVENC_API_CALL(m_nvenc.nvEncUnlockBitstream(m_hEncoder, lockBitstreamData.outputBitstream));

	if (m_vMappedInputBuffers[m_iGot % m_nEncoderBuffer])
	{
		NVENC_API_CALL(m_nvenc.nvEncUnmapInputResource(m_hEncoder, m_vMappedInputBuffers[m_iGot % m_nEncoderBuffer]));
		m_vMappedInputBuffers[m_iGot % m_nEncoderBuffer] = nullptr;
	}

	if (m_bMotionEstimationOnly && m_vMappedRefBuffers[m_iGot % m_nEncoderBuffer])
	{
		NVENC_API_CALL(m_nvenc.nvEncUnmapInputResource(m_hEncoder, m_vMappedRefBuffers[m_iGot % m_nEncoderBuffer]));
		m_vMappedRefBuffers[m_iGot % m_nEncoderBuffer] = nullptr;
	}

	int cnt = lockBitstreamData.numSlices;
}

Why is it not read directly from output buffer m_iToSend % m_nEncoderBuffer?

It looks like an intentional delay; it does not look related to subframe data.
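To see why the read index trails the write index, the counters from the quoted loop can be replayed on their own. This is only a sketch of the index arithmetic: RingState and submitAndDrain are hypothetical names, and it assumes m_iToSend is advanced once per submitted frame before the read loop runs.

```cpp
#include <vector>

// Hypothetical stand-in for the sample's counters; names mirror the thread.
struct RingState {
    int nEncoderBuffer;  // number of output buffers in the ring
    int nOutputDelay;    // how many submitted frames may stay in flight
    int iToSend;         // frames submitted so far
    int iGot;            // frames already read back
};

// Submits one frame, then returns the ring slots the read loop
// (for (; iGot < iEnd; iGot++)) would lock after that submission.
std::vector<int> submitAndDrain(RingState& s, bool bOutputDelay) {
    ++s.iToSend;  // assumption: counter is advanced before draining
    int iEnd = bOutputDelay ? s.iToSend - s.nOutputDelay : s.iToSend;
    std::vector<int> slots;
    for (; s.iGot < iEnd; ++s.iGot) {
        slots.push_back(s.iGot % s.nEncoderBuffer);
    }
    return slots;
}
```

With 4 buffers and a delay of 3, the first three submissions read nothing and the fourth reads slot 0, so the slot being read is always a few frames behind the slot being written. Passing bOutputDelay = false drains everything still in flight.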

The presentation says to poll and read data until NV_ENC_LOCK_BITSTREAM::hwEncodeStatus = 2. What does that mean?
while (lockBitstreamData.hwEncodeStatus != 2)
{
uint8_t *pData = (uint8_t *)lockBitstreamData.bitstreamBufferPtr;
if (vPacket.size() < i + 1)
{
vPacket.push_back(std::vector<uint8_t>());
}
vPacket[i].clear();
vPacket[i].insert(vPacket[i].end(), &pData[0], &pData[lockBitstreamData.bitstreamSizeInBytes]);
i++;
}
I tried this, but it becomes an infinite loop.

Don’t keep the buffer locked while looping.

Then when reading the output buffer, how do I know how many bytes to read each time?
uint8_t *pData = (uint8_t *)pEnc->m_vBitstreamOutputBuffer[pEnc->m_iToSend % pEnc->m_nEncoderBuffer];
Like this, I get the output buffer pointer, but I do not know how many bytes I should read.
Is there a variable that shows the position?

At some point (while the buffer is not locked!) an update lands and fills buffer bytes, and updates the length and hwEncodeStatus. You poll for these changes.
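The polling pattern described here can be sketched without the SDK by putting a mock in place of the encoder. Everything below is hypothetical: LockResult, MockEncoder and pollFrame are illustration names, the fields only loosely mirror bitstreamBufferPtr, bitstreamSizeInBytes and hwEncodeStatus, and the value 2 for "frame complete" is taken from the slide quoted earlier in the thread (1 for "in progress" is my own placeholder). In a real program lock() and unlock() would be nvEncLockBitstream and nvEncUnlockBitstream calls.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical result of one lock call.
struct LockResult {
    const uint8_t* ptr;
    size_t sizeInBytes;
    int hwEncodeStatus;
};

// Mock encoder: each lock() reveals one more chunk, imitating slices that
// land between polls while the buffer is not locked.
class MockEncoder {
public:
    MockEncoder(std::vector<uint8_t> frame, size_t chunk)
        : frame_(std::move(frame)), chunk_(chunk), visible_(0) {}
    LockResult lock() {
        visible_ = std::min(frame_.size(), visible_ + chunk_);
        int status = (visible_ == frame_.size()) ? 2 : 1;
        return {frame_.data(), visible_, status};
    }
    void unlock() {}
private:
    std::vector<uint8_t> frame_;
    size_t chunk_;
    size_t visible_;
};

// Lock, copy only the newly appended bytes, unlock, then poll again.
template <typename Encoder>
std::vector<std::vector<uint8_t>> pollFrame(Encoder& enc, int maxPolls) {
    std::vector<std::vector<uint8_t>> pieces;
    size_t consumed = 0;
    for (int n = 0; n < maxPolls; ++n) {
        LockResult r = enc.lock();  // re-lock each time to refresh the fields
        if (r.sizeInBytes > consumed) {
            pieces.emplace_back(r.ptr + consumed, r.ptr + r.sizeInBytes);
            consumed = r.sizeInBytes;
        }
        enc.unlock();               // release so the next update can land
        if (r.hwEncodeStatus == 2) break;
    }
    return pieces;
}
```

The point of the sketch is that the size and status are re-read through a fresh lock on every iteration. A loop that tests a status field which nothing refreshes can never terminate. A real poller would probably also sleep or yield between iterations instead of spinning.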

std::vector<std::vector<uint8_t>> vPacket;
NV_ENC_LOCK_BITSTREAM lockBitstreamData = { NV_ENC_LOCK_BITSTREAM_VER };
lockBitstreamData.outputBitstream = pEnc->m_vBitstreamOutputBuffer[pEnc->m_iToSend % pEnc->m_nEncoderBuffer];
lockBitstreamData.doNotWait = true;
int i = 0;
while(lockBitstreamData.hwEncodeStatus != 2) {
uint8_t *pData = (uint8_t *)lockBitstreamData.bitstreamBufferPtr;
if (vPacket.size() < i + 1)
{
vPacket.push_back(std::vector<uint8_t>());
}
vPacket[i].clear();
vPacket[i].insert(vPacket[i].end(), &pData[0], &pData[lockBitstreamData.bitstreamSizeInBytes]);
i++;
}

Is it right?