DirectX 11 and asynchronous encoding

What are the thread-safety considerations for performing asynchronous encoding with DX11 resources?

As part of my render loop I call NvEncEncodePicture using a mapped and registered DX11 texture. The mapped texture is a copy and is not accessed elsewhere, per the documentation for NvEncMapInputResource.

When the render loop does no work other than the encoding, this works fine; but when rendering is enabled, it quickly deadlocks. Everything uses the same immediate context. The resource to encode is unmapped in the secondary thread that handles the output.

If it is required to put a critical section around the encoding functions, then it seems I might as well use the synchronous encoding model.
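
By “a critical section around the encoding functions” I mean something like the sketch below, where encodeCriticalSection would be a hypothetical lock shared with the render loop:

// Hypothetical workaround: serialize every NVENC submission against
// all other work that touches the device context.
EnterCriticalSection(&encodeCriticalSection);
nvStatus = encodeAPI->nvEncEncodePicture(encoder, &picParams);
LeaveCriticalSection(&encodeCriticalSection);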


Are you doing an unmap of the surface using “NvEncUnmapInputResource” before using it for rendering?

Please also have a look at the comments in “nvEncodeAPI.h” placed atop “NvEncMapInputResource” and “NvEncUnmapInputResource”.

Thanks,
Ryan Park

Ryan,

Thank you kindly for your reply.

Before calling NvEncEncodePicture I first copy the texture to a local D3D11 Texture2D buffer with CopyResource, and then call NvEncMapInputResource. The copy is necessary because the source is a render target, and I have a queue of buffers. On the secondary thread that receives the output buffer I call lock/copy/unlock and finally NvEncUnmapInputResource on the buffer.

But if I must wait until after NvEncUnmapInputResource to do other rendering, then how is this asynchronous? What is the advantage of this approach over a synchronous model (which works fine and takes about 6 ms, but that is a long time when doing VR)?

Your sequence of calls looks good to me. Just to double-confirm, I am assuming that you are doing the following:

  1. You are enabling the async mode using NV_ENC_INITIALIZE_PARAMS::enableEncodeAsync
  2. When you are using the Async mode, you are creating the completion event and registering the event using NvEncRegisterAsyncEvent(..)
  3. You are passing the event in NV_ENC_PIC_PARAMS::completionEvent.
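
In code form, those three steps look roughly like this (a minimal sketch against the structs in nvEncodeAPI.h; error checking omitted):

// 1. Enable async mode when initializing the encoder.
NV_ENC_INITIALIZE_PARAMS initializeParams = {};
initializeParams.version = NV_ENC_INITIALIZE_PARAMS_VER;
initializeParams.enableEncodeAsync = 1;
// ... codec GUID, width/height, frame rate, preset, etc. ...
encodeAPI->nvEncInitializeEncoder(encoder, &initializeParams);

// 2. Create a completion event (one per in-flight buffer) and register it.
HANDLE hOutputEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
NV_ENC_EVENT_PARAMS eventParams = {};
eventParams.version = NV_ENC_EVENT_PARAMS_VER;
eventParams.completionEvent = hOutputEvent;
encodeAPI->nvEncRegisterAsyncEvent(encoder, &eventParams);

// 3. Pass the event with each submitted frame; it is signaled when that
//    frame's output is ready.
NV_ENC_PIC_PARAMS picParams = {};
picParams.version = NV_ENC_PIC_PARAMS_VER;
picParams.completionEvent = hOutputEvent;
// ... input buffer, output bitstream, picture struct, etc. ...
encodeAPI->nvEncEncodePicture(encoder, &picParams);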

In case you are already doing all of these, can you just share the source code? If it’s not possible to share the entire source code, is it possible to create a sample for us and share it?

I would also like to know a bit more about your use case.

  1. What is your desired encode speed?
  2. What resolution and bit-depth are you operating at?
  3. Which preset and RC mode are you using, and which encoding features are you enabling?
  4. Which GPU card and driver are you using?

Thanks,
Ryan Park

Ryan,

The code is below. When running this, it will deadlock after a few frames, or give the following error:

D3D11 CORRUPTION: ID3D11DeviceContext::DecoderExtension: Two threads were found to be executing functions associated with the same Device[Context] at the same time. This will cause corruption of memory. Appropriate thread synchronization needs to occur external to the Direct3D API (or through the ID3D10Multithread interface). 21196 and 27928 are the implicated thread ids. [ MISCELLANEOUS CORRUPTION #28: CORRUPTED_MULTITHREADING]

While the render loop does not access the buffers that are created for encoding, it does use the same context. Some thread synchronization could be added to address this, but isn’t the encoding then essentially synchronous?
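
For completeness, the ID3D10Multithread route that the debug layer mentions would look something like the sketch below. As far as I understand, it just makes D3D take an internal lock around every call, which serializes the context rather than making the encode truly asynchronous.

#include <d3d10misc.h> // ID3D10Multithread

// Ask D3D to take its own critical section around every device call so
// that cross-thread use no longer corrupts state.
CComPtr<ID3D10Multithread> multithread;
HRESULT hr = context->d3d11Device->QueryInterface(
	__uuidof(ID3D10Multithread), (void**)&multithread);
if (SUCCEEDED(hr))
	multithread->SetMultithreadProtected(TRUE);

Anyway, the full code follows.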

#pragma once

#include <windows.h>
#include <d3d11.h>
#include <atlbase.h> // CComPtr
#include <cstdio>
#include <memory>
#include <queue>

#include "nvEncodeAPI.h"

namespace RoomAlive
{
	typedef NVENCSTATUS(__stdcall *MYPROC)(NV_ENCODE_API_FUNCTION_LIST *);
#define SET_VER(configStruct, type) { configStruct.version = type##_VER; }

	struct MyBuffer
	{
		void* hInputBuffer;
		void* registeredResource;
		void* hBitstreamBuffer;
		HANDLE hOutputEvent;
		NV_ENC_BUFFER_FORMAT bufferFormat;
		LARGE_INTEGER startingTime;
		NV_ENC_INPUT_PTR mappedResource;
	};

	class NvEncodeAsynchronous
	{
	public:
		NvEncodeAsynchronous(std::shared_ptr<Context> context, CComPtr<ID3D11Texture2D> inputTexture) : context(context), inputTexture(inputTexture)
		{
			NVENCSTATUS nvStatus;

			// get the nvEncode interface
			HINSTANCE instance = LoadLibrary(TEXT("nvEncodeAPI64.dll"));
			MYPROC nvEncodeAPICreateInstance = (MYPROC)GetProcAddress(instance, "NvEncodeAPICreateInstance");
			encodeAPI = new NV_ENCODE_API_FUNCTION_LIST;
			memset(encodeAPI, 0, sizeof(NV_ENCODE_API_FUNCTION_LIST));
			encodeAPI->version = NV_ENCODE_API_FUNCTION_LIST_VER;
			nvStatus = nvEncodeAPICreateInstance(encodeAPI);

			// open an encode Session
			NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS encodeSessionExParams;
			memset(&encodeSessionExParams, 0, sizeof(NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS));
			SET_VER(encodeSessionExParams, NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS);
			encodeSessionExParams.device = (void*)(context->d3d11Device.p); // need AddRef?
			encodeSessionExParams.deviceType = NV_ENC_DEVICE_TYPE_DIRECTX;
			encodeSessionExParams.apiVersion = NVENCAPI_VERSION;
			nvStatus = encodeAPI->nvEncOpenEncodeSessionEx(&encodeSessionExParams, &encoder);

			D3D11_TEXTURE2D_DESC desc;
			inputTexture->GetDesc(&desc);
			width = desc.Width;
			height = desc.Height;

			GUID encodeGUID = NV_ENC_CODEC_H264_GUID;
			GUID presetGUID = NV_ENC_PRESET_DEFAULT_GUID;

			// initialize hardware encoder session with reasonable defaults
			NV_ENC_INITIALIZE_PARAMS initializeParams;
			memset(&initializeParams, 0, sizeof(NV_ENC_INITIALIZE_PARAMS));
			SET_VER(initializeParams, NV_ENC_INITIALIZE_PARAMS);
			initializeParams.encodeGUID = encodeGUID;
			initializeParams.encodeWidth = width;
			initializeParams.encodeHeight = height;
			initializeParams.darWidth = width;
			initializeParams.darHeight = height;
			initializeParams.frameRateNum = 60;
			initializeParams.frameRateDen = 1;
			initializeParams.presetGUID = presetGUID;
			initializeParams.enableEncodeAsync = 1; // asynchronous
			initializeParams.enablePTD = 1; // let encoder decide on picture type (I, P, B)
			nvStatus = encodeAPI->nvEncInitializeEncoder(encoder, &initializeParams);
			
			// client should allocate at least 1 + no. B-frames buffers
			for (int i = 0; i < 16; i++)
			{
				HRESULT hr;

				ID3D11Texture2D* texture;
				{
					D3D11_TEXTURE2D_DESC d;
					ZeroMemory(&d, sizeof(d));
					d.Width = width;
					d.Height = height;
					d.MipLevels = 1;
					d.ArraySize = 1;
					d.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
					d.SampleDesc.Count = 1;
					d.SampleDesc.Quality = 0;
					d.Usage = D3D11_USAGE_DEFAULT;
					hr = context->d3d11Device->CreateTexture2D(&d, NULL, &texture);
				}

				NV_ENC_REGISTER_RESOURCE registerResource;
				memset(&registerResource, 0, sizeof(registerResource));
				SET_VER(registerResource, NV_ENC_REGISTER_RESOURCE);
				registerResource.resourceType = NV_ENC_INPUT_RESOURCE_TYPE_DIRECTX;
				registerResource.width = width;
				registerResource.height = height;
				registerResource.pitch = 0;
				registerResource.subResourceIndex = 0; // directx subresource
				registerResource.resourceToRegister = (void*)texture;
				registerResource.bufferFormat = NV_ENC_BUFFER_FORMAT_ARGB;
				nvStatus = encodeAPI->nvEncRegisterResource(encoder, &registerResource);


				// allocate output buffer
				NV_ENC_CREATE_BITSTREAM_BUFFER createBitStreamBuffer;
				memset(&createBitStreamBuffer, 0, sizeof(createBitStreamBuffer));
				SET_VER(createBitStreamBuffer, NV_ENC_CREATE_BITSTREAM_BUFFER);
				nvStatus = encodeAPI->nvEncCreateBitstreamBuffer(encoder, &createBitStreamBuffer);

				// create output event
				NV_ENC_EVENT_PARAMS nvEventParams = { 0 };
				SET_VER(nvEventParams, NV_ENC_EVENT_PARAMS);
				HANDLE hOutputEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
				nvEventParams.completionEvent = hOutputEvent;
				nvStatus = encodeAPI->nvEncRegisterAsyncEvent(encoder, &nvEventParams);

				// add to input buffer queue
				MyBuffer* buffer = new MyBuffer();
				buffer->hInputBuffer = texture;
				buffer->registeredResource = registerResource.registeredResource;
				buffer->bufferFormat = NV_ENC_BUFFER_FORMAT_ARGB;
				buffer->hBitstreamBuffer = createBitStreamBuffer.bitstreamBuffer;
				buffer->hOutputEvent = hOutputEvent; // completion event signaled when the frame is done

				inputBuffers.push(buffer);
			}

			InitializeCriticalSectionAndSpinCount(&inputBuffersCriticalSection, 0);
			InitializeCriticalSectionAndSpinCount(&outputBuffersCriticalSection, 0);
			outputSemaphore = CreateSemaphore(NULL, 0, 16, NULL);

			fopen_s(&outputFile, "output.h264", "wb");

			// use ffmpeg to generate a proper mp4 file from the h264 stream, without re-encoding it:
			// ffmpeg -i "output.h264" -c:v copy -f mp4 "output.mp4" -y

			// create a thread to handle output
			CreateThread(NULL, 0, OutputThread, this, 0, NULL);
		}

		static DWORD WINAPI OutputThread(LPVOID lpParam)
		{
			((NvEncodeAsynchronous*)lpParam)->Loop();
			return 0;
		}

		// called in render loop
		void Next()
		{
			printf("Next\n");

			NVENCSTATUS nvStatus;

			EnterCriticalSection(&inputBuffersCriticalSection);
			if (inputBuffers.empty())
			{
				// bail out rather than popping an empty queue
				LeaveCriticalSection(&inputBuffersCriticalSection);
				printf("out of buffers----------------------------------------");
				return;
			}
			MyBuffer* buffer = inputBuffers.front();
			inputBuffers.pop();
			LeaveCriticalSection(&inputBuffersCriticalSection);

			context->d3d11DeviceContext->CopyResource((ID3D11Texture2D*)buffer->hInputBuffer, inputTexture);

			printf("buffer copied\n");

			// map resource
			NV_ENC_MAP_INPUT_RESOURCE mapInputResource;
			memset(&mapInputResource, 0, sizeof(mapInputResource));
			SET_VER(mapInputResource, NV_ENC_MAP_INPUT_RESOURCE);
			mapInputResource.registeredResource = buffer->registeredResource;
			nvStatus = encodeAPI->nvEncMapInputResource(encoder, &mapInputResource);

			buffer->bufferFormat = mapInputResource.mappedBufferFmt;
			buffer->mappedResource = mapInputResource.mappedResource;

			// push output buffer
			EnterCriticalSection(&outputBuffersCriticalSection);
			outputBuffers.push(buffer);
			LeaveCriticalSection(&outputBuffersCriticalSection);

			// signal output thread that there is work to do
			ReleaseSemaphore(outputSemaphore, 1, NULL);

			printf("outputSemaphore released\n");

			// start encoding
			NV_ENC_PIC_PARAMS picParams;
			memset(&picParams, 0, sizeof(picParams));
			SET_VER(picParams, NV_ENC_PIC_PARAMS);
			picParams.inputWidth = width;
			picParams.inputHeight = height;
			picParams.inputPitch = width;
			picParams.inputBuffer = buffer->mappedResource;
			picParams.outputBitstream = buffer->hBitstreamBuffer;
			picParams.bufferFmt = buffer->bufferFormat;
			picParams.completionEvent = buffer->hOutputEvent;
			picParams.pictureStruct = NV_ENC_PIC_STRUCT_FRAME;
			nvStatus = encodeAPI->nvEncEncodePicture(encoder, &picParams);

			printf("encodePicture %d\n", nvStatus);

			QueryPerformanceCounter(&buffer->startingTime);
		}


		void Loop()
		{
			NVENCSTATUS nvStatus;

			while (true)
			{
				// block until we have work to do
				WaitForSingleObject(outputSemaphore, INFINITE);

				printf("outputSemaphore\n");

				// get output buffer
				EnterCriticalSection(&outputBuffersCriticalSection);
				MyBuffer* buffer = outputBuffers.front();
				outputBuffers.pop();
				LeaveCriticalSection(&outputBuffersCriticalSection);

				// wait until encoding is done
				WaitForSingleObject(buffer->hOutputEvent, INFINITE);

				printf("outputEvent\n");

				// elapsed time since submitted for encoding
				LARGE_INTEGER endingTime, frequency, elapsedMicroseconds;
				QueryPerformanceCounter(&endingTime);
				QueryPerformanceFrequency(&frequency);
				elapsedMicroseconds.QuadPart = endingTime.QuadPart - buffer->startingTime.QuadPart;
				elapsedMicroseconds.QuadPart *= 1000000;
				elapsedMicroseconds.QuadPart /= frequency.QuadPart;

				// copy bitstream data
				NV_ENC_LOCK_BITSTREAM lockBitstream;
				memset(&lockBitstream, 0, sizeof(lockBitstream));
				SET_VER(lockBitstream, NV_ENC_LOCK_BITSTREAM);
				lockBitstream.doNotWait = 1;
				lockBitstream.outputBitstream = buffer->hBitstreamBuffer;

				//context->renderLock.lock();

				//printf("before LockBitstream\n");
				//nvStatus = encodeAPI->nvEncLockBitstream(encoder, &lockBitstream);
				//printf("LockBitStream %d\n", nvStatus);
				//printf("output frame %d %d bytes %lld microsec\n", nFrames++, lockBitstream.bitstreamSizeInBytes, elapsedMicroseconds.QuadPart);
				////fwrite(lockBitstreamData.bitstreamBufferPtr, 1, lockBitstreamData.bitstreamSizeInBytes, outputFile);
				//nvStatus = encodeAPI->nvEncUnlockBitstream(encoder, buffer->hBitstreamBuffer);
				//printf("UnlockBitStream %d\n", nvStatus);

				//context->renderLock.unlock();

				nvStatus = encodeAPI->nvEncUnmapInputResource(encoder, buffer->mappedResource);
				printf("UnmapInputResource %d\n", nvStatus);


				printf("copied bitstream\n");

				// return buffer
				EnterCriticalSection(&inputBuffersCriticalSection);
				inputBuffers.push(buffer);
				LeaveCriticalSection(&inputBuffersCriticalSection);

				printf("returned input buffer\n");
			}
		}

		protected:

			NV_ENCODE_API_FUNCTION_LIST * encodeAPI;
			CRITICAL_SECTION inputBuffersCriticalSection, outputBuffersCriticalSection;
			HANDLE outputSemaphore;
			void *encoder;
			std::queue<MyBuffer*> outputBuffers;
			std::queue<MyBuffer*> inputBuffers;
			FILE* outputFile;
			int nFrames = 0;
			std::shared_ptr<Context> context;
			CComPtr<ID3D11Texture2D> inputTexture;
			int width, height;

	};
}

I’m not sure if you’re having the same problem as I (thought I) had, but I’ll share it anyway.

I’m rendering a simple scene and then doing the encoding (on a GeForce GTX 1080). I noticed that nvEncMapInputResource takes 2 ms of GPU time (when measured with DX11’s queries). I would expect this function to return immediately. It turns out that for the simple scene I was reaching 1200 FPS, and the hardware encoder couldn’t finish encoding in 1/1200th of a second. Apparently a full-HD image takes the hardware encoder about 2 ms (1/500th of a second) to finish, hence the 2 ms stall in the call to nvEncMapInputResource on a subsequent frame. When I put considerably more load on my GPU, bringing it down to about 100 FPS, the call to nvEncMapInputResource took about 0.15-0.25 ms. Still quite a lot for my taste, but I won’t complain about that 10x improvement :).
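
For reference, this is roughly how I took the measurement (the standard D3D11 timestamp-query pattern; a sketch, with device and context standing in for your own objects):

// Measure GPU time across a span of work with D3D11 timestamp queries.
D3D11_QUERY_DESC queryDesc = {};
queryDesc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
CComPtr<ID3D11Query> disjoint, tsBegin, tsEnd;
device->CreateQuery(&queryDesc, &disjoint);
queryDesc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&queryDesc, &tsBegin);
device->CreateQuery(&queryDesc, &tsEnd);

context->Begin(disjoint);
context->End(tsBegin);
// ... the call being measured, e.g. nvEncMapInputResource ...
context->End(tsEnd);
context->End(disjoint);

// Spin until the results arrive (fine for profiling, not for shipping).
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
UINT64 t0, t1;
while (context->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}
while (context->GetData(tsBegin, &t0, sizeof(t0), 0) != S_OK) {}
while (context->GetData(tsEnd, &t1, sizeof(t1), 0) != S_OK) {}
if (!dj.Disjoint)
	printf("GPU time: %.3f ms\n", (t1 - t0) * 1000.0 / dj.Frequency);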

Hm, now I think you have a different problem. I have also stumbled upon the problem of “Two threads were found to be executing functions associated with the same Device[Context] at the same time”. I think that you can’t unmap the resource on the reading/watching thread; you have to unmap it on the thread where the resource was mapped. Basically, only the wait procedure (WaitForSingleObject) should be called on a separate thread. I am not sure about copying the bitstream data, i.e. whether that can be done on the reading/watching thread.
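
Applied to the code above, that would mean Loop() only waits on the completion event and hands the buffer back, while the render thread drains a “completed” queue and does the unmapping itself. A sketch (completedBuffers and its critical section are additions of mine, not part of the posted code):

// Output thread: wait only; make no NVENC or D3D11 calls here.
WaitForSingleObject(buffer->hOutputEvent, INFINITE);
EnterCriticalSection(&completedBuffersCriticalSection);
completedBuffers.push(buffer);
LeaveCriticalSection(&completedBuffersCriticalSection);

// Render thread, once per frame: everything that touches the encoder or
// the device context runs here, on the thread that mapped the resource.
std::queue<MyBuffer*> done;
EnterCriticalSection(&completedBuffersCriticalSection);
std::swap(done, completedBuffers);
LeaveCriticalSection(&completedBuffersCriticalSection);
while (!done.empty())
{
	MyBuffer* buffer = done.front();
	done.pop();

	// the bitstream lock/copy/unlock from Loop() would move here too
	encodeAPI->nvEncUnmapInputResource(encoder, buffer->mappedResource);

	EnterCriticalSection(&inputBuffersCriticalSection);
	inputBuffers.push(buffer);
	LeaveCriticalSection(&inputBuffersCriticalSection);
}
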
Anyway, I myself didn’t need a second thread, because I simply call WaitForSingleObject with a 0 timeout and check it each loop iteration. If it returns WAIT_OBJECT_0, I copy the bitstream and unmap the resource on the main thread, which doesn’t really take much time.
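
In outline, my per-frame check looks like this (a sketch; pending and freeBuffers are my names for the in-flight and free buffer queues):

// Once per frame on the main thread: a non-blocking completion check.
if (!pending.empty())
{
	MyBuffer* buffer = pending.front();
	if (WaitForSingleObject(buffer->hOutputEvent, 0) == WAIT_OBJECT_0)
	{
		pending.pop();

		NV_ENC_LOCK_BITSTREAM lockBitstream;
		memset(&lockBitstream, 0, sizeof(lockBitstream));
		SET_VER(lockBitstream, NV_ENC_LOCK_BITSTREAM);
		lockBitstream.outputBitstream = buffer->hBitstreamBuffer;
		encodeAPI->nvEncLockBitstream(encoder, &lockBitstream);
		fwrite(lockBitstream.bitstreamBufferPtr, 1, lockBitstream.bitstreamSizeInBytes, outputFile);
		encodeAPI->nvEncUnlockBitstream(encoder, buffer->hBitstreamBuffer);

		// map, encode, lock, and unmap all stay on the same thread
		encodeAPI->nvEncUnmapInputResource(encoder, buffer->mappedResource);
		freeBuffers.push(buffer);
	}
}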