D3D12 Range Profiler not collecting data from second command list

Hi,
I’m having trouble using the range profiler with D3D12.

I am profiling one pass, and in this pass 2 command lists are executed. Each command list has multiple ranges I am trying to profile, however I am only receiving counters from the first command list, and no ranges from the second.

If I adjust my pass to only encapsulate the second command list, then I receive no data from any ranges at all when I decode the counters.

If I try to create two passes, one encapsulating the first command list and a second pass around the second command list, then I get an ‘insufficient space’ error from within the NV Perf SDK. I tried seeing if anything I could manipulate could affect when this insufficient space error occurred, but couldn’t seem to find anything that had an effect.

Any help would be greatly appreciated.

I managed to find a workaround: by submitting the end pass in a separate, third command list, I got data from the ranges in both the first and second command lists. I’m still unsure of the root cause of the issue, but this fix works for my purposes.

Hi, Jamie,

Thanks for reporting the issue to us and it is good to know that you were able to find a workaround. We would still like to understand the problem so that we can help others. Is it possible to share your code with us so that we can debug this?

Here are some ways to share your files with us: 1) upload to either a Google Drive or another Cloud storage device that you prefer and provide us with access to the files; 2) if you are working with GitHub, you can also create a branch for us highlighting the issue.

Hi,

Here is my GitHub repo, and this commit is where I implemented the workaround that fixed the issue. I apologise for the commit being a bit of a mess; I had a lot going on at that moment.

Things of note: in NvGPUProfiler.h/cpp is where all things NV Perf are set up and handled. See SDFFactoryHierarchical.cpp 493:509 for the workaround that fixed the issue:

#ifdef ENABLE_INSTRUMENTATION
	{
		THROW_IF_FAIL(m_CommandAllocator->Reset());
		THROW_IF_FAIL(m_CommandList->Reset(m_CommandAllocator.Get(), nullptr));

		PROFILE_COMPUTE_END_PASS(m_CommandList.Get());

		THROW_IF_FAIL(m_CommandList->Close());
		ID3D12CommandList* ppCommandLists[] = { m_CommandList.Get() };
		m_PreviousWorkFence = computeQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

		computeQueue->WaitForFenceCPUBlocking(m_PreviousWorkFence);
	}
#endif

If I place the end pass macro within the previous command list, I receive data from the first command list but not from the second command list. If I place an END PASS within the first command list, with no matching BEGIN PASS, and then place a BEGIN PASS in the second command list, with no matching END PASS, then I will receive data from the second command list but not the first.

If you need any more information regarding the project then I am happy to help.

Hi, Jamie,

Thank you so much for sharing your code with us. We’ll look at it and get back to you as soon as we can.

Hi Jamie,

I’m not sure I read your code correctly. It seems the two command lists you mentioned were submitted to two different queues, one to the direct queue and the other to the compute queue?

NvPerfSDK only supports profiling one queue at a time: the queue that the application starts the session on (in your code, m_Profiler.BeginSession()). Even if the application starts a “compute pass” with “PROFILE_COMPUTE_BEGIN_PASS”, the pass still occurs on the direct queue. The NvPerf SDK utility caches the queue you passed to BeginSession and uses it for subsequent API calls (you can confirm this by placing a breakpoint at BeginPass() in NvPerfRangeProfilerD3D12.h). We also don’t support submitting push/pop ranges on a non-profiled queue; such commands are ignored.

Quoting from our Getting Started Guide:

ID3D12CommandQueue
* A profiling session can be started on a DIRECT or ASYNC_COMPUTE queue.
* Activity from all queues within the ID3D12Device may be measured.
* Non-profiled queues cannot push or pop ranges; such commands are ignored.
* Only one queue can act as the controller for a profiling session.
* Only one profiling session is supported at a time per GPU.

Moreover, when command lists are submitted outside of a BeginPass/EndPass pair (even to the profiled queue), the PushRange/PopRange calls they contain turn into NOPs and no data is collected. This might explain some of the strange behavior you were seeing: the BeginPass/EndPass that appeared to be on the compute queue was actually on the direct queue, so the window in which the direct queue executes BeginPass/EndPass may or may not overlap with the command lists you submitted to the compute queue.

To fix it, the general recommendation is to profile the compute queue and the direct queue separately, one queue at a time, each with its own session. This also keeps the data from the two queues separate, which matters if their execution may overlap on the GPU. Otherwise, as the Getting Started Guide notes, “Activity from all queues within the ID3D12Device may be measured.”
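The pass and queue rules described above can be pictured with a small toy model. This is only a sketch: ToyRangeProfiler, ToyCommandList and QueueId are hypothetical names made up for illustration and are not part of the NvPerf SDK. The model keeps a range only when its command list is submitted to the profiled queue while a pass is open.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are NOT the NvPerf SDK API.
enum class QueueId { Direct, Compute };

// A command list is reduced to the names of the ranges recorded in it.
struct ToyCommandList {
    std::vector<std::string> ranges;
};

class ToyRangeProfiler {
public:
    // The session is bound to exactly one queue (the one given at session start).
    explicit ToyRangeProfiler(QueueId profiledQueue) : m_profiledQueue(profiledQueue) {}

    void BeginPass() { assert(!m_inPass); m_inPass = true; }
    void EndPass()   { assert(m_inPass);  m_inPass = false; }

    // Models a submission: ranges count only if the list is submitted to the
    // profiled queue while a pass is open; otherwise they behave like NOPs.
    void Execute(QueueId queue, const ToyCommandList& list) {
        if (queue != m_profiledQueue || !m_inPass) return;
        m_collected.insert(m_collected.end(), list.ranges.begin(), list.ranges.end());
    }

    const std::vector<std::string>& Collected() const { return m_collected; }

private:
    QueueId m_profiledQueue;
    bool m_inPass = false;
    std::vector<std::string> m_collected;
};

// Replays three submissions and returns the ranges that survive.
inline std::vector<std::string> RunToyScenario() {
    const ToyCommandList setup{{"Setup"}};
    const ToyCommandList raytracing{{"Raytracing"}};
    const ToyCommandList evaluation{{"Evaluation"}};

    ToyRangeProfiler profiler(QueueId::Compute);
    profiler.BeginPass();
    profiler.Execute(QueueId::Compute, setup);       // inside the pass: collected
    profiler.Execute(QueueId::Direct, raytracing);   // non-profiled queue: ignored
    profiler.EndPass();
    profiler.Execute(QueueId::Compute, evaluation);  // after EndPass: ignored
    return profiler.Collected();
}
```

In this model only "Setup" is collected, which matches the symptom of receiving data from one command list and nothing from the other.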

As for the “insufficient space” error, I’m not entirely sure which API it came from. I could not find that exact string in our API or utilities. If you can still reproduce it with the updated code, please feel free to post it here. Given there weren’t many metrics or ranges, I expect both the GPU and CPU memory usage to be minimal.

Thanks,
Yiran

If there is anything unclear in my last post, please let me know.

BTW, given your application is using ImGui, you could also consider integrating our HUD Counters utility: https://www.youtube.com/watch?v=0gpoWXpOadA, which is also based on ImGui. It adds one more ImGui window to your application and can collect and render the GPU metric data in real time through the HUD. If you’re interested, you can refer to the sample D3D12\D3D12Multithreading_HUD.

Hi Yiran,

Apologies for the confusion. I should have noted that the way it’s set up, I can specify the profiling configuration in a JSON file, which also includes which queue to profile. This is applied when the profiler is created and the session is started. The JSON config file to use can be specified in the command line arguments.

I have also set up the profiler so that calling “PROFILE_COMPUTE_BEGIN_PASS” while the direct queue is being profiled results in no operation. This is all implemented in GPUProfiler.h/.cpp. I have verified that no passes are started and no ranges are pushed/popped on the compute queue when profiling the direct queue, and vice versa.
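For readers following along, the filtering described above might look roughly like the sketch below. It is an assumption about the general shape, not the code in GPUProfiler.h/.cpp: FilteredProfiler and QueueKind are hypothetical names, and the real macros forward to the NvPerf range profiler rather than to a log.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names for illustration; not taken from the project or the SDK.
enum class QueueKind { Direct, Compute };

class FilteredProfiler {
public:
    // 'configured' stands in for the queue named in the JSON profiling config.
    explicit FilteredProfiler(QueueKind configured) : m_configured(configured) {}

    // Returns true if the pass was actually started, false if the call was skipped.
    bool BeginPass(QueueKind requested, const std::string& name) {
        if (requested != m_configured || m_inPass) return false;
        m_inPass = true;
        m_log.push_back("begin:" + name);
        return true;
    }

    // Returns true if an open pass on the configured queue was ended.
    bool EndPass(QueueKind requested) {
        if (requested != m_configured || !m_inPass) return false;
        m_inPass = false;
        m_log.push_back("end");
        return true;
    }

    const std::vector<std::string>& Log() const { return m_log; }

private:
    QueueKind m_configured;
    bool m_inPass = false;
    std::vector<std::string> m_log;
};

// Usage: with the direct queue configured, the compute-pass call is a no-op.
inline std::vector<std::string> DemoDirectOnly() {
    FilteredProfiler profiler(QueueKind::Direct);
    const bool computeStarted = profiler.BeginPass(QueueKind::Compute, "SDF Bake");
    const bool directStarted = profiler.BeginPass(QueueKind::Direct, "Frame");
    assert(!computeStarted && directStarted);
    profiler.EndPass(QueueKind::Direct);
    return profiler.Log();
}
```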

I later discovered a simpler example of the strange behaviour I encountered, this time on the direct queue. If I place my profiling macros in the main render loop (D3DApplication::OnRender) like this:

// Begin drawing
m_GraphicsContext->StartDraw();
PROFILE_DIRECT_BEGIN_PASS("Frame", g_D3DGraphicsContext->GetCommandList());

// Tell the scene that render is happening
// This will update acceleration structures and other things to render the scene
m_Scene->PreRender();

// Perform raytracing
m_Raytracer->DoRaytracing();
m_GraphicsContext->CopyRaytracingOutput(m_Raytracer->GetRaytracingOutput());

// ImGui Render
ImGui_ImplDX12_RenderDrawData(ImGui::GetDrawData(), m_GraphicsContext->GetCommandList());

// End draw
PROFILE_DIRECT_END_PASS(g_D3DGraphicsContext->GetCommandList());
m_GraphicsContext->EndDraw();

and then I run the project in Profiling configuration with command line arguments --profile-config="profile_config/profile.json" --gpu-profiler-config="profile_config/direct_gpuprofile.json" --profile-output="captures/profile_direct.csv", then no data will be output to captures/profile_direct.csv.

However, if I reorder the macros in the render function to:

// Begin drawing
m_GraphicsContext->StartDraw();
PROFILE_DIRECT_END_PASS(g_D3DGraphicsContext->GetCommandList());
PROFILE_DIRECT_BEGIN_PASS("Frame", g_D3DGraphicsContext->GetCommandList());

// Tell the scene that render is happening
// This will update acceleration structures and other things to render the scene
m_Scene->PreRender();

// Perform raytracing
m_Raytracer->DoRaytracing();
m_GraphicsContext->CopyRaytracingOutput(m_Raytracer->GetRaytracingOutput());

// ImGui Render
ImGui_ImplDX12_RenderDrawData(ImGui::GetDrawData(), m_GraphicsContext->GetCommandList());

// End draw
m_GraphicsContext->EndDraw();

Then everything works as expected. I have verified this to be the case on the latest commit on the master branch, if you would like to check that the same behaviour occurs for you.

Thank you for the suggestion to use the HUD counters - however my goal is to output the collected metrics over a number of frames to a spreadsheet for analysis. Thankfully re-ordering the macros fixes the issues as far as my project is concerned anyway, but I’m happy to help provide any more info on the issue.

Thanks,
Jamie

Hi Jamie,

Thanks for your clarification!

One quick comment on your last example: it’s fine if the command list that contains the PushRange/PopRange is recorded outside of the pass (formed by a BeginPass/EndPass pair), but the ExecuteCommandLists() call that submits it must occur inside the pass. Otherwise the PushRange/PopRange will be converted into NOPs (by design), as mentioned in my earlier post. Please note that BeginPass/EndPass are performed on the command queue, whereas PushRange/PopRange are typically on a command list (we do provide queue-level PushRange/PopRange as well), so the command queue doesn’t see the command lists or the instrumentation inside them until they are submitted to the GPU.

In your code, if I’m reading it correctly, “m_GraphicsContext->EndDraw()” is where the command list gets executed, so in the below pattern

PROFILE_DIRECT_END_PASS(g_D3DGraphicsContext->GetCommandList());
m_GraphicsContext->EndDraw();

ExecuteCommandLists() is actually called after EndPass, so no data is collected.

I quickly tried flipping the order of the two (in this case I also pass nullptr to both PROFILE_DIRECT_BEGIN_PASS and PROFILE_DIRECT_END_PASS because of the order change), and I can collect data just fine:

	PROFILE_DIRECT_BEGIN_PASS("Frame", nullptr);

	m_Scene->PreRender();

	m_Raytracer->DoRaytracing();
	m_GraphicsContext->CopyRaytracingOutput(m_Raytracer->GetRaytracingOutput());

	ImGui_ImplDX12_RenderDrawData(ImGui::GetDrawData(), m_GraphicsContext->GetCommandList());

	m_GraphicsContext->EndDraw();
	PROFILE_DIRECT_END_PASS(nullptr);
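One way to make this ordering hard to get wrong is a small scope guard that ends the pass only when the enclosing scope exits, after the submission has happened. The following is a minimal sketch under that assumption; PassScope is a hypothetical helper, and the strings merely stand in for BeginPass, the ExecuteCommandLists call inside EndDraw(), and EndPass.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical RAII helper: runs 'begin' on construction and 'end' on scope exit.
class PassScope {
public:
    PassScope(const std::function<void()>& begin, std::function<void()> end)
        : m_end(std::move(end)) {
        assert(m_end);  // an 'end' action is required
        if (begin) begin();
    }
    ~PassScope() { m_end(); }

    PassScope(const PassScope&) = delete;
    PassScope& operator=(const PassScope&) = delete;

private:
    std::function<void()> m_end;
};

// Records the call order for one frame so the ordering can be inspected.
inline std::vector<std::string> RecordFrameOrder() {
    std::vector<std::string> order;
    {
        PassScope pass([&] { order.push_back("BeginPass"); },
                       [&] { order.push_back("EndPass"); });
        order.push_back("ExecuteCommandLists");  // submission happens inside the scope
    }  // EndPass runs here, after the submission
    return order;
}
```

In a real integration the two callbacks would wrap the profiler's own begin/end calls; the 'end' callback runs in a destructor, so it should not throw.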

I will investigate the remaining issues you reported.

Thanks,
Yiran

(Continuing from last post)

I looked at the compute side and the workaround (0d56577) you submitted. I think it’s exactly the same issue. Prior to that change, EndPass was called before the ExecuteCommandLists (ECL) call, so the submission fell outside the pass and no data was collected (please refer to my last post):

	PROFILE_COMPUTE_END_PASS(); // <---------------------------------------------
	PIXEndEvent(m_CommandList.Get()); // SDF Bake
	{
		// Execute command list
		THROW_IF_FAIL(m_CommandList->Close());
		ID3D12CommandList* ppCommandLists[] = { m_CommandList.Get() };
		m_PreviousWorkFence = computeQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

		computeQueue->WaitForFenceCPUBlocking(m_PreviousWorkFence);
	}

I tried removing your workaround and replacing it with a single call to EndPass(), and it worked just fine:

	PROFILE_COMPUTE_BEGIN_PASS("SDF Bake", nullptr);

	BuildCommandList_Setup(pipelineSet, object, m_Resources);
	BuildCommandList_HierarchicalBrickBuilding(pipelineSet, object, m_Resources, maxIterations);

	{
		// Execute work and wait for it to complete
		THROW_IF_FAIL(m_CommandList->Close());
		ID3D12CommandList* ppCommandLists[] = { m_CommandList.Get() };
		const auto fenceValue = computeQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

		// CPU wait until this work has been complete before continuing
		computeQueue->WaitForFenceCPUBlocking(fenceValue);

		..
	}

	{
		// Read counter value
		...
	}

	BuildCommandList_BrickEvaluation(pipelineSet, object, m_Resources);

	PIXEndEvent(m_CommandList.Get()); // SDF Bake
	{
		// Execute command list
		THROW_IF_FAIL(m_CommandList->Close());
		ID3D12CommandList* ppCommandLists[] = { m_CommandList.Get() };
		m_PreviousWorkFence = computeQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

		computeQueue->WaitForFenceCPUBlocking(m_PreviousWorkFence);
	}

	PROFILE_COMPUTE_END_PASS(nullptr); // <--------------------------- Single line to EndPass

Collected results:

drops,0.5000,2837,258,SDF Bake, 818694, 32, 23, 45, 28, 28, 94, 2, 76, 26, 12, 28, 0, 13, 10, 13, 20
drops,0.5000,2837,258,SDF Bake/Edit Dependencies, 4667, 22, 32, 46, 21, 26, 96, 12, 100, 13, 5, 21, 0, 0, 18, 0, 0
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building, 289176, 7, 55, 38, 18, 18, 97, 2, 99, 11, 6, 18, 0, 0, 75, 0, 3
drops,0.5000,2837,258,SDF Bake/AABB Building, 3778, 1, 12, 87, 1, 3, 77, 4, 100, 0, 0, 1, 0, 0, 48, 0, 21
drops,0.5000,2837,258,SDF Bake/Brick Evaluation, 218287, 94, 4, 1, 82, 80, 91, 2, 99, 82, 38, 80, 0, 48, 47, 48, 48
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Brick Counting1, 114676, 7, 88, 6, 16, 16, 96, 1, 100, 12, 6, 16, 0, 0, 47, 0, 0
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Prefix Sum1, 4704, 0, 1, 99, 0, 0, 14, 1, 100, 0, 0, 0, 0, 0, 65, 0, 0
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Brick Building1, 6722, 5, 63, 32, 10, 6, 77, 1, 100, 9, 10, 6, 0, 0, 84, 0, 0
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Edit Culling1, 14907, 7, 66, 27, 20, 21, 97, 6, 99, 11, 5, 20, 0, 0, 76, 0, 1
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Brick Counting2, 52120, 4, 37, 59, 10, 10, 94, 1, 100, 7, 3, 10, 0, 0, 49, 0, 0
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Prefix Sum2, 8444, 0, 3, 97, 0, 0, 29, 0, 100, 0, 0, 0, 0, 0, 72, 0, 0
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Brick Building2, 7352, 16, 57, 27, 33, 22, 82, 2, 100, 28, 33, 22, 0, 0, 83, 0, 0
drops,0.5000,2837,258,SDF Bake/Hierarchical Brick Building/Edit Culling2, 31528, 28, 62, 10, 71, 71, 97, 7, 100, 32, 17, 71, 0, 0, 83, 0, 1
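If rows like these are post-processed before going into a spreadsheet, a small parser can separate the range hierarchy from the counter values. This is a sketch only: SplitTrimmed, RangeRow and ParseRangeRow are hypothetical helpers, the position of the range path (fifth field) is inferred from the rows above, and the meaning of the other columns is not assumed. It would also break on range names that contain a comma.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits 'text' on 'delimiter' and trims surrounding spaces from each piece.
inline std::vector<std::string> SplitTrimmed(const std::string& text, char delimiter) {
    std::vector<std::string> parts;
    std::istringstream stream(text);
    std::string piece;
    while (std::getline(stream, piece, delimiter)) {
        const auto first = piece.find_first_not_of(' ');
        const auto last = piece.find_last_not_of(' ');
        parts.push_back(first == std::string::npos
                            ? std::string()
                            : piece.substr(first, last - first + 1));
    }
    return parts;
}

struct RangeRow {
    std::vector<std::string> path;    // e.g. {"SDF Bake", "AABB Building"}
    std::vector<std::string> values;  // fields after the path, left uninterpreted
};

// Assumption: the fifth field (index 4) holds the '/'-separated range path.
inline RangeRow ParseRangeRow(const std::string& line) {
    const std::vector<std::string> fields = SplitTrimmed(line, ',');
    assert(fields.size() > 4);
    RangeRow row;
    row.path = SplitTrimmed(fields[4], '/');
    row.values.assign(fields.begin() + 5, fields.end());
    return row;
}
```

The length of 'path' gives the nesting depth of a range, which is handy for grouping child ranges under their parent.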

In your use case, I feel it’s more natural to always pass nullptr to BeginPass/EndPass, which uses the queue-level push/pop range underneath. Otherwise, an ECL is missing:

void NvGPUProfiler::EndPassImpl(ID3D12GraphicsCommandList* commandList)
{
	if (!m_Profiler.AllPassesSubmitted() && m_Profiler.IsInPass())
	{
		if (commandList)
			PopRangeImpl(commandList);   // <-------- Missing ECL(commandList) between this line and EndPass(), and it doesn't make much sense to execute it here either.
		else
			PopRangeImpl();

		THROW_IF_FALSE(m_Profiler.EndPass(), "Failed to end a pass.");

		m_DataReady = true;
	}
}

(What the queue-level PushRange/PopRange does is this: the NvPerfSDK DLL internally creates a command list, adds the push/pop range instrumentation to it, and ECLs it for you.)
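As a rough mental model of that behaviour (not the SDK's actual implementation), a queue-level range can be thought of as a tiny one-command list that is built and submitted on the spot. ToyQueue and its members below are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model; the names below are hypothetical, not the NvPerf SDK interface.
using ToyCommandList = std::vector<std::string>;  // a list is just its commands

class ToyQueue {
public:
    // Models a submission: appends the list's commands in submission order.
    void Execute(const ToyCommandList& list) {
        m_stream.insert(m_stream.end(), list.begin(), list.end());
    }

    // Queue-level range: build a one-command internal list and submit it at once.
    void PushRange(const std::string& name) {
        ++m_depth;
        Execute({"push:" + name});
    }

    void PopRange() {
        assert(m_depth > 0);  // a pop needs a matching push
        --m_depth;
        Execute({"pop"});
    }

    const std::vector<std::string>& Stream() const { return m_stream; }

private:
    std::vector<std::string> m_stream;
    int m_depth = 0;
};

// The caller's own command list ends up bracketed by the push and pop.
inline std::vector<std::string> SubmitWithQueueLevelRange() {
    ToyQueue queue;
    queue.PushRange("SDF Bake");
    queue.Execute({"dispatch:setup", "dispatch:evaluate"});
    queue.PopRange();
    return queue.Stream();
}
```

Because the push and pop are submitted by the queue itself, there is no user command list left waiting for its own submission between the pop and EndPass.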

Should you have any other questions, please let me know!

Thanks,
Yiran

Hi Yiran,

Ah, I see. My mistake was thinking that begin and end pass had to be called within the command list or they would no-op, similar to push/pop range. That does make a lot of sense!

Thank you very much for digging through my code, and I apologise that it was just my mistake. Your explanations have been very helpful and thank you for taking the time to help.

Many thanks,
Jamie

No problem at all! Glad to know the issue has been resolved.

Thanks,
Yiran