Poor multithreading performance compared to DX12

Well, at least on the current nvidia driver.

I made a simple test program to gauge performance for both APIs (and dx9 for comparison).
It runs two different tests in succession: the first draws 20,000 small quads to measure API call overhead, and the second draws a Julia set on a large quad made of 125,000 (somewhat animating) triangles to test shader execution performance.

Those look like this:
http://i.imgur.com/BbFY39v.png
http://i.imgur.com/e5HYOsq.png

Here are the source code and binaries for those interested.
[Src] https://drive.google.com/open?id=0BzeNJCHJJEyjUTZDTmF6andRZUE
[Bin] https://drive.google.com/open?id=0BzeNJCHJJEyjeDVURWlTaWVBNWM

You will probably need the Visual Studio 2015 redistributable package (https://www.microsoft.com/en-US/download/details.aspx?id=48145) to run the .exe.
If you want to compile the project, you will need Visual Studio 2015, LodePNG (http://lodev.org/lodepng/) and the Vulkan SDK (https://vulkan.lunarg.com/).
I also used VLD (https://vld.codeplex.com/), but you can disable it by simply commenting out “#include <vld.h>” in WinMain.cpp.

Anyway, for Julia set rendering, the performance of both APIs is almost identical, as expected.
But that wasn’t the case for the heavy draw call test.

With multithreading off, both APIs show similar performance (about 300fps) on my system (i7 4770, GeForce GTX 980).
But with MT on, dx12 runs at 600fps while vk stays at the same 300fps, no performance gain whatsoever.

The problem is, even though both renderers were running at the same 300fps in ST, GPU usage for dx12 was only 50%, while for vk it was well over 90%.
dx12 runs at only 300fps in this setup because of a cpu bottleneck (the cpu is busy recording and submitting commands in ST), while vk was already gpu-bottlenecked, even though the shader workload is minimal.
Hence, as soon as the cpu bottleneck is alleviated by MT, dx12 shows a huge performance leap while vk shows none.
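
To make the bottleneck reasoning concrete, here is a rough sketch of it as a frame-time model. This is not code from the test program. It assumes CPU and GPU work overlap fully and that recording scales perfectly across threads, and the per-frame times in `kDx12` and `kVk` are hypothetical values back-computed from the 300fps / 50% / >90% figures above, not measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Simplified model (an assumption, not a measurement): CPU and GPU overlap
// fully, so the slower of the two determines the frame time.
struct Workload {
    double cpuMsSingleThread;  // hypothetical record+submit time per frame on one thread
    double gpuMs;              // hypothetical GPU execution time per frame
};

// Frame time when recording is spread perfectly over `threads` threads.
double frameMs(const Workload& w, int threads) {
    double cpuMs = w.cpuMsSingleThread / threads;
    return std::max(cpuMs, w.gpuMs);
}

double fps(const Workload& w, int threads) {
    return 1000.0 / frameMs(w, threads);
}

// Fraction of the frame during which the GPU is busy.
double gpuUsage(const Workload& w, int threads) {
    return w.gpuMs / frameMs(w, threads);
}

// Hypothetical numbers chosen to match the figures reported in the post:
// dx12: CPU-bound at 300fps in ST with the GPU busy 50% of the time.
const Workload kDx12{1000.0 / 300.0, 1000.0 / 600.0};
// vk: same 300fps in ST, but the GPU is already busy ~96% of the time.
const Workload kVk{1000.0 / 300.0, 3.2};
```

In this model `fps(kDx12, 8)` doubles compared to one thread because the GPU still had headroom, while `fps(kVk, 8)` barely moves because the GPU time per frame was already close to the whole frame.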

I tried various setups (batch count, quad size, different shaders) and profilers to understand this situation.
And my conclusion is this:
vk can record and submit rendering commands very fast, even faster than dx12.
But for whatever reason, it imposes a heavier workload on the gpu than dx12 for each API call.

As a result, with MT off, if you artificially set up the test for a cpu bottleneck by increasing the batch count and reducing the quad size, vk runs faster than dx12.
But if you make it more gpu intensive, by increasing the quad size or using a more complex pixel shader, dx12 quickly outperforms vk.
With MT on, dx12 always runs faster than vk, sometimes more than twice as fast.

Microsoft’s GPUView also shows that the drivers for the two APIs behave differently.
http://i.imgur.com/N4DDnZB.png
http://i.imgur.com/XharmXP.png

The first one shows a dx12 trace of drawing 8,000 batches; the second is vk with the same setup.
In the “Hardware Queue” section, you can see small boxes stacked up.
Each of those boxes is a “command packet”, a stream of api commands which the driver sends to the hardware for execution.
See the wide horizontal blank gaps in the dx12 trace? That’s gpu idle time, and the vk trace doesn’t have those.

There’s a difference in how the boxes are laid out too: in the vk trace the boxes are much smaller and more numerous.
If you click one of those boxes you can see basic information about that particular command packet.
Regardless of the batch count setup, in dx12 the command packet is a uniform 32k bytes, while in vk it is rather small and varies in size (~2044 bytes).
http://i.imgur.com/uhsXxyS.png
http://i.imgur.com/9J92J0O.png

If this information is accurate, it means the dx12 driver batches commands in large uniform packets, while the vk driver behaves somewhat differently.
Whatever it does differently from the dx12 driver, it doesn’t look very effective.

Honestly I don’t understand why the drivers for the two apis have to behave so differently, with such a significant performance gap, because to me both apis look damn close to each other.
Yes, this test is an extreme case and real world games won’t exhibit this much performance difference.
But the bottom line is, the workload on the gpu per api call is always higher in vk than in dx12. And in today’s games, thousands of draw calls per frame are common.
The extra cpu overhead in dx12 can be mitigated by MT, but there’s no such option for the extra gpu overhead in vk.

That’s somewhat disappointing for a developer who plans to implement a new engine based on vulkan.
I’ll probably stick to vulkan because of its multiplatform nature, and in my opinion it’s a bit cleaner api than dx12.
So hopefully a future driver update will fix this issue.

From a quick look at the source code you are comparing two different things here. In Vulkan you use multiple secondary command buffers which are recorded in parallel and at the end you reference them in one primary cmd buffer which is submitted to the queue. The equivalent in DX12 would involve a call to ExecuteBundle. But you record multiple direct cmd lists and batch-submit them to the GPU. The obvious hurdle here is the render pass concept in Vulkan. Since a render pass cannot span multiple cmd buffers the DX12 way is not directly convertible to Vulkan. The only way to do this is to have multiple render passes (1 for begin, one for draw, one for end in your code) and use multiple primary cmd buffers that you also batch-submit. Obviously that would be bad on mobile, but DX12 doesn’t work on mobile in the first place. It would be interesting to see the performance of this test with multiple primary cmd buffers.

In my tests, secondary cmd buffer performance is worse on both CPU and GPU on the current NVIDIA driver. Not in the cmd buffer recording, but during queue submit. With heavy secondary cmd buffer usage i have seen submit times in the range of 2-4ms for about 10k secondary cmd buffers referenced by one vkCmdExecuteCommands call. Even if you leave out the recording of the secondary cmd buffers, the submit is so slow that just recording everything in one thread and submitting it is faster on CPU.

I have also seen the small packages in GPUView, but mostly in combination with rapid pipeline switches, like when forcing a pipeline switch between every draw call. But the optimizations between multiple secondary cmd buffers may be worse than just using one primary cmd buffer. You also use the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag for both the primary and secondary buffers. I could imagine that the driver disables some optimizations if you use this flag to speed up recording/submitting.

Regards

You seem to focus on secondary command buffers as the main culprit, but that is hardly the case here.
I mean, even if I could use primary command buffers for MT, how would it help when gpu usage is already 95% in ST?
The problem is that, whether it’s primary or secondary CBs, execution of those on the gpu is just plain slow in vk compared to dx12.

I don’t think a dx12 bundle is equivalent to vk’s secondary CB either.
The MSDN documents and other references I read made it clear that the main purpose of a bundle is reusability.
That means there can be extra overhead at recording time if it helps execution performance.
I didn’t use bundles here because their designed purpose doesn’t seem to fit.

On the other hand, every vk sample and document I came across used secondary CBs for MT.
Well, there’s practically no other option in vk because of the render pass implications, as you said.
If I have to specify redundant render passes (that sounds quite like a hack to me) just because there’s a performance issue with secondary CBs, then I think it’s another issue that should be resolved by a driver update.

You also mentioned submitting 10k secondary CBs, but in my test the total number of secondary CBs submitted is just 8 on my system, which equals the number of hardware threads.
So I don’t think your case is directly comparable with mine either.
I have also heard that the total number of command buffers (command lists) submitted should be kept in check for both APIs.

But again, with or without secondary CBs, vk just executes them slower than dx12.

As for VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, I thought that by specifying it the driver could optimize a little for the one-time submission scenario, but I might be wrong.
Either way, enabling or disabling it shows no performance difference on my system, so I just left it there.

Cheers.

Okay, i have gotten your Vulkan code to work on my platform. My system is quite different from yours, so the timings are not compatible. I run Vulkan on Win7x64 on a stone age i7 920 2.6GHz with a GTX 970. In the single threaded case, my GPUview looks like your DX12. 3 normal packages, 1 present package and 2 sync packages. The order is different, but it is the same amount of packages. I also get the same image for MT. I tested with the 8000 quad case and get around 400 FPS with both ST and MT because i’m GPU bound with around 2.3ms per frame. But my CPU line in GPUview looks very different from yours. In ST the total time is around 1.8ms, in MT it is 0.9ms. With 2000 quads in ST i start to get equal CPU and GPU time.

Obviously i cannot test DX12 on Win7, but i can confirm from my own code that the “packet fragmentation” causes a massive GPU slowdown, quite equal to your 2x number on Win10. I missed that the fragmentation was also present in the ST case. But i have to go to great lengths to cause that in my code, like thousands of secondary CBs or PSO switches that change the tessellation shader between on/off every draw call.

Secondary CBs use the same mechanism as bundles, that is being called from a primary CB or direct cmd list. Secondary CBs without the one-time flag would be pretty much the same as bundles and are in fact advertised for reuse in the spec, just like bundles. The one-time flag obviously changes that, but DX12 has no equivalent. In a perfect world that flag would signal the driver if one wants a “DX12 bundle” or a “need MT recording in a render pass”. The flag isn’t a hint, it is a requirement. The CB becomes invalid after submit and must be rerecorded. For secondary CBs the flag says it can only be called in a single vkCmdExecuteCommands call. So in theory, the driver could literally copy/inline the secondary CB over. But yeah, sadly, from my experience secondary CB performance on NVIDIA is not quite there yet and better drivers would be nice. On the other hand, i found a lot of Vulkan samples to be quite poor. Most of them are for easy showcasing, rather than performance optimization, so i wouldn’t put much trust into them.

The redundant render passes are indeed a hack from a Vulkan view. But IHV/driver specific code paths are not really off limits in Vulkan. In fact the different queues between IHVs/drivers almost require it anyway. On desktop there are basically just 3 hardware platforms, and all of them are quite different. Sure, you can code against a generalized set of features, but that may be overkill. NVIDIA can’t really use Async compute until Pascal, and even there it’s not the same as on AMD. Integrated hardware from Intel/AMD needs completely different streaming, including the lack of a DMA queue so you need different MT too. And AMD has a huge gap between shader and geometry throughput, so they require custom compute shaders to cull geometry down to single triangles. So testing if redundant render passes on NVIDIA have no overhead (i haven’t tested that yet) and maybe restricting that “hack” to NVIDIA is not off limits in my book.

Regards

First, thank you for the testing.
I appreciate it.

But I’m still not convinced that a bundle is the equivalent of a secondary CB.
See this:
https://msdn.microsoft.com/en-us/library/windows/desktop/dn899205(v=vs.85).aspx
At bundle creation time, the driver will perform as much pre-processing as is possible to make these cheap to execute later.

That means the driver might spend much more time in recording just for a relatively small performance gain in execution.
Why? In dx12 you can simply use regular command lists instead in a one-time submit scenario, without the limitations vk presents.
That lets the driver focus on the bundle’s main purpose - reusability - without worrying too much about recording time.

But in vk, well, you only have secondary CBs.
Using redundant render passes just for MT might be OK for you, and I would use it too if it were a commercial product or something and it really made things faster, but think about this:
Both APIs are designed for multithreading from the ground up.
Do you honestly believe vk was meant to have the same render pass specified again and again for MT?
It has nothing to do with whether those sample codes are reliable or not.
If secondary CBs are so slow that they’re practically useless for MT, then there’s something fundamentally wrong.

But again, my whole point has little to do with secondary CB performance.
Because even a primary CB is DAMN SLOW compared to dx12.
I could try primary CBs in MT by specifying multiple render passes as you suggested, but what’s the point of trying it when I already know vk is eating up 95% of gpu resources in ST while dx12 uses only 50% in the same scenario?

You said you are gpu bound with 8000 draw calls in ST on your system.
Well, it SHOULDN’T be.
Your GTX 970 shouldn’t be limited to 400fps just drawing 8000 quads of 20x20 pixels with that simple fragment program.
You can even reduce the per-fragment and rasterization work to practically nothing by setting the quad size to 0. (“const float s = 10.0f;” in QuadPool.cpp)
I bet you will still not have much room left on your gpu. In my case usage was 75% or so, versus 35% or so in dx12.

I made this test program for two reasons.

  1. To teach myself about these APIs.
  2. To find out whether these APIs are really as efficient as advertised.

I noticed something wrong: one API is significantly slower than the other where (I think) it shouldn’t be.
So I decided to report it here.
That’s all.

Cheers.

By the way,

Can you report the size of the packets in this case?
Thank you.

Well, if I was in charge of the spec or a driver, i would base my decision on the one-time flag, or have an additional optimize and/or inline flag in the API to separate the bundle vs. MT case. Obviously the render passes do hinder MT recording in some ways and secondary CBs don’t inherit enough state. I especially found the lack of inheriting viewport sizes from the primary CB, even if the viewport was dynamic in the PSO, to be quite limiting. But both primary and secondary CBs can be reused, and while DX12 has no information on that in advance, Vulkan does. So depending on the driver, Vulkan could spend that “extra CPU time to optimize” for any buffer, primary or secondary, that does not have the one-time flag. That is what i would expect, but clearly at least NVIDIA drivers don’t do that.

I wish i could test my own code base on Win10 to see if i get the same behavior as you do or compare Vulkan with DX12, but the privacy stuff in Win10 makes it basically impossible to use in my environment. That’s why my packet sizes are not going to help you. Win10 uses WDDM 2.0, Win7 uses WDDM 1.1, and by the looks of it they use a different model to transfer the data. The reported packet size on my main packet is just 12 bytes. So i believe that it is just a reference and the actual packet is in memory. Memory is handled differently in WDDM 1.1 and 2, so i cannot be sure. But you seem to have no allocation references, while my packet has over 30.

My question would be: is there a way for you to get rid of the fragmentation you see in ST Vulkan while keeping the explicit drawcalls? Without instancing or indirect draw of course. And then see how the timings compare with DX12.

Regards

OK. I see your point that the vk driver could optimize more for execution performance at the expense of recording time when the one_time_bit flag is absent.
Honestly I’m not sure if that’s intended by the spec or not, because the document doesn’t make it clear how an implementation would optimize for this or that with the specified flags.
I would also like to point out that reusability doesn’t seem really feasible anyway unless simultaneous_use_bit is specified.

I asked about the size of packets in the gpuview trace because sometimes the information seems inaccurate.
(Oh now I can post images!)

This is my vk trace with 20,000 draw calls.
The number of packets looks similar to the dx12 trace, but it reports 5 and 10 bytes for each packet’s size. I’m talking about standard dma packets here, not present tokens etc.
Unless the driver somehow converted those direct draw calls to some indirect form on its own, this looks like false information.
A large chunk of packet information might be missing, or it might simply report an inaccurate packet size. I don’t know.

As to your question about a case with no packet fragmentation, the screenshot above might be one, but it is also possibly inaccurate.
But regardless of the credibility of the gpuview trace in this case, dx12 was still faster by the same ratio. (> 2 times)
In none of the cases could I find any packets larger than 2044 bytes, nor any allocation references.
On the other hand, in dx12 it’s always reported as a uniform 32k size.

Cheers.

I agree that explicit “optimize for this or that characteristic” would be better or should be additional to the flags Vulkan has at the moment. Especially since Vulkan needs to support many more platforms, direct performance hints would make sense.
Simultaneous use doesn’t look like a requirement to me since you can manually double- or triple-buffer CBs instead of relying on the driver to do that. Since you need that for CPU writable buffers anyway, it’s not much work to do that for CBs as well.

Yes, the size of my standard DMA packages is around 8-12 bytes. But i do always have around 40 memory references in those packages. But then again, different driver model, so comparing them is hard.

If the (maybe) non-fragmented case is still twice as slow as DX12, then it is a real bummer. I wouldn’t/can’t change to DX12, so being stuck with worse performance is quite sad. I hope NVIDIA will address this. Are you lucky enough to have an AMD card to test this on? The app i’m working on is restricted to NVIDIA until AMD gets around to implementing sparse resources in their Vulkan drivers.

Regards

The spec made it clear that without simultaneous_bit it is not allowed to submit a CB while it is pending execution, or to resubmit it more than once.
I’m not sure how double/triple buffering CBs would help here.
Did you mean maintaining separate copies of a CB for every submission?
If not, could you elaborate a little more?

I’d very much like to know the test results on AMD gpus too, since I don’t have one.
If anyone can test it and share the results, that would be much appreciated.

Cheers.

I think you misread that part. The VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT allows you to resubmit a primary CB to a queue while it is still pending execution on that or any other queue. Without that flag, you need to sync with the fence from the vkQueueSubmit that started executing the CB before you can resubmit it. For secondary CBs they are pending execution as long as they are recorded into a primary CB. So as long as you sync with the fence, you can resubmit your CB afterwards. In my app i triple buffer CPU resources like mapped memory and the descriptor sets because they reference this memory. Since i cannot write into this memory as long as the CB that accesses it is pending execution, i need to sync with the fence anyway. So i can resubmit “static” CBs without the simultaneous_bit just naturally. The CBs that are recorded every frame are one-time submit and they are triple-buffered as well, as are their pools, so i can reset the pool after i sync with the fence. For secondary CBs without simultaneous use, your primary CB must be either a one-time submit and you can reuse the secondary CB after you sync with the fence and reset the pool or rerecord, or the primary CB must also be a “static” CB without simultaneous_bit and you can resubmit the primary CB after you sync with fence. Again in my app i record the next frame on the CPU while the current frame executes on the GPU with V-Sync. That is, after present is finished (which blocks on NVIDIA until V-Sync in FIFO mode) i directly submit the pre-recorded CBs for this frame and start recording for the next frame. So the CBs and CPU resources are triple-buffered, while the swapchain is double-buffered. Since i do not want to wait i acquire the next swapchain image only after i finished recording the next frame and then submit a static CB that does the final pass into the swapchain before presenting. In my tests no fence or acquire ever blocks with this setup and i do not need simultaneous_use anywhere.
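
The buffering scheme described above can be sketched as a small CPU-only model. It is an illustration under assumptions, not code from either app: `FrameRing`, `begin` and `countStalls` are hypothetical names, no Vulkan calls are made, and the GPU is simply assumed to lag a fixed number of frames behind the CPU. In a real app the `mustWait` case would correspond to blocking on the fence from the vkQueueSubmit that last used that slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Marker for a slot that has not been used by any frame yet.
constexpr std::uint64_t kNever = std::numeric_limits<std::uint64_t>::max();

// Hypothetical ring of per-frame resources (command buffers, pools, mapped
// memory, descriptor sets). Frame f uses slot f % slots.
class FrameRing {
public:
    explicit FrameRing(std::size_t slots) : lastFrameInSlot_(slots, kNever) {}

    // Picks the slot for `frame`. `gpuCompletedFrames` is how many frames the
    // GPU has fully finished (frames 0 .. gpuCompletedFrames-1).
    // `mustWait` is set when the slot's previous user is still pending, i.e.
    // the CPU would have to block on that frame's fence before reusing it.
    std::size_t begin(std::uint64_t frame, std::uint64_t gpuCompletedFrames,
                      bool& mustWait) {
        std::size_t slot = static_cast<std::size_t>(frame % lastFrameInSlot_.size());
        std::uint64_t prev = lastFrameInSlot_[slot];
        mustWait = (prev != kNever) && (prev >= gpuCompletedFrames);
        lastFrameInSlot_[slot] = frame;
        return slot;
    }

private:
    std::vector<std::uint64_t> lastFrameInSlot_;
};

// Counts how often the CPU would stall when the GPU is assumed to be
// `gpuLagFrames` frames behind at the moment each new frame starts recording.
int countStalls(std::size_t slots, std::uint64_t gpuLagFrames, std::uint64_t frames) {
    FrameRing ring(slots);
    int stalls = 0;
    for (std::uint64_t f = 0; f < frames; ++f) {
        std::uint64_t completed = f > gpuLagFrames ? f - gpuLagFrames : 0;
        bool mustWait = false;
        ring.begin(f, completed, mustWait);
        if (mustWait) ++stalls;
    }
    return stalls;
}
```

With two frames in flight (`gpuLagFrames = 2`), three slots never stall in this model, while two slots stall on every frame after the first two. That matches the argument for triple-buffering the CBs and CPU resources.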

Regards

I didn’t misread anything; you and I were just talking about different kinds of reusability.
By reusability I meant what dx12’s bundle offers: the ability to submit as many times as you want without needing to sync with the cpu.
You are talking about syncing with a fence to guarantee completion of execution on the gpu before resubmission, but you don’t even need a bundle for that, as a direct cmd list can do just the same. See the msdn link I posted earlier.

CBs with simultaneous_bit can do what a bundle offers, but I’m not sure IHVs would feel comfortable enough to optimize as aggressively for execution performance as they might with bundles, because the vk spec is not very clear about this.

I also used double-buffered command/constant buffers and descriptor sets in my test program, with a few differences from your case (struct PerFbData).
Since I only used immediate/mailbox present modes and I made sure AcquireImage() doesn’t block in this setup, I didn’t need one extra buffer for those cpu resources.
I did need one extra semaphore which is dynamically linked to the acquired image, though.
Another difference is that I record (update) cmd/constant buffers after AcquireImage(), because I cannot be sure of the index of the next image before the call, and I know AcquireImage() won’t block anyway, as I already said.

Cheers.

I just realized I made AcquireImage() non-blocking by using a 0 timeout.
That way I could get the next image’s index immediately and construct cmd/constant buffers etc. based on that index.

I think in the same way you could eliminate that one extra buffer with your fifo present mode too.
For safety you will probably still need one extra semaphore which is dynamically linked to the image, and you also have to specify it as a wait semaphore in the submit info.

Cheers.

The Vulkan spec mentions that some implementations may need to inline patch CBs at submission and that using simultaneous_bit might lower performance. Since the spec was developed with every major IHV, i would bet that this is true for at least one IHV. So using a 1:1 mapping from DX12 for bundles would have lowered performance in that case. Vulkan offers more options because it spans a wider range of hardware, and i would assume that is also the reason why performance hints for IHVs are rare. Since Microsoft developed DX12 also with IHVs and specifically designed bundles like they did, my wild guess is that some tiler on mobile has lower performance with simultaneous_bit. On the other hand you can also use primary CBs with the simultaneous_bit, no need for secondary CBs if you just want reuse without CPU sync. The one-time submit flag should make the optimization choice pretty clear for Vulkan: If one-time flag is set, do less optimization, if it is not set optimize fully. Now given, some devs might want to optimize CBs that they submit just once, so an explicit optimize flag would make this better. On the other hand it is harder to think of a case where you would want to reuse a CB (with or without CPU sync) but don’t want it optimized.

In the end i would also prefer an extra optimize flag for CB recording that makes the intent explicit to the driver. Preferably an “explicitly take more CPU time to lower GPU execution time” and an “explicitly take less CPU time even if it will raise GPU execution time” so the driver only needs to guess if neither flag is present. If someone from NVIDIA reads this: Extension or core Vulkan 1.1 please.

Now the AcquireImage is tricky. If you specify 0 timeout, how do you deal with a VK_NOT_READY return value? The problem here is that i want to keep the GPU busy. In FIFO mode on NVIDIA, present will block on the CPU until after V-Sync. I wrote about that in this forum because i would rather have the Acquire fence/semaphore block, but got no answer. This means after present returns, the GPU is idle. So if i now take say 10ms to record my CB for this frame, my GPU will be idle for this time. If GPU execution then takes 10ms as well, i miss V-Sync and present will block until the next V-Sync. So even though both CPU and GPU take 10ms each for a frame, they don’t work in parallel so it takes 20ms which is rounded up to V-Sync and i’m stuck at 30 FPS. Now in mailbox mode present never blocks. Not even with just 2 swapchain images (which the spec says behaves like FIFO, but on NVIDIA it doesn’t). Sounds good, but that means i have no control which images the user will actually see and which images will be discarded (so i wasted CPU and GPU time). With the 10ms example above, i render 100 images per second, but only 60 will be displayed, and they are chosen unevenly. This means that the timing for any animation, be it camera or objects will move 10ms or 20ms for every 16.6ms the user sees, and it gets choppy and unsmooth. In my case this is not acceptable, so i have to use V-Sync. So to keep the GPU busy, i need to have 2 CBs in the working set, one CB is executed on the GPU while the other is recorded on the CPU. That’s why i submit before recording the next frame. Now if i had just 2, this means i would need to CPU sync right after present, to be able to record into the CB the GPU just finished executing. And my timing and profiling shows that this sync right after present takes up to 1ms. Acquire right after present has about the same stall. So i submit 2 times per frame, the first one right after present with the CBs i just recorded during the last frame. 
They render into off-screen buffers and do not need the swapchain index. Then i record the next CB, and the time this takes hides the sync latency. So after i record, i acquire, which now does not block, and then submit the second time a small static CB depending on the swapchain index that does the last pass from the offscreen RT into the swapchain. This static CB is only double buffered, one per swapchain image. This means i also have 2 sets of fences, one triple-buffered for the first submit and one double-buffered for the second. The way i have arranged my frame they never actually block (execution time of either fence wait or acquire is 1-3 µs). I have 2 semaphores, one for the “acquire <-> wait for second submit”, and the second one for “signal after second submit <-> wait for present”. With the extra semaphore do you mean you use 3?
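
The frame pacing arithmetic in this post (10ms CPU plus 10ms GPU under V-Sync, and the uneven 10ms/20ms steps in mailbox mode) can be checked with a small sketch. It is an idealized model, not code from the app: `vsyncFps` and `mailboxStepsMs` are hypothetical helper names, a 60 Hz display is assumed, and in the mailbox case image n is assumed to be ready at exactly n * renderMs with the newest ready image shown at each V-Sync.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// FIFO / V-Sync: the time per displayed frame is rounded up to a whole number
// of refresh intervals, so the effective rate is refreshHz / that number.
double vsyncFps(double frameMs, double refreshHz) {
    double intervalMs = 1000.0 / refreshHz;
    // Small epsilon so a frame that exactly fits one interval is not rounded up.
    double intervals = std::ceil(frameMs / intervalMs - 1e-9);
    if (intervals < 1.0) intervals = 1.0;
    return refreshHz / intervals;
}

// Mailbox: a new image finishes every renderMs; at each V-Sync the newest
// finished image is displayed. Returns, for each of the first `vsyncs`
// refreshes, how far the animation time (ms) advanced since the previous
// displayed image.
std::vector<long long> mailboxStepsMs(long long renderMs, long long refreshHz, int vsyncs) {
    std::vector<long long> steps;
    long long prev = 0;  // image 0 is on screen at time 0
    for (int k = 1; k <= vsyncs; ++k) {
        // V-Sync k happens at k * 1000 / refreshHz ms; integer division gives
        // the index of the newest image finished by then.
        long long newest = (k * 1000LL) / (refreshHz * renderMs);
        steps.push_back((newest - prev) * renderMs);
        prev = newest;
    }
    return steps;
}
```

With CPU and GPU running one after the other, `vsyncFps(10.0 + 10.0, 60.0)` gives 30, while overlapping them so that a frame takes 10ms gives 60. `mailboxStepsMs(10, 60, 6)` yields steps of 10, 20, 20, 10, 20, 20 ms per 16.6ms refresh, which is the uneven motion described above.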

Regards

How about acquiring the image’s index for the NEXT frame BEFORE the present() call?
That way you can start constructing everything for the next frame right away, though you have to do this in separate thread(s) since present() will block the current thread with your fifo setup.

Yep, I meant 3 semaphores in my case, where 2 presentable images are used.

Regards.

Oh never mind.
I think I just remembered that things got ugly when I called AcquireNextImage() more than once a frame.
It does seem you are stuck needing some complicated solution with your fifo setup.

Cheers.

I just want to report that this problem is now gone.
I’m not sure exactly when, but nvidia finally corrected this issue.
With the current driver, vulkan performs even faster than dx12 in this test. (700fps on my pc)

It seems it got slower again in the latest nvidia drivers.

I tested multiple times, and I can confirm that after the 446.14 driver the FPS has dropped by half.

What is going on, nvidia?