DWM Crashes on Windows Server 2022 with A10 and Large Memory

I am working with a pair of new servers running Windows Server 2022, each with 2 TB of RAM and a single A10 GPU on the 538.33 driver. We run Remote Desktop Services to provide a multi-user environment for engineering applications.

When the 16th user logs in to the server, the login succeeds but session creation fails. In the Application event log, we see Application Error events reporting that DWM.exe crashed with exception code 0xe0464645. There is also an event from dwminit indicating that the DWM process exited with the same code.

Findings from troubleshooting:

  • Removing the GPU eliminates the issue, but is not workable for the use case
  • Installing Server 2019 eliminates the issue, and we may go down this path, but would prefer Server 2022
  • Reducing memory to 500GB eliminates the issue, but limits server usage given some of the designs and number of users.

We’re doing a little more testing, but at this point it looks like a combination of the A10 GPU driver with Windows Server 2022 and a lot of RAM.

I’m looking for help understanding whether this is a known issue and, if so, whether there is a timeline for a fix, as this feels like an Nvidia issue.

Thanks!

Sounds like a resource handling issue. It needs to be investigated by Enterprise Support. As you should have vApps licenses for this, you are eligible to open a support ticket.
Regards Simon

Hi @jfbradfo

We have been seeing something similar, also running RDSH. We get disconnects and problems reconnecting, with sessions showing black screens. Logons are slow and can stall for minutes. We also get the DWM crashes in the event log.

However we’re running on servers with 128GB RAM. For now we’ve disabled the A10s.

Did you find a solution?

-J

I can only repeat myself. Please open a support ticket to investigate the issue properly. There could be multiple reasons for this. Often it is framebuffer (FB) exhaustion, in which case you need to reduce the number of concurrent users (CCUs) on the given machine.

First of all: I’ve already opened a ticket with Enterprise Support for the vGPU-solution. However, they referred me to this forum. The support representative said we wouldn’t receive support during the 90-day trial period, even though the registration email for the trial period stated otherwise. I do find it a bit odd to ask for support in a public forum for a €150,000 data center solution in the pre-sales phase, but okay, here we are. If it works in the end…

We are currently testing the vGPU software and the associated graphics cards for our physical terminal servers. Unfortunately, we have run into a problem on them.

We have discovered that the DWM service (DWM.exe) crashes on the terminal server when a certain number of users is reached. Error code: 0xc00001ad
It doesn’t matter whether the user sessions on the server are active or disconnected. The actual load on the cards is also irrelevant. The error always occurs once a certain number of users is reached. We have already tested this with various cards (Datacenter with vGPU or Quadro).
To us, this looks like a fixed driver-dependent user limit or memory limit. We monitored the card load, which was approximately 1/3 memory load and approximately 20% GPU core load.
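
For anyone who wants to record the same card-load figures while sessions are being added, here is a minimal sketch in Python. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The function names `parse_gpu_csv` and `query_gpus` are made up for this example. Note that it only reports framebuffer and GPU core load, so it cannot show other OS or driver resources that might be running out.

```python
import subprocess

# Fields requested from nvidia-smi; order must match the unpacking below.
QUERY_FIELDS = "name,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Parse output of
    `nvidia-smi --query-gpu=<QUERY_FIELDS> --format=csv,noheader,nounits`.

    Returns one dict per GPU. Assumes every field is numeric; a value
    such as "[N/A]" would raise ValueError here.
    """
    rows = []
    for line in text.strip().splitlines():
        name, used, total, util = [part.strip() for part in line.split(",")]
        used_mib, total_mib = int(used), int(total)
        rows.append({
            "name": name,
            "fb_used_mib": used_mib,
            "fb_total_mib": total_mib,
            "fb_used_pct": round(100 * used_mib / total_mib, 1),
            "gpu_util_pct": int(util),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU load."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` after each additional logon and noting the values at the moment DWM crashes would show whether the crash lines up with framebuffer usage or happens well below it, as we observed.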

Particularly annoying: approximately 80% of our user sessions are inactive/disconnected, as users only log in once for time tracking. These sessions also count towards this limitation.

We tested this with an A2 and an RTX 2000 Ada card.
The A2 card was able to handle 60 sessions and the RTX 2000 Ada only 30, which makes no sense. The much faster card can handle only half as many?!

Can someone confirm that this error is caused by a limitation or something similar in the driver? If so, is there documentation on which card allows how many user sessions?

We plan to equip our data center’s terminal servers and our ESX servers with graphics cards. For financial planning and correct card sizing, we obviously need information about such limitations.

Hi Lars,

I’m wondering why you would expect to get Enterprise Support in an eval phase. Which other ISV would provide support during eval?
In general, your issue is a common one and is related to resource exhaustion in the OS. So there won’t be a resolution other than reducing the CCU count. Why would you add a GPU to RDSH if the purpose is time tracking?
There are sizing guides available, and the recommended CCU count for a 16GB framebuffer is between 20 and 30 users.
60 users on RDSH with a GPU is way beyond the technical limitation.
If this is your expectation, I can only recommend going without a GPU.
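
To put those user counts in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 16GB framebuffer is split evenly across sessions, which is a simplification: real sessions use different amounts, and the helper name `fb_per_session_mib` is made up for this example.

```python
def fb_per_session_mib(framebuffer_gib: float, sessions: int) -> float:
    """Average framebuffer share per session in MiB, assuming an even split."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return framebuffer_gib * 1024 / sessions


# Session counts mentioned in this thread for cards with 16GB framebuffer.
SCENARIOS = [
    ("sizing guide, lower bound", 20),
    ("sizing guide, upper bound", 30),
    ("60 sessions on one card", 60),
]

for label, sessions in SCENARIOS:
    share = fb_per_session_mib(16, sessions)
    print(f"{label}: {sessions} sessions -> {share:.0f} MiB per session")
```

Under that assumption, the recommended range leaves roughly 546 to 819 MiB per session, while 60 sessions leaves about 273 MiB each.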

Best regards
Simon

“I’m wondering why you would expect to get Enterprise Support in an eval phase. Which other ISV would provide support during eval?”
Virtually every major hardware or software vendor offers end customers proof-of-concept (POC) trials and support during the POC phase for hardware and software…
But why am I complaining about support? We’ve been trying to order licenses for our vGPU option from NVIDIA for a year now. Unfortunately, we haven’t received any offers from our reseller. We’ll probably just remove the data center cards again and replace them with vendor-independent passthrough.

Since we recently purchased new ESX servers, which were equipped with the mandatory RTX Pro 4000 Blackwell, I took the liberty of testing them as well.

A2 16GB (Ampere): 60 users per card
RTX 2000 Ada 16GB (Ada): 30 users per card
RTX Pro 4000 (Blackwell): 10 users per card
→ One more user and we get the famous 0xc00001ad error.

We have now successfully upgraded our two remaining GPU servers back to the RTX 2000 Ada. Back to the future!