Remote Desktop Session Host GPU Load Balancing

Hi,

Just wondering if anyone is successfully using multiple GPUs on a Server 2019 Remote Desktop Session Host?

I’m in the process of testing a couple of new session hosts, running them bare metal with Quadro P2200 GPUs (the session hosts aren’t expected to do anything heavy, just web browsing and documents; this is for a school, so cost is a big factor).

Server 2019 is supposed to perform ‘Load balancing between multiple GPUs presented to the OS’ according to this:

So I’ve tried sticking both GPUs in one box to establish if this increases our user capacity relative to single GPU performance, and in the testing I’m getting some strange results.

Basically the test is that I log onto the session host and open Internet Explorer on a web page with a lot of active content (background video, spinny graphics…), which happens to be the school website, then repeat for as many sessions as it takes to saturate the CPU.

With no GPU installed, this happens at 4 users - with the single Quadro P2200 we get up to 11 users.

When running two Quadro P2200s in the host, we get to nearly 20 users before the CPU is saturated, but you can see that the website is not rendering smoothly - there are stops and starts.

The issue becomes obvious looking at Task Manager - GPU 0 is running at:
95% 3d load
0% video encode
11% copy
52% video decode

where GPU 1 is running at:
14% 3d load
57% video encode
29% copy
2% video decode

So obviously the load being created is not being properly shared between the two.
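
To put a number on the imbalance, here is a minimal Python sketch that works only from the Task Manager figures quoted above. The helper names and the dictionary layout are illustrative assumptions, not anything Windows or the driver exposes.

```python
# Task Manager readings quoted in the post (percent per engine).
gpu0 = {"3d": 95, "video_encode": 0, "copy": 11, "video_decode": 52}
gpu1 = {"3d": 14, "video_encode": 57, "copy": 29, "video_decode": 2}


def imbalance(a, b):
    """Per-engine difference in percentage points (a minus b)."""
    return {engine: a[engine] - b[engine] for engine in a}


def busiest_engine(gpu):
    """Name of the engine with the highest load on one GPU."""
    return max(gpu, key=gpu.get)


diff = imbalance(gpu0, gpu1)
for engine, delta in diff.items():
    print(f"{engine:13s} GPU 0 - GPU 1 = {delta:+d} points")
print("GPU 0 busiest engine:", busiest_engine(gpu0))
print("GPU 1 busiest engine:", busiest_engine(gpu1))
```

On these four readings the 3D and decode work sits almost entirely on GPU 0 while the encode work sits on GPU 1, which looks more like a split by engine than by session, though that is only a reading of one snapshot.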

Is there some way to make this work?

I could virtualize two session hosts on each physical host and pass through one Quadro to each VM to work around the problem, but the hope was to keep the setup required on each host as simple as possible.

Just to be clear we are using the vanilla RDP protocol for this, with AVC444 etc enabled via Group Policy.

Heho,

Do you have any update on your problem, or have you reached any conclusions?

Does nvidiaopenglrdp do anything to alleviate the choppy RDP experience?
Did you try lifting the 30 fps RDP limit to 60 fps?
Does using AVC 4:2:2 instead of 4:4:4 change anything in this regard?
Do you have some idea how the picture changes between a bare metal deployment and a virtualized one with DDA?
Did you try a consumer card like a 12 GB RTX 3060 in conjunction with the Quadro?
How bad is the 5 GB memory limitation of the P2200 in a multi-session RDSH environment?

Questions, questions, questions…

Greetings from Berlin.

Hi McKay, thanks for getting in touch.

To be absolutely clear, the experience is good and smooth when not deliberately overloaded (as I had done in the test).

I didn’t try nvidiaopenglrdp as I wasn’t aware of it at the time, although I wouldn’t imagine any of the elements in the test were OpenGL. I suppose they could be.

I was able to raise the RDP frame rate on my Windows 10 desktop to 60 fps using a registry tweak. I seem to recall recreating the tweak on a session server and finding it wasn’t effective - however, I have a feeling that I was running Server 2016 at the time, and there might have been more luck with Server 2019. Ultimately it wasn’t a huge consideration as I doubted our users would notice.
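
For reference, the 60 fps tweak usually cited for RDP is the `DWMFRAMEINTERVAL` DWORD under the Terminal Server `WinStations` key. The post does not say which tweak was actually used, so treat that as an assumption. The Python sketch below only builds the matching .reg text and writes nothing to the registry; `build_reg_file` is a hypothetical helper name.

```python
# Assumption: the tweak is the commonly cited DWMFRAMEINTERVAL value;
# the post itself does not name the registry key that was changed.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Terminal Server\WinStations"
VALUE_NAME = "DWMFRAMEINTERVAL"
FRAME_INTERVAL = 15  # decimal; the value usually quoted for roughly 60 fps


def build_reg_file(key=KEY, name=VALUE_NAME, value=FRAME_INTERVAL):
    """Return the text of a .reg file that sets a single DWORD value."""
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key}]\n"
        f'"{name}"=dword:{value:08x}\n'
    )


print(build_reg_file())
```

Review the generated text and try it on a test machine first; whether it has any effect on a given Server version is exactly the open question here.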

I’m sure that AVC 4:2:2 would have some effect on the GPU capacity but I didn’t test for it - ultimately this mode looks worse so I was keen to avoid it (I did do some testing with Linux-based thin clients but the results were unsatisfactory, mainly as they had to use CPU decoding).

We were using a virtualised session host before commissioning these dedicated physical servers, and it had a Quadro P2200 assigned via DDA on Hyper-V - basically our testing showed that CPU limits could be fairly quickly reached on this session host, and we didn’t want it contending with the other server roles on the same hypervisor.

The easiest and cheapest way for us to scale up was to build a couple of workstations with 3950X CPUs, 64 GB ECC RAM and Quadro P2200 GPUs - our thinking was that this role did not warrant the same level of hardware as a hypervisor that might host core services such as a domain controller or email server.

These have proved more than ample for our current thin client roll out (22 units), indeed in practice one session host of this spec is plenty if we want to take the other unit offline for maintenance and upgrades.

But yes, once we had concluded that the bottleneck most easily hit by the host was CPU usage it made sense to scale up with dedicated physical resources - this calculation will change for you depending on how many users will be on the session hosts and how much CPU availability you have on your hypervisors.

I did try a consumer card, but it was only a 4 GB GTX 1650. It was better in terms of bang for buck: in the same test as above I could have 9 clients on the website with Internet Explorer, versus 4 with no GPU and 11 with the P2200.
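
As a rough comparison, the three results above can be put side by side relative to the CPU-only host. This is a minimal Python sketch using only the session counts quoted in this thread; `gain_over_baseline` is an illustrative helper, and no prices are included because none are given here.

```python
# Sessions reached before CPU saturation in the test described above.
sessions = {"no GPU": 4, "GTX 1650": 9, "Quadro P2200": 11}

baseline = sessions["no GPU"]


def gain_over_baseline(count, base=baseline):
    """Extra sessions and multiplier relative to the CPU-only host."""
    return count - base, count / base


for name, count in sessions.items():
    extra, factor = gain_over_baseline(count)
    print(f"{name:13s} {count:2d} sessions  (+{extra}, x{factor:.2f})")
```

Dividing the extra sessions by whatever each card cost you gives the bang-for-buck figure for your own purchase.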

It made sense at the time to stick with the P2200 as it left open the possibility of virtualising the session host(s) (although happily Nvidia have relaxed their previous restrictions and I am now using that 1650 with DDA in a virtualised desktop role) along with peace of mind that it was professional grade kit.

Which GPU makes sense for you will of course depend on what 3D work your users are doing - the actual 3D graphics needs of my users are minimal, hence the basic GPU chosen. You will need to choose your own tests if you want to establish your needs.

For our needs, 5 GB of memory simply isn’t a limit, although we’re usually only talking about ten users at a time - when I check the GPU usage at those times the stats show perhaps 20% memory usage at most.
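
A back-of-envelope version of that headroom, as a small Python sketch. It assumes GPU memory use grows linearly with the number of users, which is a simplification rather than something measured here; the function names are illustrative.

```python
TOTAL_GB = 5.0        # P2200 memory
USED_FRACTION = 0.20  # "perhaps 20% memory usage at most"
USERS = 10            # roughly ten users at a time


def per_user_gb(total_gb=TOTAL_GB, used_fraction=USED_FRACTION, users=USERS):
    """Approximate GPU memory per user at the observed load."""
    return total_gb * used_fraction / users


def users_until_full(used_fraction=USED_FRACTION, users=USERS):
    """User count at which memory would fill, assuming linear growth."""
    return round(users / used_fraction)


print(f"about {per_user_gb():.2f} GB per user")
print(f"memory would fill at roughly {users_until_full()} users")
```

On these figures the CPU limit described earlier would be reached long before GPU memory became the constraint.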

I do think the comparison between the 1650 and P2200 is somewhat interesting, as in principle these are very similar graphics cards - the differences being 1 GB of RAM and the limit of 2 video encoding streams on the 1650. It isn’t clear which of the two made the difference, so I would be cautious of assuming that an RTX 3060 will be capable of accommodating hugely more users, when it could be that the limit hit in that case is the video encoding streams.

Certainly it would be capable of accommodating hugely more 3D processing from those users, but it wouldn’t necessarily mean you could, for example, achieve 20 or 30 Internet Explorer users on my test.

Anyway, I might have some more time for testing now that we are coming into the summer holidays at last!