CUDA slower in Windows 7 than in Windows XP: same computer, two OSs, different run times

GadiK · November 8, 2009, 5:27pm

Hi.
I’m using a GTX 295 (Gigabyte) for my work. Here are the specs of my PC:

MOBO: Gigabyte GA-MA790XT-UD4P
CPU: AMD Phenom II X4 965
Mem: OCZ 4GB DDR3 1066

I have a dual-boot system consisting of Windows XP (SP2) and Windows 7 (Ultimate).
When I run my program on XP, each iteration takes ~350 ms (CUDA 2.2).

But when I tried it on Windows 7, each iteration took 800 ms, and every 10 (or so) iterations it jumped up to 2000 ms.
I’ve tried updating CUDA to 2.3 and Visual Studio from 2005 to 2008 (Express). Nothing I do gets me the result I get in XP.

I installed the latest updates from Microsoft (even the one that came out on Nov 5th).

Is there any solution for Windows 7 users?

Thank you,
Gadi

Cygnus_X1 · November 8, 2009, 6:10pm

You are not alone. I observed that my program runs about 50% slower on Windows 7.

tmurray · November 8, 2009, 7:11pm

Welcome to the fabulous world of WDDM launch overhead.

cbuchner1 · November 8, 2009, 7:29pm

Does it really have to be WDDM? Why can’t the CUDA specific pieces of the driver use a different kernel level interface?

You’d still get your WHQL if the graphics driver bits remain WDDM, right?

Christian

SPWorley · November 8, 2009, 8:47pm

Tim, that overhead applies just to the CUDA init time, right? Not every kernel launch?

Toolkit 3.0 helps a lot with init time overhead… does that help even more in W7?

I’m still based in Linux/XP … but dreading W7 because of WDDM. But won’t Nexus require WDDM and VS08? Argh!

SPWorley · November 8, 2009, 8:54pm

That’s a good question. In fact do Tesla cards need to use WDDM? There’s not even any display hardware on the cards.

I like that idea, to allow CUDA-only cards to be classified as non-video and therefore exempt from the WDDM abstraction.

I suspect this would make drivers ugly, though.

I wouldn’t even mind a hardware jumper or a different BIOS on the card that changes the board ID, so the card looks like a different class of hardware to Windows when it is first queried.

Simon_Green · November 8, 2009, 9:11pm

No, the overhead is for every kernel launch (same in Windows Vista). I believe it also depends on how many memory allocations you’re using. We’re working with Microsoft to improve this.
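
A rough way to see the per-launch cost on your own machine is to time an empty kernel and wait after every launch, so nothing can be batched. The sketch below is not from the thread: `emptyKernel` and `averageLaunchMicroseconds` are made-up names, and it uses `cudaDeviceSynchronize` from the current runtime API (toolkits of the 2.x/3.x era spelled it `cudaThreadSynchronize`).

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Empty kernel: the time measured around it is almost entirely launch overhead.
__global__ void emptyKernel() {}

// Returns the average wall-clock time, in microseconds, of one
// launch-then-wait round trip, or a negative value on a CUDA error.
double averageLaunchMicroseconds(int launches)
{
    // Warm-up launch so context creation is not included in the timing.
    emptyKernel<<<1, 1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();
        // Waiting after every launch keeps the driver from batching launches.
        if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / launches;
}
```

Calling `averageLaunchMicroseconds(1000)` from the same binary under each OS on a dual-boot machine should give a rough picture of the flat per-launch cost; the absolute numbers depend on the driver and hardware.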

tmurray · November 8, 2009, 9:38pm

That’s actually dramatically better in Win7 versus Vista. I measured it recently, and the per-allocation hit seems to be about 100x smaller (so it’s negligible now). The flat-rate overhead is about the same, though.

Why can’t we use some other interface alongside WDDM:

WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers! As a result, you can’t really do paging in a CUDA app, so you get zero benefit from WDDM. However, because it’s the memory manager, we can’t just go around it for CUDA because WDDM will assume it owns the card completely, start moving memory, and whoops your CUDA app just exploded. So no, there’s not really some magic workaround for cards that can also be used as display.

I like the way you think. Wouldn’t that also mean Remote Desktop just works with CUDA, then? And maybe no TDR timeouts that you can only disable with a system-wide registry key!

edit: also, just so I don’t sound like I’m preaching the end of everything, this varies a lot based on your usage pattern. We batch kernel launches to try to amortize as much of the WDDM overhead as possible. The problem comes in when you can’t really batch things: you do a kernel, wait for its result, and then conditionally do something else. At that point you get no batching, significant launch overhead penalties (especially if you have a short kernel), and poor performance compared to XP/Linux.

So, uh, don’t write your apps that way if you can avoid it…
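
One way to move away from the "kernel, wait, decide" pattern is to keep the decision flag on the device and read it back only once per batch of launches. The sketch below assumes that running a few extra iterations past convergence is harmless for the algorithm. `relaxStep`, `iterateBatched`, and the halving update are placeholders invented for illustration, not anything from the thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical iteration step: halves every element. If `converged` is
// non-null, clears *converged when any element is still above `tol`.
__global__ void relaxStep(float *data, int n, float tol, int *converged)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;
        if (converged != nullptr && data[i] > tol) *converged = 0;
    }
}

// Runs relaxStep until convergence, reading the flag back once per batch of
// `batch` launches. Returns the number of launches issued, or -1 on a CUDA
// error or if convergence is not reached within about `maxLaunches` launches.
int iterateBatched(float *d_data, int n, float tol, int batch, int maxLaunches)
{
    int *d_converged = nullptr;
    if (cudaMalloc(&d_converged, sizeof(int)) != cudaSuccess) return -1;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int launches = 0;
    int result = -1;

    while (launches < maxLaunches) {
        int h_converged = 1;
        // Reset the device flag; only the last step of the batch writes it.
        if (cudaMemcpy(d_converged, &h_converged, sizeof(int),
                       cudaMemcpyHostToDevice) != cudaSuccess) break;

        // Queue the whole batch without waiting in between.
        for (int k = 0; k < batch; ++k) {
            int *flag = (k == batch - 1) ? d_converged : nullptr;
            relaxStep<<<blocks, threads>>>(d_data, n, tol, flag);
        }
        launches += batch;

        // One blocking read-back per batch instead of one per launch.
        if (cudaMemcpy(&h_converged, d_converged, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess) break;
        if (h_converged) { result = launches; break; }
    }

    cudaFree(d_converged);
    return result;
}
```

With `batch` set to 1 this degenerates to the wait-after-every-kernel pattern described above; larger values trade up to `batch - 1` wasted steps for fewer host round trips.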

cbuchner1 · November 8, 2009, 11:34pm

Thanks for the technical insight, guys. Appreciate it.

Cygnus_X1 · November 9, 2009, 1:58am

Many thanks for the explanation :)
I guess there is also a reason why you cannot tell WDDM to allocate (nearly) all GPU memory to a CUDA application and then manage it internally without the useless overhead?

erdooom · November 9, 2009, 2:48pm

Seems the MS folks need to add “non-paged” memory for GPUs: tell the OS not to mess with this chunk of memory, and then not to check anything when the kernel or shader is launched.

erdooom · November 9, 2009, 2:50pm

oh and my app is about 50% slower on vista and “only” 25% slower on 7 weeeeeee

tanmay.Learns · November 9, 2009, 6:36pm

So if I understand correctly, neither Windows Vista nor Windows 7 will give me the entire RAM on the Tesla, or all the speedup the Tesla could deliver! (So I first pay for the awesome hardware and then pay for the OS to make it suck!) Further, if I need to use Nexus, I HAVE to use either Vista or 7.
Nice going.

Does CUDA 3.0 help in this matter? Is the next version of WDDM going to address this?

I understand that NVIDIA is not the one pulling the strings here, but some pointers as to whether this issue will be resolved soon would help developers decide whether to shift to these OSs or take a different path. Any pointers would be appreciated!

tmurray · November 9, 2009, 7:21pm

man, it would be really nice if we wrote a driver that worked with Remote Desktop and didn’t have these launch overhead problems and no timeout because that would be great, wouldn’t it? well, I beat you to it.

(I wouldn’t have moved to software if I couldn’t actually solve problems, guys :) )

edit: that’s a screenshot from my Mac connected to my dev machine connected to my test machine, just in case you were skeptical. Xzibit would be proud.

erdooom · November 9, 2009, 8:36pm

Don’t you know it’s not nice to tease?!

tmurray · November 9, 2009, 8:37pm

Patience is a virtue… :) (would I be talking about it if it were six months out?)

HannesF99 · November 10, 2009, 8:02am

Hi tmurray,

I didn’t understand the reason you gave for why CUDA cannot make use of the ‘Virtual-Memory’ feature of WDDM (at least on Windows systems).

You mentioned something about:

WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers!

Can you explain this in more detail?

tmurray · November 10, 2009, 8:18am

Under WDDM, you can have more GPU memory allocated than can fit in its physical memory so long as the working set of a given rendering call is not greater than physical memory. Pretty straightforward: it tracks what resources a rendering call will use, pages in and out as necessary, no problem. This is a good thing for display cards, especially when the UI is 3D accelerated.

However, in CUDA, you can use pointers in device code, which means it’s completely impossible for us to tell what memory you’re actually using since the data structure you pass may include pointers to 5000 other regions in various places that have not been referenced since they were allocated. As a result, all of the memory allocated by that CUDA context must be present on the GPU whenever you run a kernel since the driver can’t tell what memory you plan on using.
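
To make the pointer problem concrete, here is a small sketch in which the kernel receives a single pointer yet ends up reading several separately allocated regions. `Node`, `sumChain`, and `sumSeparateAllocations` are names made up for this illustration; nothing in the launch arguments tells the driver which allocations the kernel will touch.

```cuda
#include <vector>
#include <cuda_runtime.h>

// A node that points at another, separately allocated node.
struct Node {
    int   value;
    Node *next;
};

// The only pointer argument that names input data is `head`; every other
// node is reached by following pointers stored in device memory.
__global__ void sumChain(const Node *head, int *out)
{
    int total = 0;
    for (const Node *p = head; p != nullptr; p = p->next)
        total += p->value;
    *out = total;
}

// Builds a chain of `count` nodes holding the values 1..count, each in its
// own cudaMalloc allocation, and returns the sum computed on the device
// (or -1 on a CUDA error).
int sumSeparateAllocations(int count)
{
    std::vector<Node *> nodes;   // device pointers, kept so they can be freed
    Node *next = nullptr;        // device pointer to the most recently built node
    int *d_out = nullptr;
    int result = -1;
    bool ok = true;

    // Build the chain back to front so each node can point at its successor.
    for (int i = count; ok && i >= 1; --i) {
        Node *d_node = nullptr;
        ok = cudaMalloc(&d_node, sizeof(Node)) == cudaSuccess;
        if (!ok) break;
        nodes.push_back(d_node);
        Node h_node = { i, next };   // host copy holding a device pointer
        ok = cudaMemcpy(d_node, &h_node, sizeof(Node),
                        cudaMemcpyHostToDevice) == cudaSuccess;
        next = d_node;
    }

    if (ok && cudaMalloc(&d_out, sizeof(int)) == cudaSuccess) {
        sumChain<<<1, 1>>>(next, d_out);
        int h_out = 0;
        if (cudaMemcpy(&h_out, d_out, sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess)
            result = h_out;
    }

    cudaFree(d_out);
    for (Node *n : nodes) cudaFree(n);
    return result;
}
```

Since any of those nodes could have been paged out between launches, the only safe policy is the one described above: keep everything the context has allocated resident whenever a kernel runs.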

GadiK · November 10, 2009, 8:48am

Hi guys.
I didn’t really get an answer to my question.

Is there some solution for making CUDA run as fast on Windows 7 as on Windows XP?

(BTW, has anyone compared running CUDA on Linux vs. XP?)

Thanks.

Sarnath · November 11, 2009, 5:15am

There should be a way to expose a Tesla as a non-graphics card… That might help…
