I have a dual-boot system with Windows XP (SP2) and Windows 7 (Ultimate).
When I run my program on XP, each iteration takes ~350 ms (CUDA 2.2).
But when I tried it on Windows 7, each iteration took 800 ms, and every 10 or so iterations it jumped up to 2000 ms.
I’ve tried updating CUDA to 2.3 and Visual Studio from 2005 to 2008 (Express). Nothing I do gets me the result I get on XP.
I installed the latest updates from Microsoft (even the one that came out on Nov 5th).
That’s a good question. In fact, do Tesla cards even need to use WDDM? There’s no display hardware on the cards at all.
I like that idea, to allow CUDA-only cards to be classified as non-video and therefore exempt from the WDDM abstraction.
I suspect this would make drivers ugly, though.
I wouldn’t even mind a hardware jumper or a different BIOS on the card, so that the board ID could change and the card would look like a different class of hardware to Windows when it’s first queried.
That’s actually dramatically better in Win7 versus Vista: I measured it recently, and the per-allocation hit is roughly 100x faster (so it’s negligible now). The flat-rate overhead is about the same, though.
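For anyone who wants to measure this on their own machine, here’s a rough sketch of such a micro-benchmark (the allocation count and size are arbitrary; this is an illustration, not the measurement methodology used above). It times a burst of small cudaMalloc calls with a host-side clock, after forcing context creation so that one-time cost isn’t counted:

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

int main()
{
    const int N = 1000;
    void *ptrs[N];

    cudaFree(0);  // force CUDA context creation so it isn't timed below

    clock_t t0 = clock();
    for (int i = 0; i < N; ++i)
        cudaMalloc(&ptrs[i], 4096);   // many small device allocations
    clock_t t1 = clock();

    printf("avg cudaMalloc: %.3f ms\n",
           1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC / N);

    for (int i = 0; i < N; ++i)
        cudaFree(ptrs[i]);
    return 0;
}
```

Running the same binary on XP, Vista, and Win7 is what exposes the per-allocation difference between the driver models.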
Why can’t we use some other interface alongside WDDM?
WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers! Since the driver can’t tell what memory a kernel will touch, you can’t really do paging in a CUDA app, so WDDM buys you nothing. But because WDDM is the memory manager, we can’t just go around it for CUDA: WDDM will assume it owns the card completely, start moving memory around, and whoops, your CUDA app just exploded. So no, there’s not really some magic workaround for cards that can also be used as displays.
I like the way you think. Wouldn’t that also mean Remote Desktop just works with CUDA, then? And maybe no TDR timeouts that you can only disable with a system-wide registry key!
edit: also, just so I don’t sound like I’m preaching the end of everything, this varies a lot based on your usage pattern. We batch kernel launches to amortize as much of the WDDM overhead as possible. The problem comes in when you can’t really batch things: you launch a kernel, wait for its result, and then conditionally do something else. At that point there’s no batching, you pay a significant launch-overhead penalty (especially if you have a short kernel), and performance is poor compared to XP/Linux.
So, uh, don’t write your apps that way if you can avoid it…
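To make the two patterns concrete, here’s a hypothetical sketch (the kernel names, launch configurations, and the flag-in-element-zero convention are all made up for illustration). The first function synchronizes every iteration by copying a result back, so every launch pays the full WDDM submission cost; the second queues a whole batch and synchronizes once, letting the driver amortize that cost:

```cuda
#include <cuda_runtime.h>

__global__ void step(float *data);    // some short kernel (hypothetical)
__global__ void fixup(float *data);   // conditional follow-up work (hypothetical)

// Slow under WDDM: each iteration launches, then blocks to read a result
// back, so no batching is possible and every launch pays full overhead.
void unbatched(float *d_data, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        step<<<128, 256>>>(d_data);
        float flag;
        cudaMemcpy(&flag, d_data, sizeof(float),
                   cudaMemcpyDeviceToHost);      // implicit synchronization
        if (flag > 0.0f)
            fixup<<<128, 256>>>(d_data);
    }
}

// Better: queue the launches back to back with no intervening sync, then
// synchronize once, so the WDDM overhead is spread over the whole batch.
void batched(float *d_data, int iterations)
{
    for (int i = 0; i < iterations; ++i)
        step<<<128, 256>>>(d_data);   // asynchronous; launches are queued
    cudaThreadSynchronize();          // one sync for the whole batch
                                      // (the CUDA 2.x-era API name)
}
```

If your algorithm genuinely needs the launch-wait-branch structure, there’s no way to batch it, which is exactly the case described above.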
Many thanks for the explanation :)
I guess there is also a reason why you cannot tell WDDM to allocate (nearly) all of the GPU’s memory to a CUDA application and then manage it internally, without the useless overhead?
So if I understand correctly, neither Windows Vista nor Windows 7 will give me the Tesla’s entire RAM, and neither will give me all of the speedup the Tesla is capable of! (So I first pay for the awesome hardware, and then pay for the OS to make it suck!) Further, if I need to use Nexus, I HAVE to use either Vista or 7.
Does CUDA 3.0 help in this matter? Is the next version of WDDM going to address this?
I understand that NVIDIA is not the one pulling the strings here, but some pointers as to whether this issue will be resolved soon would help developers decide whether to move to these OSes or take a different path. Any pointers would be appreciated!
Man, it would be really nice if we wrote a driver that worked with Remote Desktop, didn’t have these launch-overhead problems, and had no timeout, because that would be great, wouldn’t it? Well, I beat you to it.
(I wouldn’t have moved to software if I couldn’t actually solve problems, guys :) )
edit: that’s a screenshot from my Mac connected to my dev machine connected to my test machine, just in case you were skeptical. Xzibit would be proud.
I didn’t understand the reason you gave for why CUDA cannot make use of the ‘virtual memory’ feature of WDDM (at least on Windows systems).
You mentioned something about:
WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers!
Under WDDM, you can have more GPU memory allocated than fits in the card’s physical memory, so long as the working set of a given rendering call is not greater than physical memory. Pretty straightforward: it tracks what resources a rendering call will use and pages them in and out as necessary, no problem. This is a good thing for display cards, especially when the UI is 3D accelerated.
However, in CUDA you can use pointers in device code, which makes it completely impossible for us to tell what memory you’re actually using: the data structure you pass may include pointers to 5000 other regions, in various places, that have not been referenced since they were allocated. As a result, all of the memory allocated by that CUDA context must be resident on the GPU whenever you run a kernel, since the driver can’t tell what memory you plan on using.
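A tiny sketch of why the working set is undiscoverable (the `Node` type and the `walk` kernel are invented for illustration). The driver only sees the single `head` argument at launch time; it has no way to know which of the context’s allocations the embedded pointers reach:

```cuda
// Each Node lives in device memory and points at other device allocations.
struct Node {
    float *payload;   // some separately cudaMalloc'd buffer
    Node  *next;      // ...which may lead to yet more allocations
};

__global__ void walk(Node *head)
{
    // Device code chases these pointers at run time, so any allocation
    // in the context might be touched by this kernel. A WDDM-style
    // memory manager can't page anything out safely.
    for (Node *n = head; n != 0; n = n->next)
        n->payload[0] += 1.0f;
}
```

A rendering call, by contrast, declares its resources (textures, buffers) up front, which is what makes paging tractable for graphics but not for CUDA.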