Multiple users running CUDA on WinXP


We are a team of people trying to use CUDA. We have a Windows XP box with an 8800 GTX installed in it.

Earlier, I was using RealVNC from my laptop to access and run programs on this machine.

But now that our team is growing, we want multiple users to log in to the machine and run programs, analyse results, etc. We don’t have a terminal server installed. But then, I think Terminal Server requires RDP, and RDP and CUDA do NOT co-exist. So Windows Server 2003 is also NOT an option.

How do we go about this? Any clues?

Linux seems to be a good alternative. But then, Linux does NOT have high-resolution timer APIs for its applications, which is kind of sad. We have to use “rdtsc” manually to do our job.

Anyway, I would prefer to use Windows because our prospective clients use Windows.


Best Regards,

I would not advise having multiple users on a Windows box. It just does not work: the scheduler is already not capable of handling one user with compute-intensive and interactive tasks, let alone more than one user (at least on XP).

What do you mean by high-res timers? I believe Linux has had them for a while.
See “The high-resolution timer API”.

Thanks for the link. The link talks about these timers in the kernel. Here is what the author (Corbet) himself says in the comments section of the same page:


I suppose I could have said something about that… hrtimers are used now for the implementation of POSIX timers and for the nanosleep() call, so, in that sense, yes they are available to user space.

The other thing which I really should have mentioned (I did in an earlier article) was that, in order to provide truly high resolution, you also need a high-resolution clock within the kernel. Current kernels still do not have that, so the hrtimer interface still works with HZ resolution - 4ms on i386 with the default configuration. There are a few high-resolution clock patches around, mainly tied to John Stultz’s low-level clock rework; something should get merged before too long, I would think (but not for 2.6.16).


All we need is a small kernel patch. But I don’t know why the Linux people are averse to it.

The worst case would be something like QueryPerformanceCounter() in Windows, which I think just uses “rdtsc” to read the time in user space itself. That measurement includes context-switch time, interrupt time, etc.

But getting a kernel interface would really be great and ACCURATE!

Does anyone know if a high-res timer patch got into the latest Linux kernel releases?

Well, the link is from more than 2 years ago, so I think current kernels have hrtimer support (it went in with 2.6.16, and I believe 2.6.26 is close to being released).

The only thing I found about how to use them:
For more accurate times, gettimeofday() is accurate to about a microsecond


(Got back to posting )

For a WinXP box you could deploy the service application in a DCOM environment.

Multiple (remote) users could easily access your service in a Multi-Threaded Apartment. Only a singleton would service CUDA launches and queue the requests.

For multiple GPUs this would work great, as a single RPC thread would itself be responsible for selecting a device and executing a kernel.

Remote transfer of a kernel could happen through nvcc calls on the target machine and custom configuration on the host machine.



Thanks for your suggestions on the DCOM component. You talked about it in an earlier forum entry and I have taken that into account. That is one reason why I want the development to happen on Windows. By the way, I have yet to figure out what DCOM/COM components are…

My question is NOT about how multiple CUDA kernel launches will be handled. I think the driver handles all that automatically, and I hope the driver scales across GPUs. I think people have been complaining about the lack of interfaces that let an application schedule its kernel on an appropriate available GPU. I just assume that if these interfaces are NOT exported to applications, the driver takes care of this by itself. Can someone from NVIDIA enlighten me here?


My question is how multiple users can access the GPU machine to run their programs, and NOT how multiple CUDA programs would run on a single node that has some GPU cards in it.

The problem is that RDP does NOT work well with CUDA. So we have to use VNC. And again, for multiple users to each see an individual desktop, Windows has to run Terminal Server. XP does NOT come with Terminal Server.

So my questions are

  1. Is there a version of VNC that works well with Windows Terminal Server? WinVNC, maybe? I’m not sure what it is; I came across it somewhere before.

  2. Is a Terminal Server package available for XP?

  3. Is CUDA compatible with Windows Server 2003, given that this version of Windows comes bundled with Terminal Server?

Can someone enlighten me here?

Thank you

The driver handles multiple applications executing on CUDA device 0 very well, although they will run slowly if executed simultaneously. To execute your application on a device other than 0, see cudaSetDevice() in the programming manual. Yes, this means you must manually manage which device your application runs on. CUDA gives you complete control over choosing which GPU to run on.

I’ve had the most success on XP with RealVNC, although I have never tried it with CUDA. Why don’t you just try it and see if it works for you? I don’t even know if multiple users can use VNC simultaneously. Doesn’t it control the main display on the machine?

As you have identified, your real trouble here is that Windows XP is a single-user OS. I don’t know if CUDA is compatible with Windows Server 2003 / Terminal Server.

I’ll just repeat what others have said here and say that Linux + sshd will easily allow you to have multiple users sharing one machine. If you object to a text-based command line, then each user can run their own VNC session and have their own graphical desktop. Setting up a RHEL (or CentOS) Linux install and putting CUDA on it only takes ~30 min, so there is no harm in trying it. You could even shrink the XP partition and install Linux next to it so you can always go back if you need to.

That’s baffling… How will I know which GPU is busy and which is NOT?? I will have to launch my apps from a shell script that hands out a GPU number, chosen round-robin or by whatever other scheme is available, as an argument to all my GPU programs…

But this means everything has to be invoked from a common entry point…

And that entry point will NOT help multi-GPU applications. Haaa…

That’s crazy.

Why can’t the driver schedule it for us?
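The "common entry point" idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `RoundRobinLauncher` class and the `--device=N` flag are hypothetical, and each launched program is assumed to parse that flag and pass the number to cudaSetDevice() itself.

```python
import itertools
import subprocess


class RoundRobinLauncher:
    """Hands out GPU indices 0..num_devices-1 in rotation, one per launch."""

    def __init__(self, num_devices):
        if num_devices < 1:
            raise ValueError("need at least one device")
        self._devices = itertools.cycle(range(num_devices))

    def command_for(self, program, *args):
        # Hypothetical convention: the program accepts --device=N.
        device = next(self._devices)
        return [program, f"--device={device}", *args]

    def launch(self, program, *args):
        # Starts the program without waiting for it to finish.
        return subprocess.Popen(self.command_for(program, *args))


if __name__ == "__main__":
    launcher = RoundRobinLauncher(num_devices=2)
    for _ in range(3):
        print(launcher.command_for("./my_cuda_app", "input.dat"))
```

Round-robin only balances the number of launches per GPU. It knows nothing about how long each job runs, and it does nothing for a program that wants several GPUs at once.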

RealVNC works like a gem with CUDA. I always use it. Today, for a change, I tried RDP and found that my GPU programs fail (insufficient memory, wrong results, etc.).

Installing Terminal Services would require RDP as the client. And RDP fails miserably with CUDA. Hmm… I think it is the other way around: it is Terminal Server that is NOT compatible with CUDA (not RDP, which is actually a client program).

So, Terminal Server is completely ruled out.

Thanks for this!! I have started thinking about it.

Do you have any idea whether RDP would work if I made some other NVIDIA graphics card (a low-end one) the primary display? Currently my GTX is the primary display. If that works, I would install Terminal Server on XP or migrate to Windows Server 2003 and see if it helps.

Thanks for all your help!

I did try “gettimeofday()” before. I don’t think I got results as consistently good as the ones I get on Windows.

They are just using HZ to do that. I don’t think it can go to sub-millisecond resolution.

It is such a simple thing to implement in the kernel.

Actually, both gettimeofday() and QueryPerformanceCounter() read real PHYSICAL (wall-clock) time, so a measured interval includes interrupts, context switches and other overheads. One needs to be careful with them.

On the other hand, if an OS implements process-times based on high res timers – it would really be accurate!
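A small sketch of the difference between a wall clock and a process-time clock, using Python's standard `time` module as a stand-in for the C calls discussed here. The helper names `measure`, `sleepy` and `busy` are made up for the illustration; which OS call backs each Python clock is platform dependent and can be inspected with `time.get_clock_info()`.

```python
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds) for the call."""
    wall_start = time.perf_counter()   # wall clock: includes time spent off-CPU
    cpu_start = time.process_time()    # CPU time charged to this process only
    work()
    cpu_end = time.process_time()
    wall_end = time.perf_counter()
    return wall_end - wall_start, cpu_end - cpu_start


def sleepy():
    time.sleep(0.2)  # the process is descheduled; almost no CPU time is used


def busy():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    for name, work in (("sleepy", sleepy), ("busy", busy)):
        wall, cpu = measure(work)
        print(f"{name}: wall={wall:.4f}s cpu={cpu:.4f}s")
```

For `sleepy` the wall time is about 0.2 s while the CPU time stays near zero; for `busy` the two are usually close unless other processes compete for the CPU.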

Best Regards,


As a sample of one, I don’t want to see the driver automatically choose a device to run on. 1) In a system with a slow display card and a fast compute card, I wouldn’t want to be at the mercy of the driver as to whether my app is blazing fast or as slow as the CPU. 2) Even if all devices in the system are equal, some may sit on fast PCIe slots and some on slow ones. I would want control.

That being said, I wouldn’t object to a way to query how many processes are using each GPU so that I can select the least-used one when running many batch jobs. I have a hack to do this on Linux, but a CUDA API call would be much cleaner.

I doubt it. As far as I understand it, RDP completely removes access to all display drivers except its own virtual one. Since CUDA needs to work through the display driver, it won’t be able to access it.

gettimeofday() doesn’t use the Hz timer on any linux distro I’ve tried it on. I’m not sure what it uses, but hundreds of repeated calls to it result in “jumps” of 1 us. So while gettimeofday isn’t required to provide any particular granularity, I’ve always found it to be very fine.
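The "repeated calls" experiment can be reproduced with a short sketch. It uses Python's `time.time_ns()` as a stand-in for gettimeofday(); the function name `smallest_step_ns` is made up for the illustration, and the numbers it prints depend entirely on the OS, kernel and hardware.

```python
import time


def smallest_step_ns(clock=time.time_ns, samples=100_000):
    """Return the smallest positive difference seen between consecutive
    readings of clock(), or None if the clock never advanced."""
    best = None
    prev = clock()
    for _ in range(samples):
        now = clock()
        delta = now - prev
        if delta > 0 and (best is None or delta < best):
            best = delta
        prev = now
    return best


if __name__ == "__main__":
    print("wall clock step:     ", smallest_step_ns(time.time_ns), "ns")
    print("monotonic clock step:", smallest_step_ns(time.monotonic_ns), "ns")
    print("reported resolution: ", time.get_clock_info("time").resolution, "s")
```

A clock driven by a 250 Hz tick would show steps of about 4,000,000 ns here, so the output makes it easy to tell a tick-based clock from a fine-grained one.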

And there are advantages to reading a “real” clock such as gettimeofday. On multiprocessor/multicore systems, you can’t trust the processor clock count as your process may have been moved to the other CPU between measurements!

Not to mention the fact that when you ask which GPU to use, GPU0 might be very busy, but 10 seconds later it might be idle. You first have to choose a GPU, then move data, perform your calculations, and then move your data again.

I would guess that choosing your GPU based on how much free memory it has may be a good and simple way to make your choice. If you have equivalent cards, the one with the most memory available is probably also the least busy.

Power-saving modes of the CPU also make the clock count unreliable, since the tick rate can change with the CPU frequency.

Why didn’t I think of that! Sometimes the simplest ideas are the best ones. Oh well, it’s time to go add a --device=most_mem_free option to my application.

Hmm, I’m curious about Tesla now. How are “remote access” and “multi-user” setups handled on Tesla-based machines?


Good one!! Just like what Mr. Anderson said!!! I could figure this out at run time.

Instead of the maximum amount of free memory, I would probably look at the maximum % of memory free, no?

If the heuristic holds water that the application using the least memory is the most likely to exit early, I think we have a cool idea here.
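Both variants of the heuristic (most bytes free versus largest fraction free) fit in one small function. This is a sketch under assumptions: `MemInfo` and `pick_device` are hypothetical names, and the free/total numbers are passed in as plain data; in a real program they would have to be queried from CUDA for each device (for example with the driver API's cuMemGetInfo).

```python
from typing import NamedTuple, Sequence


class MemInfo(NamedTuple):
    free: int    # bytes currently free on the device
    total: int   # total bytes on the device (assumed > 0)


def pick_device(mem: Sequence[MemInfo], by_fraction: bool = False) -> int:
    """Return the index of the device with the most free memory.

    by_fraction=False compares free bytes; by_fraction=True compares
    free/total, which matters when the cards have different sizes.
    Ties go to the lowest device index.
    """
    if not mem:
        raise ValueError("no devices given")

    def score(index):
        info = mem[index]
        return info.free / info.total if by_fraction else info.free

    return max(range(len(mem)), key=score)


if __name__ == "__main__":
    cards = [MemInfo(free=300, total=768), MemInfo(free=400, total=1536)]
    print("most bytes free:      device", pick_device(cards))
    print("largest fraction free: device", pick_device(cards, by_fraction=True))
```

With equal cards the two rules agree. They differ only on mixed hardware, as in the example data, where the larger card has more bytes free but is proportionally fuller.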

Thanks for this input!!


The application could spawn as many threads as there are GPUs. Each thread would launch a dummy kernel (that would probably take a few microseconds) on its own GPU and return when the kernel is done. The GPU corresponding to the thread that returned first could be used!!! A “cudaMemcpy” instead of a kernel is also a possibility. But there are cards out there that can overlap CUDA computation and memory copies, so a copy could finish quickly even on a busy GPU. That rules out the cudaMemcpy variant.
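The control flow of this "first thread back wins" probe can be sketched without any GPU code. Everything here is an assumption for illustration: `first_responder` is a made-up name, and each probe is a hypothetical callable standing in for "select device N, launch a dummy kernel, wait for it to finish". The sleeps in the demo only simulate a busy and an idle device.

```python
import queue
import threading
import time


def first_responder(probes, timeout=None):
    """Start one thread per device and return the index of the device whose
    probe finishes first.

    probes maps a device index to a zero-argument callable that blocks until
    that device has answered. Raises queue.Empty if no probe finishes within
    timeout seconds (for example because every probe failed).
    """
    finished = queue.Queue()

    def run(device, probe):
        probe()
        finished.put(device)  # only reached if the probe did not raise

    for device, probe in probes.items():
        threading.Thread(target=run, args=(device, probe), daemon=True).start()
    return finished.get(timeout=timeout)


if __name__ == "__main__":
    simulated = {
        0: lambda: time.sleep(0.30),  # pretend device 0 is busy
        1: lambda: time.sleep(0.01),  # pretend device 1 is idle
    }
    print("use device", first_responder(simulated))
```

The answer is only a snapshot: another application can grab the "idle" device between the probe and the real work.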

Best Regards,


There is a problem with both of these methods.

There could be race conditions among applications.

The only way to solve this is to write a middleware library that maintains state information in addition to using the methods listed above. All CUDA applications would have to link to this library. But again, I am not sure whether libraries can maintain common state among applications. I guess that’s not possible… Does anyone have any idea about this? Otherwise, the driver is the only place where I can maintain common state for all apps…
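One way a library can share state between otherwise unrelated processes is through OS-level objects such as lock files. The sketch below assumes a Unix-like system (it uses `fcntl.flock`, which is not available on Windows) and a lock directory that every participating application agrees on. The function name `claim_device` and the `gpuN.lock` file names are made up for the illustration.

```python
import fcntl
import os
import tempfile


def claim_device(num_devices, lock_dir):
    """Try to take an exclusive advisory lock on one device's lock file.

    Returns (device_index, fd) for the first device that nobody else holds,
    or None if all are taken. The lock is held until fd is closed or the
    process exits, so a crashed job cannot leave a device locked forever.
    """
    for device in range(num_devices):
        path = os.path.join(lock_dir, f"gpu{device}.lock")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)   # someone else holds this device; try the next one
            continue
        return device, fd
    return None


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()
    claim = claim_device(2, lock_dir)
    if claim is None:
        print("all devices busy")
    else:
        device, fd = claim
        print("running on device", device)
        os.close(fd)       # release the device when the work is done
```

The locks are advisory: they only protect against applications that use the same scheme, which is exactly the "everyone links to the library" requirement.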


Of course there can be race conditions if multiple users are submitting more jobs than there are GPUs. But then, if you want optimum performance, you only want one thing running on a GPU at a time. So the elegant most-free-memory technique will work as long as the 2nd job isn’t started until after the first has gotten running.

I wouldn’t write any fancy CUDA library that uses a bunch of hack methods to find if the GPUs are used or not. If you need that kind of functionality, wait until CUDA supports it (someone should request the enhancement in the bug report tool).

For now, if you need rock-solid scheduling of many jobs on many GPUs, use standard job-queueing software such as the Sun Grid Engine or OpenPBS. Both can be configured to allocate custom resources to jobs, and there is probably a way (with some scripting) to pass the allocated GPU number to the command line of the application. A job queue would be ideal even if CUDA had a built-in way to query GPU usage, as there could still be race conditions (e.g. job 1 starts and doesn’t actually use the GPU for several minutes; then job 2 starts and grabs the same GPU because it is unused according to any check that can be made).
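The property a job queue provides, at most one job per GPU with the device number handed to each job, can be shown with a toy in-process version. This is not how Sun Grid Engine or OpenPBS are configured; `run_jobs` is a made-up name, and each job is assumed to be a callable that takes the device index it was allocated.

```python
import queue
import threading


def run_jobs(jobs, num_devices):
    """Run every job(device) with at most one job per device at a time.

    One worker thread is bound to each device index and pulls jobs from a
    shared queue, so a device is never handed to two jobs simultaneously.
    Returns a list of (device, result) in the same order as jobs.
    """
    pending = queue.Queue()
    for index, job in enumerate(jobs):
        pending.put((index, job))
    results = [None] * len(jobs)

    def worker(device):
        while True:
            try:
                index, job = pending.get_nowait()
            except queue.Empty:
                return                      # no work left for this device
            results[index] = (device, job(device))

    workers = [threading.Thread(target=worker, args=(d,))
               for d in range(num_devices)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    jobs = [lambda device, n=n: f"job {n} ran on GPU {device}" for n in range(5)]
    for device, message in run_jobs(jobs, num_devices=2):
        print(message)
```

Because the allocation happens before the job starts, it does not matter whether the job touches the GPU immediately or only after several minutes, which is the race described above.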

Since there is no precise way of knowing whether a GPU is loaded, I still think centralizing the control under one application (not the driver) is a good idea. I am still working out an efficient way to deal with remote requests for scheduling on the GPU.

Hmmm, if you can come up with an alternative solution, do post it here!



Windows 2003 supports CUDA, at least in my experience. But what will you achieve even if you have Windows 2003 running with CUDA on it? How will that solve the problem of multiple users with remote access to the CUDA machine?
As someone pointed out earlier, when you RDP into a Windows XP or Windows 2003 box, the display driver that comes into the picture is the remote display driver for RDP. The NVIDIA driver is not loaded, and without it you will not be able to create a CUDA context.
Another thing you can try is to install Cygwin and set it up for SSH. Once that is done, all your users can log in via SSH and get access to the CUDA device. If the apps are graphical in nature this will not be a good solution, but if you are using the GPU as a number cruncher, it might work.



RDP was the problem I mentioned before, too; it is what makes Windows 2003 less suitable… I think the SSH idea looks great! I have SSHed into Windows boxes before, so CUDA on it should be a simple thing!


Best Regards,