Multi-user systems and multi-GPU usage


We have a Tesla S870 in a multi-user system, and I don’t really know how access to the devices is managed.

I know from the Programming Guide that it is possible to use cudaSetDevice() to choose a GPU.
Do I have to tell everybody to use a different device in order to share the resources of the S870?
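For what it’s worth, a minimal sketch of how device selection looks with the runtime API (the device count and names are whatever the driver reports; picking index 2 here is just an example of a per-user convention):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);          // the S870 should show up as 4 devices
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s, %u MB\n", i, prop.name,
               (unsigned)(prop.totalGlobalMem >> 20));
    }
    cudaSetDevice(2);                    // e.g. this user has agreed to take GPU 2
    // ... kernel launches from this thread now go to device 2 ...
    return 0;
}
```

Note that cudaSetDevice must be called before any other runtime call that creates a context on a device.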

Is it possible for two different users to use the same GPU?
Are there any built-in functions to control user access (does the thread scheduler manage multi-user access)?

And finally, is there some kind of self-management that makes it possible to use all four GPUs without selecting them explicitly?

Best Regards

I think it is possible to have multiple users. All of them can make use of the four Teslas, but only one user per GPU at a time. Say user 1 uses Tesla 1; user 2 can then run his application on Teslas 2–4. It is not possible to run more than one kernel at once on one GPU, so you need to manage that yourself.

I would say: use one Tesla per user and tell each user he is only allowed to use his own Tesla.

Thank you for your quick answer!

That’s a pity.
Do you know if there exists some kind of queue or similar mechanism if two users try to launch kernels on the same GPU?

Unfortunately, the chapter about device management in the Programming Guide is very short.
Do you know of any other literature on this problem?

If I were you, I would just write two simple test programs and run them both at the same time on the same GPU to see what the outcome is.

More than one process can access the same GPU at the same time. As long as the total memory allocated doesn’t exceed the free memory on the card, all apps will execute perfectly fine, but at much lower performance. So for testing and debugging purposes, everybody using the same GPU isn’t a problem. But for performance tuning and real application runs you definitely want only one process on each GPU.

cudaSetDevice is the only tool you have to manage this :( Better tools have often been requested; here’s hoping for them in CUDA 2.1.

In a production environment, job queues such as Sun Grid Engine or OpenPBS (tools normally used for cluster job scheduling) could be configured to schedule jobs onto GPUs.

But in a programming/test environment, communication between developers is probably the best way, as setting up a PBS job script for every execution you want to debug would get tedious. One suggestion might be to leave GPU 0 as the test/debugging GPU and leave the other three for performance testing. Add a command-line option for choosing the GPU early in development to make switching easy.
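A sketch of such a command-line option (the `--device` flag name and the default of GPU 0 are just assumptions for illustration):

```
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    int dev = 0;                            // default: the shared test/debug GPU
    for (int i = 1; i < argc - 1; ++i)
        if (strcmp(argv[i], "--device") == 0)
            dev = atoi(argv[i + 1]);

    int count = 0;
    cudaGetDeviceCount(&count);
    if (dev < 0 || dev >= count) {
        fprintf(stderr, "device %d out of range (0-%d)\n", dev, count - 1);
        return 1;
    }
    cudaSetDevice(dev);                     // before any other context-creating call
    printf("running on device %d of %d\n", dev, count);
    // ... rest of the application ...
    return 0;
}
```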

I’ve considered writing a gputop program that would let users know which GPUs are currently in use (there is a hackish way using lsof), but I haven’t gotten around to it.

That’s for sure. I tried some little programs running in parallel, and it worked. But I can’t draw any conclusions about how user access is organized from a loss of speed alone.

Perhaps it’s possible to find out this information with another test program. But I think it’s a little more difficult to write such a program than to ask somebody.

Thanks, I will hope for the best in new CUDA versions.

Best Regards

Ok, but that is very difficult. You cannot see how much memory has already been allocated by someone else; the only way is to call the function that checks how much memory is still available on the GPU. The applications I’m developing all use a lot of memory, so it is not possible for someone else to use the GPU at the same time. How would you get around that?
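For reference, the free-memory query lives in the driver API as cuMemGetInfo (later runtime releases added a cudaMemGetInfo equivalent). A minimal sketch, assuming device 0:

```
#include <cstdio>
#include <cuda.h>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);           // memory info is reported per device

    unsigned int freeMem = 0, totalMem = 0;
    cuMemGetInfo(&freeMem, &totalMem);   // free = what is left after everyone
                                         // else's allocations on this card
    printf("free: %u MB of %u MB\n", freeMem >> 20, totalMem >> 20);

    cuCtxDetach(ctx);
    return 0;
}
```

Bear in mind the result is only a snapshot: another process can allocate between your query and your own cudaMalloc, so this check alone is not a reliable lock.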

Thinking out loud -

Maybe an approach to the problem:
= Try to find out, before running the calculations, how much memory the task needs. Then you can allocate all the memory at once, and use the card’s available memory as a sort of semaphore. Perhaps.
= An architecture that runs CUDA code asynchronously will probably utilise the machine and cards better when the cards are used for many things at a time.

= There are probably some utilities out there to manage batch processing on ordinary clusters. Something that can manage batch processing on a 4-computer cluster can probably be adapted for job control on a 4-GPU “cluster”.

= Is there any chance of getting each developer a cheaper CUDA card to develop on? Those don’t have as much memory, but it may buy the administrator some time to get resource sharing settled by relieving pressure on the Tesla machine.
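The allocate-everything-up-front idea from the first point might look like this (the 512 MB working-set size is an assumed figure; whether one big allocation fits your application’s structure is another question):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t need = 512u << 20;      // assumed total working-set size
    void* pool = 0;

    // If this single large cudaMalloc succeeds, this process effectively
    // "owns" the card's memory for the rest of the run; if another process
    // already holds it, we fail immediately instead of mid-computation.
    if (cudaMalloc(&pool, need) != cudaSuccess) {
        fprintf(stderr, "GPU busy or out of memory - try another device\n");
        return 1;
    }

    // ... sub-allocate buffers out of 'pool' and do the real work ...

    cudaFree(pool);
    return 0;
}
```

Trying the allocation directly, rather than querying free memory first, also sidesteps the race between checking and allocating.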

Good luck!

In principle, an app already checks the error return from every cudaMalloc. Thus the application will fail somewhat gracefully with an out-of-memory error message.
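A common way to do that checking without cluttering the code is a small wrapper macro (a sketch; the macro name and the 64 MB test allocation are arbitrary):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err = (call);                             \
        if (err != cudaSuccess) {                             \
            fprintf(stderr, "%s failed: %s\n", #call,         \
                    cudaGetErrorString(err));                 \
            exit(EXIT_FAILURE);                               \
        }                                                     \
    } while (0)

int main()
{
    float* d_buf = 0;
    CUDA_CHECK(cudaMalloc((void**)&d_buf, 64 << 20));  // exits with a message
                                                       // if the card is full
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```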

Ok, thank you for this reply.