Hi, I am planning to use a TeraGrid cluster (…64TeslaCluster) to run my Monte Carlo Fortran codes with CUDA C, using a Fortran wrapper. I am planning to run on multiple nodes, each node having some Tesla cards. MPI is used to communicate between processors. But I am confused: when I run my code, each processor will need a GPU to run its part of the CUDA C code, so several processors might compete for the same Tesla card. How can I control this competition? Is there any way to do that?

In your MPI machine file, specify a single process to be run per machine with a GPU card attached?

Can you please provide a sample file with the instruction? Thanks a lot for your answer.

I can’t, as I have no clue about either the MPI installation type (MPICH/OpenMPI/…?) on this cluster or the cluster's node/network configuration. But you should be able to find out about machine files (and how to point to one when starting your MPI program) in your MPI installation's documentation. You should then talk with the sysadmin(s) of this particular installation about the nodes available for your work, their configuration etc., so that afterward you are able to prepare a machine file.

In principle, a machine file is simply a list of the nodes on which to run your MPI program, one machine per line. So my initial suggestion was to just put each machine with a GPU attached into this file once, so that each process launched has exclusive access to the GPU on its machine. This is probably too simplistic for what you eventually intend to do, but at least it should be something to start with.
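
Purely as an illustration of the idea above: the host names below are hypothetical, and the exact syntax and the option used to pass the file depend on the MPI installation, so check its documentation. A machine file that lists each GPU node once could look like this:

```
gpu-node01
gpu-node02
gpu-node03
```

Some installations also accept a per-host process count on each line (for example "gpu-node01:2" with MPICH, or "gpu-node01 slots=2" with Open MPI), which would be one way to ask for two processes on a node that has two cards.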

Side note: I think you should not crosspost your questions.

Thank you very much. I am new to the forum; I will not crosspost again. Thank you.

Your link is broken

I assume you are going to run on NCSA’s Lincoln cluster.

On that machine, the ratio of CPU cores to GPUs is 8:2 for each node.

If each of your MPI processes uses one GPU, then you probably want to run two MPI processes on each node.


Yes, I am using the cluster you mentioned. I will try out your suggestion. Thanks a lot. Can you provide me with a script file for running MPI and CUDA C on the Lincoln cluster?

Do I also need to include some logic in my code, like cudaSetDevice(), so that there is a one-to-one mapping between processors and GPUs?


Yes, you have to include some logic in the code to do the mapping.

It can be like this:

If there is a card connected to that processor's machine (this information can be obtained from cudaGetDeviceCount, cudaGetDeviceProperties and other runtime routines), the job can be submitted to the GPU. If there is no card, the work is run by the processor itself. If more than one card is connected to that machine, then we can use the cudaSetDevice routine to map each processor to one particular card, which gives better results.
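
A minimal sketch of that mapping logic, under some assumptions: pick_device is a hypothetical helper (not part of CUDA or MPI), and local_rank is assumed to be the index of the process among the processes on the same node (for example rank % processes_per_node, if the launcher fills nodes one after another). The CUDA calls are shown only in a comment so that the helper compiles on its own.

```c
/* Hypothetical helper (not part of CUDA or MPI): pick a GPU index for one
 * MPI process.
 *   local_rank   - index of this process among the processes on its node
 *   device_count - number of GPUs on the node, e.g. from cudaGetDeviceCount()
 * Returns the device index to pass to cudaSetDevice(), or -1 when the node
 * has no GPU and the process should run its part on the CPU. */
int pick_device(int local_rank, int device_count)
{
    if (local_rank < 0 || device_count <= 0)
        return -1;                      /* no usable GPU: CPU fallback */
    return local_rank % device_count;   /* round-robin over the cards */
}

/* Intended use inside the MPI program (needs the CUDA runtime, so it is
 * shown here only as a comment):
 *
 *     int n = 0;
 *     if (cudaGetDeviceCount(&n) != cudaSuccess) n = 0;
 *     int dev = pick_device(local_rank, n);
 *     if (dev >= 0) cudaSetDevice(dev);
 */
```

With two processes per node and two cards, local ranks 0 and 1 get devices 0 and 1, so each process has a card to itself. With more processes than cards, the round-robin makes several processes share a card.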

We have to see how many GPUs there are on that machine. If the machine is quad core but only one card is connected to it, then running one process per core won't be much use to us: only one kernel can run on the card at a time, so the kernels submitted by the different processes will be queued, which will not give good results.
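
To put a number on that queueing, here is a small sketch. procs_per_gpu is a hypothetical helper, and it assumes the processes on a node are spread round-robin over the cards. It returns the worst-case number of processes that end up sharing one card.

```c
/* Hypothetical helper: worst-case number of processes sharing one GPU when
 * procs_per_node processes are spread round-robin over gpus_per_node cards.
 * Returns 0 when the arguments describe a node with no processes or no GPU. */
int procs_per_gpu(int procs_per_node, int gpus_per_node)
{
    if (procs_per_node <= 0 || gpus_per_node <= 0)
        return 0;
    /* ceiling division: the busiest card gets the rounded-up share */
    return (procs_per_node + gpus_per_node - 1) / gpus_per_node;
}
```

For the quad-core machine with one card this gives 4 processes queueing on the same GPU. For a node with 8 cores and 2 GPUs it gives 4 per card if every core runs a process, and 1 per card if only two processes are started on the node.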