I am planning to purchase a Supermicro server that contains four C2050s. The OS is Windows HPC Server 2008 R2 (there is no forum for that). The host motherboard has dual 6-core Xeon CPUs (12 cores total) and dual chipsets that provide four PCIe x16 slots, two slots per chipset, for the GPUs. The plan is to use MPI (MPICH2) to run five processes on this system: one master and four slaves. Each slave MPI process would have exclusive control of one of the four GPUs, so one GPU co-processor would be allocated per CPU slave process.
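To make the one-GPU-per-slave idea concrete, here is a minimal sketch of the rank-to-device mapping (the function name `gpu_for_rank` is mine, and the MPI/CUDA calls are shown only in a comment):

```c
/* Map a slave's MPI rank to a CUDA device, assuming rank 0 is the
 * master and ranks 1..4 drive devices 0..3. In the real program each
 * slave would call, early in main():
 *
 *     int rank;
 *     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *     if (rank > 0)
 *         cudaSetDevice(gpu_for_rank(rank, 4));
 */
int gpu_for_rank(int rank, int num_gpus)
{
    /* Slave ranks are 1-based; CUDA devices are 0-based. */
    return (rank - 1) % num_gpus;
}
```

Once `cudaSetDevice` has been called, all subsequent CUDA work in that process stays on that device, which gives each slave the exclusive control described above.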
The reason for using MPI is that I am porting a code that already uses MPI. MPI starts processes for a high level of parallelism. OpenMP or CUDA starts threads for a lower level of parallelism. It is essential that this architecture be maintained.
It is not clear to me how best to do this. If I launch the MPI processes using mpiexec (from MPICH2 or MS MPI), I can specify a host and request five processes on that node. But which physical cores those processes run on is undefined, unless there is a convention of which I am unaware. Is there? I note that neither MPICH2's nor MS MPI's mpiexec has any way to set affinity to a particular processor.
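For concreteness, the launch I am describing would look something like this (the executable name and node name are placeholders):

```shell
# Start 1 master + 4 slave ranks on a single node via mpiexec
# (MPICH2 or MS MPI); note there is no flag here to pin ranks to cores.
mpiexec -n 5 -host node1 myapp.exe
```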
Anyway, what I would like to do (I think) is assign three processes (one master plus two slaves) to one 6-core CPU and two processes (two slaves) to the second 6-core CPU. Splitting the MPI processes between the CPUs would seem to be a way to get optimal throughput between the two GPUs attached to each chipset and the associated processes on the two CPUs. Is it?
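One way to get that split without mpiexec support would be for each process to pin itself at startup. Here is a sketch of the mask computation for the layout I describe, under two assumptions of mine: that cores 0-5 sit on the first socket and 6-11 on the second, and that each process would apply the result via the Win32 call `SetProcessAffinityMask`:

```c
/* Build a CPU affinity bitmask per MPI rank: ranks 0-2 (master + 2
 * slaves) on socket 0, ranks 3-4 (2 slaves) on socket 1. Assumes
 * cores 0-5 are on socket 0 and cores 6-11 on socket 1. On Windows,
 * each process could apply its mask at startup with:
 *
 *     SetProcessAffinityMask(GetCurrentProcess(),
 *                            (DWORD_PTR)affinity_mask_for_rank(rank));
 */
unsigned long long affinity_mask_for_rank(int rank)
{
    int socket = (rank <= 2) ? 0 : 1;        /* which 6-core CPU */
    unsigned long long six_cores = 0x3Full;  /* bits 0-5 set */
    return six_cores << (socket * 6);
}
```

This only restricts each process to a socket, not to a single core; the OS scheduler would still place it on any of that socket's six cores, which should be enough to keep each slave near its chipset's GPUs.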
I suppose that the operating system (Windows HPC Server 2008 R2) could be smart enough to figure this out and optimize assignment of the CPUs to the MPI processes, but somehow I doubt it.
If you have any ideas on how this should or could work, and what I have to do to control it, please let me know. Thanks.