You will have to spawn multiple threads in some way. I’m not terribly familiar with MPI, but if you can use pthreads within a single MPI process, that will work. Start as many threads as GPUs, and have each one call cudaSetDevice with a different number.
That doesn’t make much sense. In the CUDA multi-GPU paradigm, each GPU requires a context, and each context must be bound to an independent host CPU thread; those threads will probably need to communicate with one another at least to some degree. In a cluster of multi-GPU nodes the requirements are effectively the same: each GPU requires a context, each context must be bound to an independent thread, and the threads probably have to communicate with one another. You can use MPI in both cases (probably with almost the same MPI code). On a local node the message passing probably happens via some fast shared-memory IPC, and between nodes it happens over the wire, but your code doesn’t have to care.
You could, of course, use some other host CPU threading mechanism at a node level, but you don’t have to. MPI will (and does) work well in that sort of situation.
I see. My concern is with the way people construct GPU clusters: say 30 nodes, where each node is an 8-core CPU (with shared memory) connected to an S1070 (which has 4 GPUs). I don’t know how to use MPI to handle this kind of 2-level parallelization, because the node id will be the same for the 4 GPUs attached to the same CPU node.
You launch one MPI process per GPU context - each process has a unique node number and its own GPU context. It doesn’t matter whether you have four MPI processes with four GPU contexts on a single machine, or many spread across several cluster nodes. Each is uniquely identified in the MPI communicator.
The only downside in the single-host case is that running four processes with MPI messaging is a little more CPU- and resource-intensive than running four host threads with a lightweight threading library (like pthreads). But in this context it really isn’t going to make much difference.
Are you confident your MPI setup works? Can you build and run a simple MPI “hello world”:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, length;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);
    printf("%s: hello world from process %d of %d\n", name, rank, size);
    MPI_Finalize();
    return 0;
}
which should run and give you something like this:
[avidday@n0007 ~]$ mpiexec -np 6 src/mpihello
n0007: hello world from process 0 of 6
n0007: hello world from process 1 of 6
n0006: hello world from process 2 of 6
n0006: hello world from process 3 of 6
n0001: hello world from process 5 of 6
n0001: hello world from process 4 of 6