Mixed Programming Combining MPI and CUDA

Can anyone give me details on how to write a program that combines the MPI and CUDA APIs?
I am new to this and need a starting point, and I am very interested in it.

It is possible, and I have done it. But I think you should first learn CUDA, then MPI, and only then combine the two.

Hello sir,
I have learnt both CUDA and MPI.
I have programmed in each of them individually, and now I have heard that the two can be mixed.
I request you to give me some guidelines and direction on how to achieve this.
Thank you for your kind reply.

Then, it is pretty straightforward.

As a first step, just print the CUDA devices found on the various nodes.

So, you launch your job with “mpirun”, and each node finds its CUDA devices and prints them.

That’s good enough to start with, isn’t it?

Now you can write CU files and call kernels on all these nodes…

You need to divide your data-set based on the MPI Rank assigned to each node and so on.
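
One common way to divide the data by rank is a plain block partition. The following is only a sketch under stated assumptions: `rank_range` and `range_for_rank` are hypothetical names, not part of MPI or CUDA, and the data set is assumed to be one-dimensional with `total` elements. In a real program `rank` and `nprocs` would come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not from any library): the half-open
 * range [begin, end) of elements that one MPI process works on. */
typedef struct {
    size_t begin;
    size_t end;
} rank_range;

/* Split `total` elements over `nprocs` processes as evenly as possible.
 * The first (total % nprocs) ranks get one extra element, so the
 * chunk sizes differ by at most one. */
static rank_range range_for_rank(size_t total, int rank, int nprocs)
{
    size_t base, extra, r;
    rank_range out;

    assert(nprocs > 0 && rank >= 0 && rank < nprocs);

    base  = total / (size_t)nprocs;   /* minimum chunk size          */
    extra = total % (size_t)nprocs;   /* ranks that get one more     */
    r     = (size_t)rank;

    out.begin = r * base + (r < extra ? r : extra);
    out.end   = out.begin + base + (r < extra ? 1 : 0);
    return out;
}
```

Each process would then load only the elements in its own range and hand that slice to its local GPU.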

Where you store the data (database, distributed file system, Windows share or NFS share) is all up to you…

There is nothing that I am aware of that stops you from integrating MPI and CUDA. It’s fairly straightforward.

They can’t be “mixed up”.

In some applications it makes sense to use both. I work on simulations of large domains, where it is possible to use MPI with domain decomposition to distribute the computational domain over many nodes on a distributed memory cluster, and then use CUDA kernels and/or CUBLAS/CUFFT function calls to parallelize subdomain level computations on a companion GPU dedicated to each node. In embarrassingly parallel problems, or those with low interprocess communication overheads, like explicit time marching schemes, such an approach makes sense. But even then, the CUDA and MPI elements of the code aren’t “mixed up”; they are effectively totally separate.

MPI and CUDA are basically orthogonal parallel computing paradigms.

Dear sir,
Thank you for your kind and quick reply to my problem.
I don’t find it as straightforward as you do, because I have never written a program like this.
Suppose I call a kernel from each node: how should I compile and run such a program?
Is there any material that would clear up these doubts?
Could you send me some simple code, so that I can understand better from an example?
Thank you once again for your kind help.

What I meant is exactly what you are talking about. I have many nodes, and each node can call the kernel and run the program.
I feel we will get more parallelism this way.
But my doubt is how to go about doing this. I am very new to it. Can you suggest some material or some simple code that would clear up my doubts?
Thank you for your kind help.

#include <stdio.h>
#include <mpi.h>

/* Helpers you write yourself, as described below. */
int IsCUDACapable(void);
void printCUDADevices(void);

int main(int argc, char **argv)
{
	int myrank, length;
	char computer[MPI_MAX_PROCESSOR_NAME];

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
	MPI_Get_processor_name(computer, &length);

	if (IsCUDACapable()) {
		printf("%s: I am CUDA capable\n", computer);
		printf("%s: I have the following CUDA devices:\n", computer);
		fflush(stdout);
		printCUDADevices();
	} else {
		printf("%s: I am NOT CUDA capable\n", computer);
	}
	fflush(stdout);

	MPI_Finalize();
	return 0;
}

In the example above, each node either prints the CUDA devices it has or says that it is NOT CUDA capable.

You just need to implement “printCUDADevices” in the usual way: get the device count and, for every device, get the props and print the name.

The “IsCUDACapable()” function should be fairly simple to implement as well. Get the device count; if it is greater than 1, the node is certainly capable.

If it is exactly 1, check the props: if the name contains the string “emulation”, say NOT capable. Otherwise say capable…
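
The decision rule just described can be sketched in plain C, kept separate from the CUDA calls so that it can be tried on a machine without a GPU. This is only an illustration: `is_cuda_capable` and `contains_ignore_case` are hypothetical names, and the assumption that an emulation device can be recognised by the word “emulation” in its name comes from the rule of thumb above, not from any documented guarantee. The caller is assumed to pass in the count from cudaGetDeviceCount and the `name` field filled in by cudaGetDeviceProperties for device 0.

```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 if `word` occurs in `text`, ignoring letter case. */
static int contains_ignore_case(const char *text, const char *word)
{
    size_t i, j;

    for (i = 0; text[i] != '\0'; ++i) {
        for (j = 0; word[j] != '\0'; ++j) {
            if (text[i + j] == '\0' ||
                tolower((unsigned char)text[i + j]) !=
                tolower((unsigned char)word[j]))
                break;
        }
        if (word[j] == '\0')
            return 1;           /* matched every character of word */
    }
    return 0;
}

/* device_count: value reported by cudaGetDeviceCount().
 * first_name:   name of device 0 from its device properties,
 *               or NULL if it was not queried.
 * Returns 1 if the node looks CUDA capable, 0 otherwise. */
static int is_cuda_capable(int device_count, const char *first_name)
{
    if (device_count > 1)
        return 1;               /* more than one device: capable   */
    if (device_count == 1)      /* one device: reject emulation    */
        return first_name != NULL &&
               !contains_ignore_case(first_name, "emulation");
    return 0;                   /* no devices at all               */
}
```

The search ignores case so that a name spelled with a capital “E” is still caught.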

That almost certainly isn’t what I am talking about. I am talking about a situation where you have:

1. A distributed memory domain decomposition code which uses MPI for interprocess communications

2. Said code contains an embarrassingly parallel subdomain level workload

3. Said code runs on a distributed memory cluster where each node has its own CUDA capable GPU

In such a situation, it may then be feasible to parallelize the subdomain workload using CUDA code. In the resulting code, each node is simply offloading calculations from its local CPU onto its local GPU to increase performance at a subdomain level. The modifications to the existing distributed memory code can be very minimal indeed, and (from an MPI viewpoint) almost nothing changes.

I understand how to implement it, but what really bothers me is which compiler I should use. If I use mpicc, how do I link the .cu file to it, and vice versa?
I will try this simple program, but can you tell me how to compile and run the code?
Please try to understand, I am new to this kind of programming. How do I link the object code of one to the other?
I am very grateful to you; this fruitful discussion has cleared up so many of my doubts.
Only this one big doubt remains. Can you guide me on it?

You can use the standard CUDA build process with some slight modifications to the SDK common.mk. All that is required is to provide extra paths and libraries to the C/C++ compiler and linker to add in MPI support. In most flavours of MPI I have used, mpicc is nothing more than a wrapper script which decorates any existing compile and link arguments with the necessary extras for the preprocessor to find the MPI headers, and for the linker to find the MPI libraries. Identical results can be achieved with the standard C/C++/Fortran compilers with the additional arguments to support MPI explicitly added.

I don’t know what MPICC is.

But MPI is just a communication library. Compile your normal CPP files with MPI calls (just like the one I pasted above) and link them against the available MPI library. In the same project, put the kernels in “CU” files and compile those with NVCC. The linking then happens automatically, as they are in the same project.

That’s all. I don’t see a problem.

From the MPICC man page I found on the net:

You may need to set “OMPI_CC” to “nvcc” so that it uses nvcc as the backend compiler.

That is only applicable to Open MPI. There are at least two other popular open source MPI stacks available (LAM and MPICH2), and several vendor implementations. All are different. Without knowing what operating system, compiler and MPI stack he is using it is impossible to provide specifics.

Even if he is using Open MPI, you definitely don’t want to change the default compiler to nvcc. That will certainly not work.

Dear sir,
Good morning. Thank you very much for your help. We compile every MPI program using mpicc; that is what I meant by mpicc.
Any C file that has MPI calls we compile with mpicc program_name.
I think mpicc is just a wrapper which uses the existing compiler but links in the MPI library.
I have written a .cu file with MPI calls which asks each node for its count of CUDA devices.
Since it is a .cu file, I have used nvcc program_name.
Since the program has MPI calls, I have to link that library, so can you tell me how I should do that?

Dear Sir, thank you very much for your kind help.
This is what I have after reading your reply.
Please suggest what I have to do.

The program name is test.cu

#include <stdio.h>
#include <cuda.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int MyRank, NumberOfProcs;
    MPI_Status Status;
    int Root = 0;
    int Count, Device;
    struct cudaDeviceProp Properties;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
    MPI_Comm_size(MPI_COMM_WORLD, &NumberOfProcs);

    if (MyRank == Root)
    {
        cudaGetDeviceCount(&Count);
        cudaGetDevice(&Device);
        cudaGetDeviceProperties(&Properties, Device);
        printf("I am processor with my rank %d has %d number of Cuda Devices and their names are %s \n", MyRank, Count, Properties.name);
    }
    else
        printf("I am processor with My Rank %d and printing HELLO WORLD \n", MyRank);

    MPI_Finalize();
    return 0;
}

I compiled this with nvcc test.cu
and it gave
test.cu:3:17: mpi.h: No such file or directory
Then I realized that I have to include the MPI library.
My mpicc is installed at
/usr/local/mpich2-1.0.7/bin/mpicc
So I request you to tell me how to link against this installation and get the program running successfully.
At present I have only one card, on the root processor, which is why I have written the program like that.
Looking forward to your help.

You don’t need to compile that example with nvcc. Either compile it with the standard C build process in the SDK and add the MPI paths and libraries to it, something like

-I/usr/local/mpich2-1.0.7/include -L/usr/local/mpich2-1.0.7/lib -lmpich

should probably do it, or provide the CUDA includes and libraries to mpicc, something like

-I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart

So I compiled using the cc compiler, and this is the result:

cc -I/usr/local/mpich2-1.0.7/include -L/usr/local/mpich2-1.0.7/lib -lmpich -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart test.c
test.c: In function ‘main’:
test.c:11: error: storage size of ‘Properties’ isn’t known

When I used the nvcc compiler and linked the MPI library,
I got the following error:
nvcc -I/usr/local/mpich2-1.0.7/include -L/usr/local/mpich2-1.0.7/lib -lmpich test.cu
In file included from /usr/local/mpich2-1.0.7/include/mpi.h:1142,
from test.cu:3:
/usr/local/mpich2-1.0.7/include/mpicxx.h:26:2: #error "SEEK_SET is #defined but must not be for the C++ binding of MPI"
/usr/local/mpich2-1.0.7/include/mpicxx.h:30:2: #error "SEEK_CUR is #defined but must not be for the C++ binding of MPI"
/usr/local/mpich2-1.0.7/include/mpicxx.h:35:2: #error "SEEK_END is #defined but must not be for the C++ binding of MPI"

Why can’t I use the nvcc compiler and link the MPI library in the same way as I use the cc compiler and link both libraries?

So can you tell me what I should do to make this program run properly?

Read the MPICH2 documentation. The SEEK_SET conflict between C++ stdio.h and the MPI version 2 standard is well documented, and several workarounds are offered in the documentation.

I have found the way:
nvcc -I/usr/local/mpich2-1.0.7/include -L/usr/local/mpich2-1.0.7/lib -lmpich test.cu -DMPICH_IGNORE_CXX_SEEK
I think that define is what we have to give because of the conflict between the versions.
I feel I can now do some programming and get some results.
I thank you wholeheartedly for your kind help.
Thank you very much once more.
Would it be possible to send your mail id to meetpraveen_18@yahoo.com, so that I can ask you if I have any doubts?
Looking forward to your mail.