MPI and CUDA mixed programming

Has anyone thought about using MPI in conjunction with CUDA?

Assume either that the problem is straightforward to decompose for parallel execution, or that I have very fast switching available.

The answer to your question is yes. Was that the question you meant to ask though?

CUDA and MPI are orthogonal. You can use MPI to distribute work among multiple computers, each of which uses CUDA to execute its share of work. I’ve a tiny cluster, which I’m using to do just that.

Paulius

Sweet! I thought so too, but as I have no hardware right now (we are still going through the ordering process :( ) I cannot test it myself. I still have to convince those with the cash to do the right thing.

David

Would you be so kind as to share an example Makefile that compiles MPI source code integrated with CUDA?

I am also interested in this kind of program and would like a starting point for my tests.

Thanks.

Gabriel

Indeed, another way to paraphrase this is ‘use MPI for the coarse-grained parallelism among the nodes, and do whatever you want within the node by pushing the GPU(s) and the CPU cores to the limit’. The GPUs are programmed with CUDA. This is, by the way, a ‘minimally invasive’ approach, and I’ve been pursuing it with conventional offscreen GL rendering to accelerate things for a while now.

Things get funky if your parallel (in the MPI sense) application requires a lot of communication and if your local work isn’t decoupled enough. As an example, consider a parallel domain decomposition approach to a linear system solver. You’ll be hitting the PCIe bottleneck pretty fast, at least for sparse matrices. This is an active field of research, and I’ve published some papers on this.

Note that parallel linear system solving is something of a worst-case scenario; many other applications perform much more local work, so the data transfer from MPI across the PCIe interface to the CUDA device is not a prominent bottleneck.

To get back to the topic of this thread: Linking CUDA and MPI together is pretty trivial, at least it worked right out of the box for me :)

I started with an MPI program and “outsourced” the main computation parts onto the GPUs. Basically, I use one CPU per GPU. This works for multiple GPUs in one computer and/or multiple computers.

This was rather straightforward. I just used the makefiles from the examples and added the MPI library. That’s it.

best regards
Christian Mueller

You guys are saying that compiling mixed MPI and CUDA code is trivial and should work out of the box, but it seems I can’t figure it out.

I have this small Makefile which should compile a mixed MPI and CUDA code.

CC=nvcc

CFLAGS= -I/usr/local/mpich2-1.0.6.p1/include -I/usr/local/cuda/include -I/home/user/NVIDIA_CUDA_SDK/common/inc

LDFLAGS= -L/usr/local/mpich2-1.0.6.p1/lib -L/usr/local/cuda/lib -L/home/user/NVIDIA_CUDA_SDK/lib -L/home/user/NVIDIA_CUDA_SDK/common/lib

LIB= -lcuda -lcudart -lcutil -lm -lmpich -lpthread

SOURCES= Init.c main.c

EXECNAME= Exec

all:

        $(CC) -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)

But in fact I get these errors:

nvcc -o MatrixProduct CalcCirc.c Init.c main.c  -lcuda -lcudart -lcutil -lm -lmpich -lpthread -L/usr/local/mpich2-1.0.6.p1/lib -L/usr/local/cuda/lib -L/home/user/NVIDIA_CUDA_SDK/lib -L/home/user/NVIDIA_CUDA_SDK/common/lib -I/usr/local/mpich2-1.0.6.p1/include -I/usr/local/cuda/include -I/home/user/NVIDIA_CUDA_SDK/common/inc

In file included from Init.c:12:

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:123: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFilef’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:139: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFiled’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:155: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFilei’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:170: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFileui’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:186: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFileb’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:202: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFileub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:216: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFilef’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:230: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFiled’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:242: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFilei’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:254: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFileui’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:266: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFileb’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:278: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFileub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:294: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:307: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPPMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:321: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPPM4ub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:337: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMi’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:353: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMs’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:368: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMf’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:380: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:392: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePPMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:405: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePPM4ub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:417: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMi’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:429: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMs’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:441: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMf’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:462: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCheckCmdLineFlag’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:476: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumenti’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:490: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumentf’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:504: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumentstr’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:519: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumentListstr’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:533: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCheckCondition’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:545: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutComparef’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:558: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutComparei’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:571: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCompareub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:585: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutComparefe’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:600: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCompareL2fe’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:613: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCreateTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:622: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutDeleteTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:630: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutStartTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:638: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutStopTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:646: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutResetTimer’

Init.c: In function ‘CUDAInit’:

Init.c:62: error: ‘cudaDeviceProp’ undeclared (first use in this function)

Init.c:62: error: (Each undeclared identifier is reported only once

Init.c:62: error: for each function it appears in.)

Init.c:62: error: expected ‘;’ before ‘deviceProp’

Init.c:63: error: ‘deviceProp’ undeclared (first use in this function)

Init.c:75: error: expected ‘;’ before ‘}’ token

Init.c:120: error: expected declaration or statement at end of input

make: *** [all] Error 255

Does anyone have any idea why?

Thanks

OK, the errors above turned out to be my fault: I hadn’t renamed the file containing the CUDA code with the .cu extension, so nvcc wasn’t preprocessing it and was treating it as normal C code, passing it directly to gcc.

Now I encounter another problem during my compile process.

In mpicxx.h I receive errors saying that support for exception handling is disabled.

Running nvcc with the -v switch, I see that cudafe is called with the parameter “--no_exception”.

How can I change this parameter and tell cudafe not to use it?

Try adding

--host-compilation=C++

to your nvcc call. If you have C++ code in your .cu, you’ll need it.
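For anyone hitting the same two problems: here is a sketch of how the Makefile from the earlier post might look once the CUDA-containing file is renamed to .cu and the flag above is added. The MPICH/CUDA paths and library names are simply copied from that post and are assumptions you will need to adapt.

[codebox]# Sketch only -- the MPICH and CUDA paths below must be adapted to your installation.
NVCC      = nvcc
NVCCFLAGS = --host-compilation=C++ -I/usr/local/mpich2-1.0.6.p1/include
LDFLAGS   = -L/usr/local/mpich2-1.0.6.p1/lib -L/usr/local/cuda/lib
LIBS      = -lcudart -lmpich -lpthread -lm

EXECNAME  = Exec
OBJS      = Init.o main.o

all: $(OBJS)
	$(NVCC) -o $(EXECNAME) $(OBJS) $(LDFLAGS) $(LIBS)

# Files containing CUDA code must use the .cu extension so nvcc preprocesses them.
%.o: %.cu
	$(NVCC) $(NVCCFLAGS) -c $< -o $@

%.o: %.c
	$(NVCC) $(NVCCFLAGS) -c $< -o $@
[/codebox]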

When I use mpich-1.2.7, everything works fine at compile and link time,

but when I run the program, cudaMalloc does not work after MPI_Init.

So I changed to mpich2 and now I get a new error:

error: expected an identifier

Do you have any hints?

Thanks in advance.

I have just successfully run an MPI program that uses CUDA. I set it up so that I have a 2-node cluster, with a combined total of 10 CPUs. I launched 10 instances of MPI (1 per CPU) using mpdboot and a host file. My program is a simple MPI program that executes 1 process per CPU, and each of those processes calls a CUDA kernel.

This works because each of the CUDA cards in the cluster supports the “Default” compute mode, where a single card can be used simultaneously by multiple processes. However, running more kernels than there are CUDA cards is probably not the best practice.
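(As an aside, and only as a sketch assuming a CUDA 2.2 or later runtime, which exposes the compute mode in cudaDeviceProp, you can check which mode a card is in programmatically.)

[codebox]/* Sketch: query the compute mode of device 0 (requires CUDA 2.2+). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute mode: %s\n",
           prop.computeMode == cudaComputeModeDefault   ? "Default (shared)" :
           prop.computeMode == cudaComputeModeExclusive ? "Exclusive"        :
                                                          "Prohibited");
    return 0;
}
[/codebox]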

Also, there are a total of 5 CUDA cards in these 2 nodes, and as it stands all the processes running on the same host will execute on the same card, leaving all secondary and tertiary cards idle.

This makes me wonder if there is a CUDA-aware Torque / Maui module or plug-in out there that can handle launching only as many kernels on a box as it has cards, and setting the active CUDA card differently for each process on a specific box. It could possibly alter the data used in invoking the kernel so as to tune the program for the different cards. This may not be a good idea, however, as it gets into how to actually split up the job, which is historically the job of the programmer and not the tools.

Also, MPI supports different processes talking to each other and syncing on each other, which is currently impossible to do with CUDA. This is due to the fact that not all blocks in a specific CUDA program are executing simultaneously if, say, there are more blocks that need to run than there are multiprocessors available to run the blocks. So, the blocks themselves can’t talk to each other on the same host, and the threads within a block certainly cannot talk to a thread on a different host through some sort of MPI call. “Fixing” this problem is probably impossible, as it is tied pretty directly to the architecture of the graphics chip. Plus, GPUs don’t know anything about network cards, so the GPU would have to notify the CPU to run an MPI call and then wait for the network lag, and the whole thing would be a disaster.

So, I guess what I’m trying to say is that with the current technologies, it only makes sense to use CUDA and MPI together if your problem has extremely fine-grained parallelism and ONLY has extremely local data dependencies (or no dependencies at all, which is the best case). This really narrows down the space of problems that this technology can attack. On the other hand, it can still do some very cool things.

For people searching on the net, here is what I did to make my simple MPI and CUDA application:

kernel.cu:

[codebox]#include <stdio.h>

/* Element-wise addition: one thread per array element. */
__global__ void kernel(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

extern "C"
void run_kernel()
{
    int i, array1[6], array2[6], array3[6], *devarray1, *devarray2, *devarray3;

    for(i = 0; i < 6; i++)
    {
        array1[i] = i;
        array2[i] = 3 - i;
    }

    cudaMalloc((void**) &devarray1, sizeof(int)*6);
    cudaMalloc((void**) &devarray2, sizeof(int)*6);
    cudaMalloc((void**) &devarray3, sizeof(int)*6);

    cudaMemcpy(devarray1, array1, sizeof(int)*6, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, array2, sizeof(int)*6, cudaMemcpyHostToDevice);

    kernel<<<2, 3>>>(devarray1, devarray2, devarray3);   /* 2 blocks of 3 threads = 6 elements */

    cudaMemcpy(array3, devarray3, sizeof(int)*6, cudaMemcpyDeviceToHost);

    for(i = 0; i < 6; i++)
    {
        printf("%d ", array3[i]);
    }
    printf("\n");

    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}
[/codebox]

mpi.c:

[codebox]#include <mpi.h>

void run_kernel();   /* defined in kernel.cu */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init (&argc, &argv);                /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */

    run_kernel();

    MPI_Finalize();
    return 0;
}
[/codebox]

now, for compilation:

[codebox]$ nvcc -c kernel.cu

$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

[/codebox]

and running it:

[codebox]$ mpirun -l -np 10 ./mpicuda

1: 3 3 3 3 3 3

9: 3 3 3 3 3 3

8: 3 3 3 3 3 3

2: 3 3 3 3 3 3

7: 3 3 3 3 3 3

6: 3 3 3 3 3 3

0: 3 3 3 3 3 3

4: 3 3 3 3 3 3

5: 3 3 3 3 3 3

3: 3 3 3 3 3 3

[/codebox]

I don’t think the schedulers really know about GPUs yet, but you can at least set the GPUs to be ‘compute-exclusive’ (search the forums for this and ‘nvidia-smi’).
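For example, something along these lines puts a card into compute-exclusive mode (treat it as a sketch: the exact flag syntax depends on the nvidia-smi/driver version, and older versions take numeric mode codes instead):

[codebox]# Sketch: set GPU 0 to exclusive compute mode so only one process can use it at a time.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
[/codebox]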

It’s not a disaster at all. By your argument doing any parallel programming would be a disaster! You just use the GPUs to make each node faster. When they need to synchronise, they copy the relevant data back to their hosts, and the hosts do the necessary synchronisation over MPI. How well will this work? That’ll depend on how tightly coupled your problem is, and to what extent you want to (and can) pipeline operations on the GPU. Although the Cell is slightly different, check out the discussion of the Reverse Acceleration talk given at the NCSA Accelerators meeting.
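The pattern being described looks roughly like the sketch below; the function, the buffer names and the 1-D halo exchange are placeholders, not code from this thread.

[codebox]/* Sketch of the copy-back-and-synchronise pattern: local GPU work finishes,
   then the hosts exchange boundary data over MPI. All names are placeholders. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halo(double *d_boundary, double *d_ghost,
                   double *h_send, double *h_recv,
                   int n, int left, int right)
{
    size_t bytes = n * sizeof(double);

    /* own boundary layer from the GPU to the host */
    cudaMemcpy(h_send, d_boundary, bytes, cudaMemcpyDeviceToHost);

    /* hosts synchronise over MPI (send to the right neighbour, receive from the left) */
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, right, 0,
                 h_recv, n, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* neighbour's boundary becomes this rank's ghost layer on the GPU */
    cudaMemcpy(d_ghost, h_recv, bytes, cudaMemcpyHostToDevice);
}
[/codebox]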

Where exactly is the problem? We do “cudaSetDevice(rank%4)” for an S1070-based cluster all the time… (and that’s all that needs to be done); rank%2 for my workstation with a GTX 280 and an 8600 GT. Guess it’s easy because we do not parallelise hybridly. We are sure, however, that we get nodes exclusively to ourselves, so there is no hassle with scheduling systems.
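In code that amounts to something like the sketch below. It assumes you launch no more MPI ranks per node than there are GPUs in it (or that the cards are in exclusive mode), so that rank modulo the device count gives a sensible mapping.

[codebox]/* Sketch: pin each MPI rank to a GPU by rank modulo the number of devices. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, ndev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   /* e.g. rank%4 on an S1070, rank%2 on a 2-GPU workstation */

    /* ... per-rank CUDA work goes here ... */

    MPI_Finalize();
    return 0;
}
[/codebox]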

If you want to use GPUs with a queuing system, set the exclusive mode and then use them as consumable resources (similar to what you do with licenses).

I am trying to compile a CUDA + MPI code.

nvcc -c cudampi.cu -I/home/delta1/abc/NVIDIA_CUDA_SDK_2.1/common/inc

mpicc -o mpicuda cudampi.c cudampi.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

However, the resulting error is:

“/usr/bin/ld: cannot find -lcuda”

It’s kind of strange; I have done the above on the same computer before and it went through successfully last time.

Can anyone help me? Thanks a lot.

You can google this very easily.

Nonetheless, it seems to me that your ‘-lcudart’ should be on the nvcc line, not the mpicc line?

Nope. The nvcc line is only compiling, the mpicc line is linking. Libraries have to be provided to the command that performs the linking.

That error doesn’t make sense. The linker will only report not finding -lcuda if you specified it during linking. Are you sure that you transcribed the error message correctly? For reference, here is what compiling with MPI (this is MPICH2) and linking with the CUDA version 3.0 runtime should look like:

avidday@cuda:~$ nvcc -c -arch=sm_13 jacsub.cu -o jacsub.o

avidday@cuda:~$ mpicc -o testcudampi testcudampi.c jacsub.o -L/opt/cuda-3.0/lib64 -lcudart 

avidday@cuda:~$ ldd testcudampi

	linux-vdso.so.1 =>  (0x00007fff6cf32000)

	libcudart.so.3 => /opt/cuda-3.0/lib64/libcudart.so.3 (0x00007f013e3b6000)

	libpthread.so.0 => /lib/libpthread.so.0 (0x00007f013e19a000)

	librt.so.1 => /lib/librt.so.1 (0x00007f013df92000)

	libc.so.6 => /lib/libc.so.6 (0x00007f013dc20000)

	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f013d913000)

	libm.so.6 => /lib/libm.so.6 (0x00007f013d68e000)

	libdl.so.2 => /lib/libdl.so.2 (0x00007f013d48a000)

	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f013d272000)

	/lib64/ld-linux-x86-64.so.2 (0x00007f013e5f1000)

You can see that there is no dependency on the CUDA driver library (libcuda.so) in the executable…

A wild guess: maybe you need to use

-L /usr/local/cuda/lib64

?

Hi,

I am doing a project that has two compute nodes with GPUs inside and one frontend without a GPU. We are using Rocks Cluster 5.3 as our software and we installed CUDA on all three machines. We are implementing a math problem, using MPI to parallelize and CUDA on the GPUs. We are still learning MPI and CUDA, so just to get a sample program running we used your sample program posted above. When we run the code we get an error if we use -l in the command to execute mpicuda. The error we get with -l is:

[codebox]

[cudauser@frontend cudampi]$ nvcc -c kernel.cu

[cudauser@frontend cudampi]$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

[cudauser@frontend cudampi]$ mpirun -l -np 10 ./mpicuda


mpirun was unable to launch the specified application as it could not find an executable:

Executable: -l

Node: frontend.local

while attempting to start process rank 0.

[/codebox]

and if we don’t use -l in the execution command we get the following output, which is different from your output:

[codebox]

[cudauser@frontend cudampi]$ mpirun -np 10 ./mpicuda

-1 10874816 134513652 10876504 -1076555184 10820776

-1 10874816 134513652 10876504 -1076013248 10820776

-1 10874816 134513652 10876504 -1077478304 10820776

-1 10874816 134513652 10876504 -1081820496 10820776

-1 10874816 134513652 10876504 -1080191616 10820776

-1 10874816 134513652 10876504 -1078355488 10820776

-1 10874816 134513652 10876504 -1074501952 10820776

-1 10874816 134513652 10876504 -1077205840 10820776

-1 10874816 134513652 10876504 -1078776560 10820776

-1 10874816 134513652 10876504 -1075930256 10820776

[/codebox]

Can you please help us with this issue?