MPI and CUDA mixed programming

Has anyone thought about using MPI in conjunction with CUDA?

Assume either that the problem is straightforward to decompose for parallel execution, or that I have very fast switching available.

The answer to your question is yes. Was that the question you meant to ask though?

CUDA and MPI are orthogonal. You can use MPI to distribute work among multiple computers, each of which uses CUDA to execute its share of work. I’ve a tiny cluster, which I’m using to do just that.

Paulius

Sweet! I thought so too, but as I have no hardware right now (we are still going through the ordering process :( ) I cannot test it myself. I still have to convince those with the cash to do the right thing.

David

Would you be so kind as to share an example Makefile that compiles MPI source code integrated with CUDA?

I am also interested in this kind of program and would like a starting point for my tests.

Thanks.

Gabriel

Indeed, another way to paraphrase this is ‘use MPI for the coarse-grained parallelism among the nodes, and do whatever you want within the node by pushing the GPU(s) and the CPU cores to the limit’. The GPUs are programmed with CUDA. This is, by the way, a ‘minimally invasive’ approach, and I’ve been pursuing it with conventional offscreen GL rendering to accelerate things for a while now.

Things get funky if your parallel (in the MPI sense) application requires a lot of communication and if your local work isn’t decoupled enough. As an example, consider a parallel domain decomposition approach to a linear system solver. You’ll be hitting the PCIe bottleneck pretty fast, at least for sparse matrices. This is an active field of research, and I’ve published some papers on this.

Note that parallel linear system solving is something of a worst-case scenario; many other applications perform much more local work, so the data transfer from MPI across the PCIe interface to the CUDA device is not a prominent bottleneck.

To get back to the topic of this thread: Linking CUDA and MPI together is pretty trivial, at least it worked right out of the box for me :)

I started with an MPI program and “outsourced” the main computation parts onto the GPUs. Basically, I use one CPU per GPU. This works for multiple GPUs in one computer and/or multiple computers.

This was rather straightforward. I just used the makefiles from the examples and added the MPI library. That’s it.

best regards
Christian Mueller

You guys are saying that compiling mixed MPI and CUDA code is trivial and should work out of the box, but it seems I can’t figure it out.

I have this small Makefile which should compile a mixed MPI and CUDA code.

CC=nvcc

CFLAGS= -I/usr/local/mpich2-1.0.6.p1/include -I/usr/local/cuda/include -I/home/user/NVIDIA_CUDA_SDK/common/inc

LDFLAGS= -L/usr/local/mpich2-1.0.6.p1/lib -L/usr/local/cuda/lib -L/home/user/NVIDIA_CUDA_SDK/lib -L/home/user/NVIDIA_CUDA_SDK/common/lib

LIB= -lcuda -lcudart -lcutil -lm -lmpich -lpthread

SOURCES= Init.c main.c

EXECNAME= Exec

all:

        $(CC) -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)

But in fact I get these errors:

nvcc -o MatrixProduct CalcCirc.c Init.c main.c  -lcuda -lcudart -lcutil -lm -lmpich -lpthread -L/usr/local/mpich2-1.0.6.p1/lib -L/usr/local/cuda/lib -L/home/user/NVIDIA_CUDA_SDK/lib -L/home/user/NVIDIA_CUDA_SDK/common/lib -I/usr/local/mpich2-1.0.6.p1/include -I/usr/local/cuda/include -I/home/user/NVIDIA_CUDA_SDK/common/inc

In file included from Init.c:12:

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:123: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFilef’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:139: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFiled’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:155: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFilei’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:170: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFileui’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:186: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFileb’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:202: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutReadFileub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:216: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFilef’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:230: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFiled’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:242: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFilei’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:254: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFileui’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:266: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFileb’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:278: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutWriteFileub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:294: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:307: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPPMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:321: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPPM4ub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:337: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMi’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:353: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMs’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:368: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutLoadPGMf’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:380: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:392: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePPMub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:405: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePPM4ub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:417: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMi’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:429: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMs’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:441: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutSavePGMf’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:462: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCheckCmdLineFlag’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:476: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumenti’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:490: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumentf’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:504: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumentstr’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:519: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutGetCmdLineArgumentListstr’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:533: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCheckCondition’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:545: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutComparef’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:558: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutComparei’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:571: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCompareub’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:585: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutComparefe’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:600: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCompareL2fe’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:613: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutCreateTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:622: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutDeleteTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:630: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutStartTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:638: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutStopTimer’

/home/user/NVIDIA_CUDA_SDK/common/inc/cutil.h:646: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cutResetTimer’

Init.c: In function ‘CUDAInit’:

Init.c:62: error: ‘cudaDeviceProp’ undeclared (first use in this function)

Init.c:62: error: (Each undeclared identifier is reported only once

Init.c:62: error: for each function it appears in.)

Init.c:62: error: expected ‘;’ before ‘deviceProp’

Init.c:63: error: ‘deviceProp’ undeclared (first use in this function)

Init.c:75: error: expected ‘;’ before ‘}’ token

Init.c:120: error: expected declaration or statement at end of input

make: *** [all] Error 255

Does anyone have any idea why?

Thanks

OK, the errors above turned out to be my fault: I hadn’t renamed the file containing the CUDA code with the .cu extension, so nvcc wasn’t preprocessing it and was treating it as normal C code, passing it directly to gcc.

Now I encounter another problem during my compile process.

In mpicxx.h I receive errors saying that support for exception handling is disabled.

Running nvcc with the -v switch, I see that cudafe is called with the parameter “--no_exception”.

How can I change this parameter and tell cudafe not to use it?

Try adding

--host-compilation=C++

to your nvcc call. If you have C++ code in your .cu, you’ll need it.
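For anyone hitting the same two problems: here is a sketch of how the Makefile from the earlier post might look once the CUDA-containing file is renamed to .cu and the flag above is added. The MPICH/CUDA paths and library names are simply copied from that post and are assumptions you will need to adapt.

[codebox]# Sketch only -- the MPICH and CUDA paths below must be adapted to your installation.
NVCC      = nvcc
NVCCFLAGS = --host-compilation=C++ -I/usr/local/mpich2-1.0.6.p1/include
LDFLAGS   = -L/usr/local/mpich2-1.0.6.p1/lib -L/usr/local/cuda/lib
LIBS      = -lcudart -lmpich -lpthread -lm

EXECNAME  = Exec
OBJS      = Init.o main.o

all: $(OBJS)
	$(NVCC) -o $(EXECNAME) $(OBJS) $(LDFLAGS) $(LIBS)

# Files containing CUDA code must use the .cu extension so nvcc preprocesses them.
%.o: %.cu
	$(NVCC) $(NVCCFLAGS) -c $< -o $@

%.o: %.c
	$(NVCC) $(NVCCFLAGS) -c $< -o $@
[/codebox]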

When I use mpich-1.2.7, everything works fine at compile and link time,

but when I run the program, cudaMalloc does not work after MPI_Init.

So I changed to mpich2 and now I get a new error:

error: expected an identifier

Do you have any hints?

Thanks in advance.

I have just successfully run an MPI program that uses CUDA. I set it up so that I have a 2-node cluster, with a combined total of 10 CPUs. I launched 10 instances of MPI (1 per CPU) using mpdboot and a host file. My program is a simple MPI program that executes 1 process per CPU, and each of those processes calls a CUDA kernel.

This works because each of the CUDA cards in the cluster supports the “Default” compute mode, where a single card can be used simultaneously by multiple processes. However, running more kernels than there are CUDA cards is probably not the best practice.
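(As an aside, and only as a sketch assuming a CUDA 2.2 or later runtime, which exposes the compute mode in cudaDeviceProp, you can check which mode a card is in programmatically.)

[codebox]/* Sketch: query the compute mode of device 0 (requires CUDA 2.2+). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute mode: %s\n",
           prop.computeMode == cudaComputeModeDefault   ? "Default (shared)" :
           prop.computeMode == cudaComputeModeExclusive ? "Exclusive"        :
                                                          "Prohibited");
    return 0;
}
[/codebox]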

Also, there are a total of 5 CUDA cards in these 2 nodes, and as it stands all the processes running on the same host will execute on the same card, leaving all secondary and tertiary cards idle.

This makes me wonder if there is a CUDA-aware Torque / Maui module or plug-in out there that can handle launching only as many kernels on a box as it has cards, and setting the active CUDA card differently for each process on a specific box. It could possibly alter the data used in invoking the kernel so as to tune the program for the different cards. This may not be a good idea, however, as it gets into how to actually split up the job, which is historically the job of the programmer and not the tools.

Also, MPI supports different processes talking to each other and syncing on each other, which is currently impossible to do with CUDA. This is due to the fact that not all blocks in a specific CUDA program are executing simultaneously if, say, there are more blocks that need to run than there are multiprocessors available to run the blocks. So, the blocks themselves can’t talk to each other on the same host, and the threads within a block certainly cannot talk to a thread on a different host through some sort of MPI call. “Fixing” this problem is probably impossible, as it is tied pretty directly to the architecture of the graphics chip. Plus, GPUs don’t know anything about network cards, so the GPU would have to notify the CPU to run an MPI call and then wait for the network lag, and the whole thing would be a disaster.

So, I guess what I’m trying to say is that with the current technologies, it only makes sense to use CUDA and MPI together if your problem has extremely fine-grained parallelism and ONLY has extremely local data dependencies (or no dependencies at all, which is the best case). This really narrows down the space of problems that this technology can attack. On the other hand, it can still do some very cool things.

For people searching on the net, here is what I did to make my simple MPI and CUDA application:

kernel.cu:

[codebox]#include <stdio.h>

/* Element-wise addition: one thread per array element. */
__global__ void kernel(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

extern "C"
void run_kernel()
{
    int i, array1[6], array2[6], array3[6], *devarray1, *devarray2, *devarray3;

    for(i = 0; i < 6; i++)
    {
        array1[i] = i;
        array2[i] = 3 - i;
    }

    cudaMalloc((void**) &devarray1, sizeof(int)*6);
    cudaMalloc((void**) &devarray2, sizeof(int)*6);
    cudaMalloc((void**) &devarray3, sizeof(int)*6);

    cudaMemcpy(devarray1, array1, sizeof(int)*6, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, array2, sizeof(int)*6, cudaMemcpyHostToDevice);

    kernel<<<2, 3>>>(devarray1, devarray2, devarray3);   /* 2 blocks of 3 threads = 6 elements */

    cudaMemcpy(array3, devarray3, sizeof(int)*6, cudaMemcpyDeviceToHost);

    for(i = 0; i < 6; i++)
    {
        printf("%d ", array3[i]);
    }
    printf("\n");

    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}
[/codebox]

mpi.c:

[codebox]#include <mpi.h>

void run_kernel();   /* defined in kernel.cu */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init (&argc, &argv);                /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */

    run_kernel();

    MPI_Finalize();
    return 0;
}
[/codebox]

now, for compilation:

[codebox]$ nvcc -c kernel.cu

$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

[/codebox]

and running it:

[codebox]$ mpirun -l -np 10 ./mpicuda

1: 3 3 3 3 3 3

9: 3 3 3 3 3 3

8: 3 3 3 3 3 3

2: 3 3 3 3 3 3

7: 3 3 3 3 3 3

6: 3 3 3 3 3 3

0: 3 3 3 3 3 3

4: 3 3 3 3 3 3

5: 3 3 3 3 3 3

3: 3 3 3 3 3 3

[/codebox]

I don’t think the schedulers really know about GPUs yet, but you can at least set the GPUs to be ‘compute-exclusive’ (search the forums for this and ‘nvidia-smi’).
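For example, something along these lines puts a card into compute-exclusive mode (treat it as a sketch: the exact flag syntax depends on the nvidia-smi/driver version, and older versions take numeric mode codes instead):

[codebox]# Sketch: set GPU 0 to exclusive compute mode so only one process can use it at a time.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
[/codebox]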

It’s not a disaster at all. By your argument doing any parallel programming would be a disaster! You just use the GPUs to make each node faster. When they need to synchronise, they copy the relevant data back to their hosts, and the hosts do the necessary synchronisation over MPI. How well will this work? That’ll depend on how tightly coupled your problem is, and to what extent you want to (and can) pipeline operations on the GPU. Although the Cell is slightly different, check out the discussion of the Reverse Acceleration talk given at the NCSA Accelerators meeting.
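The pattern being described looks roughly like the sketch below; the function, the buffer names and the 1-D halo exchange are placeholders, not code from this thread.

[codebox]/* Sketch of the copy-back-and-synchronise pattern: local GPU work finishes,
   then the hosts exchange boundary data over MPI. All names are placeholders. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halo(double *d_boundary, double *d_ghost,
                   double *h_send, double *h_recv,
                   int n, int left, int right)
{
    size_t bytes = n * sizeof(double);

    /* own boundary layer from the GPU to the host */
    cudaMemcpy(h_send, d_boundary, bytes, cudaMemcpyDeviceToHost);

    /* hosts synchronise over MPI (send to the right neighbour, receive from the left) */
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, right, 0,
                 h_recv, n, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* neighbour's boundary becomes this rank's ghost layer on the GPU */
    cudaMemcpy(d_ghost, h_recv, bytes, cudaMemcpyHostToDevice);
}
[/codebox]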

Where exactly is the problem? We do “cudaSetDevice(rank%4)” for an S1070-based cluster all the time… (and that’s all that needs to be done); rank%2 for my workstation with a GTX 280 and an 8600 GT. Guess it’s easy because we do not parallelise hybridly. We are sure, however, that we get nodes exclusively to ourselves, so there is no hassle with scheduling systems.
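In code that amounts to something like the sketch below. It assumes you launch no more MPI ranks per node than there are GPUs in it (or that the cards are in exclusive mode), so that rank modulo the device count gives a sensible mapping.

[codebox]/* Sketch: pin each MPI rank to a GPU by rank modulo the number of devices. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, ndev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   /* e.g. rank%4 on an S1070, rank%2 on a 2-GPU workstation */

    /* ... per-rank CUDA work goes here ... */

    MPI_Finalize();
    return 0;
}
[/codebox]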

If you want to use GPUs with a queuing system, set the exclusive mode and then use them as consumable resources (similar to what you do with licenses).

I am trying to compile a CUDA + MPI code.

nvcc -c cudampi.cu -I/home/delta1/abc/NVIDIA_CUDA_SDK_2.1/common/inc

mpicc -o mpicuda cudampi.c cudampi.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

However, the resulting error is:

“/usr/bin/ld: cannot find -lcuda”

It’s kind of strange; I have done the above on the same computer before and it went through successfully last time.

Can anyone help me? Thanks a lot.

You can google this very easily.

Nonetheless, it seems to me that your ‘-lcudart’ should be on the nvcc line, not the mpicc line?

Nope. The nvcc line is only compiling, the mpicc line is linking. Libraries have to be provided to the command that performs the linking.

That error doesn’t make sense. The linker will only report not finding -lcuda if you specified it during linking. Are you sure that you transcribed the error message correctly? For reference, here is what compiling with MPI (this is MPICH2) and linking with the CUDA version 3.0 runtime should look like:

avidday@cuda:~$ nvcc -c -arch=sm_13 jacsub.cu -o jacsub.o

avidday@cuda:~$ mpicc -o testcudampi testcudampi.c jacsub.o -L/opt/cuda-3.0/lib64 -lcudart 

avidday@cuda:~$ ldd testcudampi

	linux-vdso.so.1 =>  (0x00007fff6cf32000)

	libcudart.so.3 => /opt/cuda-3.0/lib64/libcudart.so.3 (0x00007f013e3b6000)

	libpthread.so.0 => /lib/libpthread.so.0 (0x00007f013e19a000)

	librt.so.1 => /lib/librt.so.1 (0x00007f013df92000)

	libc.so.6 => /lib/libc.so.6 (0x00007f013dc20000)

	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f013d913000)

	libm.so.6 => /lib/libm.so.6 (0x00007f013d68e000)

	libdl.so.2 => /lib/libdl.so.2 (0x00007f013d48a000)

	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f013d272000)

	/lib64/ld-linux-x86-64.so.2 (0x00007f013e5f1000)

You can see that there is no dependency on the CUDA driver library (libcuda.so) in the executable…

A wild guess: maybe you need to use

-L /usr/local/cuda/lib64

?

Hi,

I am doing a project that has two compute nodes with GPUs inside and one frontend without a GPU. We are using Rocks Cluster 5.3 as our software and we installed CUDA on all three machines. We are implementing a math problem, using MPI to parallelize and CUDA on the GPUs. We are still learning MPI and CUDA, so just to get a sample program running we used your sample program posted above. When we run the code we get an error if we use -l in the command to execute mpicuda. The error we get with -l is:

[codebox]

[cudauser@frontend cudampi]$ nvcc -c kernel.cu

[cudauser@frontend cudampi]$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

[cudauser@frontend cudampi]$ mpirun -l -np 10 ./mpicuda


mpirun was unable to launch the specified application as it could not find an executable:

Executable: -l

Node: frontend.local

while attempting to start process rank 0.

[/codebox]

and if we don’t use -l in the execution command we get the following output, which is different from your output:

[codebox]

[cudauser@frontend cudampi]$ mpirun -np 10 ./mpicuda

-1 10874816 134513652 10876504 -1076555184 10820776

-1 10874816 134513652 10876504 -1076013248 10820776

-1 10874816 134513652 10876504 -1077478304 10820776

-1 10874816 134513652 10876504 -1081820496 10820776

-1 10874816 134513652 10876504 -1080191616 10820776

-1 10874816 134513652 10876504 -1078355488 10820776

-1 10874816 134513652 10876504 -1074501952 10820776

-1 10874816 134513652 10876504 -1077205840 10820776

-1 10874816 134513652 10876504 -1078776560 10820776

-1 10874816 134513652 10876504 -1075930256 10820776

[/codebox]

Can you please help us with this issue?