Error while running a sample MPI program with CUDA inside it.


I am working on a project that has two compute nodes with GPUs inside and one frontend without a GPU. We use Rocks cluster 5.3 to manage the cluster, and we installed CUDA on all three machines. We are implementing a math problem, using MPI to parallelize it and CUDA on the GPUs. We are still learning MPI and CUDA, so as a first test we used the sample program from the post below. We don't have the GPUs yet, so for now we are running in emulator mode. When we run the code with the -l option to mpirun, we get this error:


[cudauser@frontend cudampi]$ nvcc -c kernel.cu
[cudauser@frontend cudampi]$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include
[cudauser@frontend cudampi]$ mpirun -l -np 10 ./mpicuda

mpirun was unable to launch the specified application as it could not find an executable:

Executable: -l
Node: frontend.local

while attempting to start process rank 0.

If we don't use -l in the command, we get the following output, which is different from the output in the original post:


[cudauser@frontend cudampi]$ mpirun -np 10 ./mpicuda
-1 10874816 134513652 10876504 -1076555184 10820776
-1 10874816 134513652 10876504 -1076013248 10820776
-1 10874816 134513652 10876504 -1077478304 10820776
-1 10874816 134513652 10876504 -1081820496 10820776
-1 10874816 134513652 10876504 -1080191616 10820776
-1 10874816 134513652 10876504 -1078355488 10820776
-1 10874816 134513652 10876504 -1074501952 10820776
-1 10874816 134513652 10876504 -1077205840 10820776
-1 10874816 134513652 10876504 -1078776560 10820776
-1 10874816 134513652 10876504 -1075930256 10820776

Can anyone please help me with this issue?

The post we use is this one:

QUOTE (Litherum @ Jun 29 2009, 10:52 AM)
I have just successfully run an MPI program that uses CUDA. I set it up so that I have a 2-node cluster, with a combined total of 10 CPUs. I launched 10 instances of MPI (1 per CPU) using mpdboot and a host file. My program is a simple MPI program that executes 1 process per CPU, and each of those processes calls a CUDA kernel.

This works because each of the CUDA cards in the cluster supports the “Default” compute mode, where a single card can be used simultaneously by multiple processes. However, running more kernels than there are CUDA cards is probably not the best practice.

Also, there are a total of 5 CUDA cards in these 2 nodes, and as it stands all the processes running on the same host will execute on the same card, leaving all secondary and tertiary cards idle.

This makes me wonder if there is a CUDA-aware Torque / Maui module or plug-in out there that can handle launching only as many kernels on a box as it has cards, and setting the active CUDA card differently for each process on a specific box. It could possibly alter the data used in invoking the kernel so as to tune the program for the different cards. This may not be a good idea, however, as it gets into how to actually split up the job, which is historically the job of the programmer and not the tools.
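The per-process card selection described above can be sketched in plain C. This is only an illustration under stated assumptions: `pick_device` and `processes_on_device` are hypothetical helper names, not part of MPI or CUDA, and the "local rank" (the index of a process among the processes on its own host) is assumed to be supplied by the launcher or computed by the program.

```c
#include <stdio.h>

/* Hypothetical helper: map a process's local rank (its index among the
 * processes on the same host) to a device index, round-robin.
 * Returns -1 if the host has no devices or the rank is invalid. */
int pick_device(int local_rank, int device_count)
{
    if (device_count <= 0 || local_rank < 0)
        return -1;
    return local_rank % device_count;
}

/* Hypothetical helper: count how many of the local_procs processes on a
 * host end up sharing the given device under the mapping above. */
int processes_on_device(int device, int local_procs, int device_count)
{
    int n = 0;
    for (int r = 0; r < local_procs; r++)
        if (pick_device(r, device_count) == device)
            n++;
    return n;
}
```

In the sample program, the result would be passed to the CUDA runtime call `cudaSetDevice()` before the first `cudaMalloc()`, with the device count taken from `cudaGetDeviceCount()`. How the local rank is obtained depends on the MPI implementation and is left out here.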

Also, MPI supports different processes talking to each other and syncing on each other, which is currently impossible to do with CUDA. This is due to the fact that not all blocks in a specific CUDA program are executing simultaneously if, say, there are more blocks that need to run than there are multiprocessors available to run the blocks. So, the blocks themselves can’t talk to each other on the same host, and the threads within a block certainly cannot talk to a thread on a different host through some sort of MPI call. “Fixing” this problem probably is impossible as it is tied pretty directly to the architecture of the graphics chip. Plus, GPUs don’t know anything about network cards, so it would have to notify the CPU to run an MPI call, and then wait for the network lag, and the whole thing would be a disaster.

So, I guess what I’m trying to say is that with the current technologies, it only makes sense to use CUDA and MPI together if your problem has extremely fine-grained parallelism and ONLY has extremely local data dependencies (or no dependencies at all, which is the best case). This really narrows down the space of problems that this technology can attack. On the other hand, it can still do some very cool things.

For people searching on the net, here is what I did to make my simple MPI and CUDA application:
#include <stdio.h>

/* CUDA source, compiled to kernel.o by nvcc */

/* Each thread adds one pair of elements. */
__global__ void kernel(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

/* extern "C" so that the C code in mpi.c can link against this function */
extern "C"
void run_kernel()
{
    int i, array1[6], array2[6], array3[6], *devarray1, *devarray2, *devarray3;

    for(i = 0; i < 6; i++)
    {
        array1[i] = i;
        array2[i] = 3-i;
    }

    cudaMalloc((void**) &devarray1, sizeof(int)*6);
    cudaMalloc((void**) &devarray2, sizeof(int)*6);
    cudaMalloc((void**) &devarray3, sizeof(int)*6);

    cudaMemcpy(devarray1, array1, sizeof(int)*6, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, array2, sizeof(int)*6, cudaMemcpyHostToDevice);

    /* 2 blocks of 3 threads: one thread per array element */
    kernel<<<2, 3>>>(devarray1, devarray2, devarray3);

    cudaMemcpy(array3, devarray3, sizeof(int)*6, cudaMemcpyDeviceToHost);

    for(i = 0; i < 6; i++)
        printf("%d ", array3[i]);
    printf("\n");

    /* release the device buffers */
    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}



/* mpi.c */
#include <mpi.h>

void run_kernel();

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init (&argc, &argv);        /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */
    run_kernel();                   /* each process launches the CUDA kernel */
    MPI_Finalize();                 /* shuts down MPI */
    return 0;
}


now, for compilation:

$ nvcc -c kernel.cu
$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

and running it:

$ mpirun -l -np 10 ./mpicuda
1: 3 3 3 3 3 3
9: 3 3 3 3 3 3
8: 3 3 3 3 3 3
2: 3 3 3 3 3 3
7: 3 3 3 3 3 3
6: 3 3 3 3 3 3
0: 3 3 3 3 3 3
4: 3 3 3 3 3 3
5: 3 3 3 3 3 3
3: 3 3 3 3 3 3
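
The row of 3s in each line follows from the index arithmetic in the kernel: 2 blocks of 3 threads cover indices 0 to 5, and each element is i + (3 - i) = 3. A minimal host-side sketch of that arithmetic in plain C, with no GPU involved; `emulate_kernel` and `run_emulated` are hypothetical names introduced only for this illustration:

```c
#include <stdio.h>

#define N 6

/* Host-side emulation of the <<<2, 3>>> launch: walk the same
 * block/thread grid sequentially and apply the kernel body. */
void emulate_kernel(int grid_dim, int block_dim,
                    const int *array1, const int *array2, int *array3)
{
    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            /* same formula as blockIdx.x * blockDim.x + threadIdx.x */
            int index = block * block_dim + thread;
            array3[index] = array1[index] + array2[index];
        }
    }
}

/* Fill the inputs the same way run_kernel() does and store the sums in out. */
void run_emulated(int out[N])
{
    int array1[N], array2[N];

    for (int i = 0; i < N; i++) {
        array1[i] = i;
        array2[i] = 3 - i;
    }
    emulate_kernel(2, 3, array1, array2, out);
}
```

Printing `out` after `run_emulated(out)` gives six 3s. If the real program prints other values, that would suggest array3 was never filled by the device-to-host copy, so checking the return values of the CUDA calls is a reasonable next step.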