MPI and CUDA mixed programming

Hi,

I am doing a project that has two compute nodes with GPUs inside and one frontend without a GPU in it. We are using Rocks Cluster 5.3 as our software, and we installed CUDA on all three machines. We are implementing a math problem, using MPI to parallelize and CUDA on the GPUs. We are still learning MPI and CUDA, so just to have a sample program to run we used your sample program posted above. When we run the code we get an error if we use -l in the command that executes mpicuda. The error we get with -l is:

[codebox]

[cudauser@frontend cudampi]$ nvcc -c kernel.cu

[cudauser@frontend cudampi]$ mpicc -o mpicuda mpi.c kernel.o -lcudart -L /usr/local/cuda/lib -I /usr/local/cuda/include

[cudauser@frontend cudampi]$ mpirun -l -np 10 ./mpicuda


mpirun was unable to launch the specified application as it could not find an executable:

Executable: -l

Node: frontend.local

while attempting to start process rank 0.

[/codebox]

and if we don't use -l in the execution command, we get the following output, which is different from your output:

[codebox]

[cudauser@frontend cudampi]$ mpirun -np 10 ./mpicuda

-1 10874816 134513652 10876504 -1076555184 10820776

-1 10874816 134513652 10876504 -1076013248 10820776

-1 10874816 134513652 10876504 -1077478304 10820776

-1 10874816 134513652 10876504 -1081820496 10820776

-1 10874816 134513652 10876504 -1080191616 10820776

-1 10874816 134513652 10876504 -1078355488 10820776

-1 10874816 134513652 10876504 -1074501952 10820776

-1 10874816 134513652 10876504 -1077205840 10820776

-1 10874816 134513652 10876504 -1078776560 10820776

-1 10874816 134513652 10876504 -1075930256 10820776

[/codebox]

Can you please help us with this issue?
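One thing that may help narrow this down: the -1 values look like results that were never written, which is what you could see if CUDA calls fail silently on a rank (for example on the frontend, which has no GPU). Below is a minimal sketch, not the posted sample program, in which each rank checks for a usable device and checks every CUDA return code; the assumed structure (some per-rank cudaMalloc and kernel launch) is only a guess at what the sample does.

[codebox]
/* Sketch only -- not the posted sample.  Each MPI rank checks that a
 * CUDA device is actually visible before doing any GPU work, so a rank
 * scheduled on a node without a GPU fails loudly instead of printing
 * uninitialized values. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, devices = 0;
    cudaError_t err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    err = cudaGetDeviceCount(&devices);
    if (err != cudaSuccess || devices == 0) {
        fprintf(stderr, "rank %d: no usable CUDA device (%s)\n",
                rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... per-rank cudaMalloc / kernel launch / cudaMemcpy would go here,
     * with every return code checked the same way ... */

    MPI_Finalize();
    return 0;
}
[/codebox]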


If you only have GPUs older than the GF100, then that may be an adequate approach: treat the GPUs as resources that cannot be shared or sized (very much). But see below.

Fair warning: the fact that I am from Psi Lambda LLC does not make any of the following less true. (I implemented and ran the majority of the analysis of the first complete genome of a cancer patient, and I used that experience of large-scale cluster computing in the design requirements of the Kappa library.)

For GF100 (Fermi) GPUs this is not all true. Just because NVIDIA does not ship any examples of fully loading the GPU with concurrent kernels does not mean it is impossible; the Kappa Library does it. Moreover, there is an example (look at the version 1.3 features announcement) on the psilambda.com website showing how to set up dynamically scaled computing: replace the calls that read from a SQL data source with calls to some other, more MPI-load-scheduling-aware source (unless a SQL data source is your MPI load-scheduling source). The Kappa library only synchronizes as dictated by data flow and, more locally, by CUDA or OpenMP parallel regions.
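(For readers who want the underlying CUDA mechanism rather than the Kappa API: concurrent kernel execution on Fermi is exposed through streams. The sketch below is plain CUDA runtime code, nothing Kappa-specific, and the kernel is just a made-up placeholder.)

[codebox]
/* Minimal sketch of concurrent kernels on a Fermi-class GPU using plain
 * CUDA streams (not the Kappa API, just the runtime feature discussed
 * above).  Kernels launched into different streams may overlap on GF100. */
#include <cuda_runtime.h>

__global__ void busy(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            x[i] = x[i] * 1.000001f + 0.5f;
}

int main(void)
{
    const int n = 1 << 16, nstreams = 4;
    cudaStream_t streams[nstreams];
    float *buf[nstreams];

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaMemset(buf[s], 0, n * sizeof(float));
    }

    /* Independent launches in separate streams can run concurrently. */
    for (int s = 0; s < nstreams; ++s)
        busy<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}
[/codebox]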

Since the Kappa library allows concurrent execution of any set of batches of kernels, allows asynchronous SQL and CPU kernel calls, allows an index-component notation for sizing and scheduling the CPU/GPU kernel operations, and allows dynamic sizing of kernel launches and memory allocations (with those indexes and dynamic parameters also available as kernel arguments), there is no reason not to get full dynamic use out of GPU clusters. If your task consists of categories with equal-sized chunks (and the hardware is homogeneous), then it is easy. If not, then you have more work to do to get the right mix, but it is still very possible. Please note that, because of the way the Kappa library scheduler works (producer/consumer data flow), independent jobs may be submitted to the same GPU/CPU and run concurrently; the only trick is not to get bottlenecked on the GPU/CPU/memory of a cluster node, which is where MPI comes in anyway. The Kappa library makes attributes of the GPUs and kernels fully available for runtime sizing calculations.

The Psi Lambda LLC Kappa Library Process objects are one per host thread per GPU, so for multi-GPU operation you would use MPI to coordinate the Kappa Library Process objects regardless of whether they are on the same host. If you are only interested in the GPUs, then effectively each Process object becomes a cluster node. Realistically, it will be more complicated, since each Process object also offers CPU and host memory resources. Process objects on the same host can also all live within the same program and therefore have lower transfer latency than Process objects on other hosts.
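(Again, as a plain MPI + CUDA illustration rather than the Kappa Process API: a common way to give each MPI rank its own GPU is to map ranks onto devices, as in the sketch below. The round-robin mapping on the global rank is an assumption made to keep the sketch short; a real setup would map the rank local to each host.)

[codebox]
/* Sketch: bind each MPI rank to a GPU on its node (plain MPI + CUDA
 * runtime, not the Kappa Process object API). */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, devices = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&devices);
    if (devices > 0) {
        /* Simple round-robin on the global rank; a production setup
         * would use the host-local rank instead. */
        cudaSetDevice(rank % devices);
        printf("rank %d -> GPU %d of %d\n", rank, rank % devices, devices);
    } else {
        fprintf(stderr, "rank %d: no GPU on this node\n", rank);
    }

    /* GPU work for this rank goes here; MPI handles coordination and
     * data exchange between ranks/GPUs, on the same host or across hosts. */

    MPI_Finalize();
    return 0;
}
[/codebox]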