About running CUDA on a GPU cluster

Did you run your job on eight nodes or with eight MPI processes (-np 8)? Given the cluster topology, my guess now is that the admin configured OpenMPI to run two MPI processes per node, and the second process of your code on each node is failing due to the exclusivity conflict described by avidday.
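You can check this directly. Here's a quick sketch (untested, assuming the CUDA runtime API is available on the nodes) that prints each GPU's compute mode; a value of 1 means exclusive, which would make a second context on the same GPU fail:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* computeMode: 0 = default, 1 = exclusive, 2 = prohibited */
        printf("GPU %d (%s): computeMode = %d\n", i, prop.name, prop.computeMode);
    }
    return 0;
}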

I’ve tried running it on 8 nodes, 4 nodes, and 2 nodes. Each node contains a 4-core CPU and 2 Tesla GPUs. All configurations hit this error. What I described in this thread is the 8-node case.

When you say “8 nodes”, do you mean 4 physical hosts with 8 GPUs on which you launch 8 MPI processes, or something else?

8 physical nodes, so there are 16 GPUs. Could that cause trouble? I’ve also tried the case of 4 physical hosts with 8 GPUs. Same error…

This is probably a GPU affinity problem. You need to make sure that each MPI process is really using a unique GPU.
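The usual fix is to pin each process to a GPU before the first CUDA call. A minimal sketch (assuming by-slot rank placement, i.e. the two ranks sharing a node have consecutive rank numbers; with by-node placement you'd need the node-local rank instead):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, ndev, dev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);   /* 2 on your nodes */

    /* Each of the (up to) ndev ranks on a node claims a distinct GPU.
       This must happen before any call that creates a context. */
    cudaSetDevice(rank % ndev);

    cudaGetDevice(&dev);
    printf("rank %d using GPU %d\n", rank, dev);

    /* ... cublas_init and the rest of the program go here ... */

    MPI_Finalize();
    return 0;
}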

Here is the test result. I built a very simple MPI+CUDA program which simply does cublas_zgemm. I ran it with 8 processes on 4 nodes, and it passed. Then I ran my original program on the same nodes with 8 processes. There was an error when calling cublas_set_matrix. The CUDA routines it calls are exactly the same as in the successful program…
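To get a more specific failure code than a generic error, I could wrap every call and print its status. A sketch in C (using the legacy cublas.h API, which I assume is the C analogue of the Fortran cublas_* wrappers I'm calling):

#include <cublas.h>
#include <stdio.h>
#include <stdlib.h>

static void check(cublasStatus st, const char *what)
{
    if (st != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "%s failed with status %d\n", what, (int)st);
        exit(1);
    }
}

int main(void)
{
    const int n = 64;
    cuDoubleComplex *hA = calloc(n * n, sizeof(*hA));
    cuDoubleComplex *dA;

    check(cublasInit(), "cublasInit");   /* implicit context created here */
    check(cublasAlloc(n * n, sizeof(*dA), (void **)&dA), "cublasAlloc");

    /* The call that fails in my original program. */
    check(cublasSetMatrix(n, n, sizeof(*hA), hA, n, dA, n), "cublasSetMatrix");

    cublasFree(dA);
    cublasShutdown();
    free(hA);
    printf("all cublas calls succeeded\n");
    return 0;
}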

P.S. I didn’t create a CUDA context explicitly in either program; cublas_init was called instead. The interactions between CPU and GPU look the same in both cases. Why did one succeed while the other failed…
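Could the context be the problem after all? If cublas_init attaches its implicit context to whichever device is current (my assumption from the docs), then every process that never calls cudaSetDevice ends up on GPU 0, and the second process per node would collide in exclusive mode. A sketch of the ordering I should probably try:

#include <mpi.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(int argc, char **argv)
{
    int rank, ndev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);

    /* Select the device BEFORE cublas_init: the implicit context it
       creates binds to whatever device is current at that moment. */
    cudaSetDevice(rank % ndev);   /* assumes by-slot rank placement */
    cublasInit();

    /* ... cublas_set_matrix / cublas_zgemm as before ... */

    cublasShutdown();
    MPI_Finalize();
    return 0;
}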