About running CUDA on a GPU cluster

I’m currently running an MPI program which calls a few CUDA routines on a GPU cluster, but there are unexpected runtime errors, like:
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x8
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x8

I’ve debugged the program. It seems that these errors resulted from the call to the routine ‘cublas_set_matrix’ (I’m not 100% sure). I’ve run the same program on a single-node server with one GPU and there was no error. The current running environment is a cluster which I am not that familiar with. What I can tell is that I’m using OpenMPI, and each node in the cluster has 4 cores and 2 GPUs. I’m at a loss now. Can anyone give me hints on the source of these errors?

Is it a must that I assign threads to specific GPUs explicitly?

These people had a successful resolution to a similar issue. Basically, the shared libraries on their cluster were not configured the same on the compute nodes as they were on the head node, and a reinstall of Mesa fixed the problem.

Many thanks! BTW, would you please give the URL of that discussion?

Indeed I knew nothing about Mesa… Is it possible to reinstall it without root permission?

http://www.paraview.org/pipermail/paraview…ary/015635.html

Yes, it is. However, this may or may not have anything to do with your specific problem. My recommendation is to read through that thread to see how they went about diagnosing their error (running with valgrind, etc.). I would also try enlisting the help of the people who configured CUDA on your cluster.

I’ve asked the administrator of the cluster, and I was told that CUDA on the cluster is installed in a shared directory. Does that mean the error did not result from an improper configuration? Is there a possibility that a boundary violation caused this?

For your convenience, here is more of the error message:

[c0219:26117] *** Process received signal ***
[c0219:26117] Signal: Segmentation fault (11)
[c0219:26117] Signal code: Address not mapped (1)
[c0219:26117] Failing at address: 0x8
[c0218:26574] *** Process received signal ***
[c0214:14338] *** Process received signal ***
[c0218:26574] Signal: Segmentation fault (11)
[c0218:26574] Signal code: Address not mapped (1)
[c0218:26574] Failing at address: 0x8
[c0209:12115] *** Process received signal ***
[c0219:26117] [ 0] /lib64/libpthread.so.0 [0x2b64f8f594c0]
[c0219:26117] [ 1] /export/cuda/lib64/libcudart.so.2(cudaMemcpy+0x9e) [0x2b64f96da56e]
[c0219:26117] [ 2] /export/cuda/lib64/libcublas.so.2(cublasSetMatrix+0x14f) [0x2b64f5b457ff]
[c0219:26117] [ 3] ./gpu(cublas_set_matrix_+0x62) [0x51614e]

I ran the program on 8 nodes, and this error occurred on 4 of them.

That is almost certainly a memory management error in your MPI code. Your MPI runtime (it looks like OpenMPI from the form of the error messages) is telling you that your code is trying to access a memory address which is outside of its valid allocation. Your MPI processes are clearly getting an invalid address (the 0x8) from somewhere; that sort of value is clearly neither a host nor a GPU address.

Time to start looking more closely at your code.

But the program runs through fine in a single-node, single-GPU environment. Before I tested it on the cluster, everything was fine on my local CentOS 5.4 system.

In fact, I’m trying to replace the zgemm routine in the original code with cublas_zgemm, and the original version also ran well. My modifications to the code only involve simple routines such as cudaInit and cudaAlloc. The error message says that the call to cublas_set_matrix caused the error. To me it seems there is no problem with my code. And even if there were, since the routine is called by all 8 threads at the same time, it is strange that only 4 of them hit the error.

That’s why I’m so freaked out…

That doesn’t mean anything in the context of MPI programming. If you run on 1 node, your code is only ever running inside a single process’s memory space, and any “accidental” memory usage errors aren’t visible.

Right. And to call that routine you need to provide a host memory pointer and a GPU memory pointer. One of them is wrong. The MPI error trap shows that to be the case.
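
For what it’s worth, here is a minimal sketch of the kind of checking that helps narrow this down. It is not your code: the sizes and variable names are invented, and it assumes the legacy CUBLAS C API from cublas.h, which matches the libcublas.so.2 in your backtrace.

#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    int rows = 512, cols = 512;          /* illustrative sizes */
    double *hostA = NULL, *devA = NULL;
    cublasStatus stat;

    hostA = (double *)malloc((size_t)rows * cols * sizeof(*hostA));
    if (hostA == NULL) { fprintf(stderr, "host malloc failed\n"); return 1; }

    if (cublasInit() != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasInit failed\n"); return 1;
    }

    stat = cublasAlloc(rows * cols, sizeof(*devA), (void **)&devA);
    if (stat != CUBLAS_STATUS_SUCCESS || devA == NULL) {
        fprintf(stderr, "cublasAlloc failed: %d\n", (int)stat); return 1;
    }

    /* Both pointers must be valid here: hostA on the host, devA on the
       device. A bad value in either one can show up as a segfault inside
       the cudaMemcpy that cublasSetMatrix performs, exactly as in the
       backtrace. */
    stat = cublasSetMatrix(rows, cols, sizeof(*hostA), hostA, rows, devA, rows);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasSetMatrix failed: %d\n", (int)stat); return 1;
    }

    cublasFree(devA);
    cublasShutdown();
    free(hostA);
    return 0;
}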

Again, that doesn’t mean anything in the context of MPI. It might well be that you get to see 4 processes hit the error before the MPI job shuts itself down, but that doesn’t mean all 8 don’t have the problem. Who is to say that some of the 8 processes simply never reach the broken code before the first ones do, which then kill the whole MPI job before the rest hit the error?

CUBLAS and CUDA work perfectly with MPI - I have run my CUDA-accelerated version of MPI Linpack on 16 CUDA nodes without a problem, and my group has a lot of MPI/CUDA hybrid applications in daily use on our cluster. My experience tells me you are doing something wrong; you just don’t realize it.

I can confirm that the other 4 reached the same code and went through, because I had them print the return value of cublas_set_matrix. The value is ‘0’, which means the routine completed successfully. The remaining 4 threw the error message rather than a return value. In this way I can also confirm that the cublasAlloc step before cublas_set_matrix succeeded on all nodes.

My other question is: if the host memory pointer is wrong, why did my program with zgemm do everything right? And if the GPU memory pointer is wrong, that is also strange, because all I did with the GPU memory was allocate it, and I was just about to use it.

I appreciate your suggestions very much! But indeed the above questions have haunted me for a long, long time…

Are the nodes that fail always the same physical nodes? Can you run a reduced job on just the nodes you believe work (like only one or two)? Can you put together a much simpler case where each process in the communicator just opens a context on the GPU, allocates some memory, and copies to and from it?
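
Roughly, such a minimal case could look like the sketch below. The buffer size and the rank-to-GPU mapping are just placeholders, and it assumes the CUDA runtime API plus an MPI C compiler wrapper such as mpicc (linking against -lcudart).

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)   /* arbitrary buffer size for the test */

static float in[N], out[N];

int main(int argc, char **argv)
{
    int rank = 0, ndev = 0;
    float *dbuf = NULL;
    cudaError_t err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    if (ndev < 1) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Naive mapping for illustration only: spread ranks over the GPUs
       visible on the node. A real job should use each rank's local
       index on its node (one GPU per process); a sketch of that
       appears further down the thread. */
    cudaSetDevice(rank % ndev);

    memset(in, 1, sizeof(in));

    err = cudaMalloc((void **)&dbuf, sizeof(in));
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc: %s\n", rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Round trip host -> device -> host, then verify. */
    cudaMemcpy(dbuf, in, sizeof(in), cudaMemcpyHostToDevice);
    cudaMemcpy(out, dbuf, sizeof(out), cudaMemcpyDeviceToHost);
    printf("rank %d: copy %s\n", rank,
           memcmp(in, out, sizeof(in)) == 0 ? "ok" : "MISMATCH");

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}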

There could be lots of reasons why - for example, are you really sure that each process in the MPI communicator is actually connecting to a different GPU? You could either be hitting compute exclusivity problems or seeing resource contention if the GPUs aren’t actually all different devices.

Not the GPU pointer necessarily, but the GPU context. The underlying GPU context is tied to the thread that opened it and the device it was opened on. If you are running on cluster nodes with multiple GPUs, it is possible to wind up violating the context-device association, and that will make the program abort.
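
To make that concrete, here is a generic sketch (not specific to your code or cluster) of binding each MPI process to its own GPU: work out a node-local rank from the hostnames, then call cudaSetDevice before anything else touches CUDA or CUBLAS.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0, i, len = 0, local_rank = 0, ndev = 0;
    char name[MPI_MAX_PROCESSOR_NAME];
    char *allnames;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    memset(name, 0, sizeof(name));
    MPI_Get_processor_name(name, &len);

    /* Gather every rank's hostname and count how many lower-numbered
       ranks sit on the same node; that count is this rank's local rank. */
    allnames = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    if (allnames == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  allnames, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);
    for (i = 0; i < rank; i++)
        if (strcmp(allnames + (size_t)i * MPI_MAX_PROCESSOR_NAME, name) == 0)
            local_rank++;
    free(allnames);

    /* With 2 GPUs per node, local ranks 0 and 1 get devices 0 and 1.
       This must happen before the first CUDA/CUBLAS call so that the
       context is created on the intended device. */
    cudaGetDeviceCount(&ndev);
    if (ndev < 1) MPI_Abort(MPI_COMM_WORLD, 1);
    cudaSetDevice(local_rank % ndev);

    printf("rank %d on %s -> GPU %d of %d\n", rank, name, local_rank % ndev, ndev);

    MPI_Finalize();
    return 0;
}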

There are lots of ways that things can go wrong depending on your code and the topology of the cluster you are using.

Thanks very much for such a detailed explanation… I’ll follow your suggestions and try to fix the problem. I’ll keep this thread updated with any progress!