About running CUDA on a GPU cluster

I’m currently running an MPI program which calls a few CUDA routines on a GPU cluster, but there are unexpected runtime errors, like:
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x8
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x8

I’ve debugged the program. It seems that these errors resulted from the call to the routine ‘cublas_set_matrix’ (I’m not 100% sure). I’ve run the same program on a single-node server with one GPU and there was no error. The current running environment is a cluster which I am not that familiar with. What I can tell is that I’m using OpenMPI, and each node in the cluster has 4 cores and 2 GPUs. I’m at a loss now. Can anyone give me hints on the source of these errors?

Is it a must that I assign threads to specific GPUs explicitly?

These people had a successful resolution to a similar issue. Basically, the shared libraries on their cluster were not configured the same on the compute nodes as they were on the head node, and a reinstall of Mesa fixed the problem.

Many thanks! BTW, would you please give the URL of that discussion?

Indeed I knew nothing about Mesa… Is it possible to reinstall it without root permission?

http://www.paraview.org/pipermail/paraview…ary/015635.html

Yes, it is. However, this may or may not have anything to do with your specific problem. My recommendation is to read through that thread to see how they went about diagnosing their error (running with valgrind, etc.). I would also try enlisting the help of the people who configured CUDA on your cluster.

I’ve asked the administrator of the cluster, and I was told that CUDA on the cluster is installed in a shared directory. Does that mean the error did not result from an improper configuration? Is there a possibility that a boundary violation caused this?

For your convenience, here is more of the error message:

[c0219:26117] *** Process received signal ***
[c0219:26117] Signal: Segmentation fault (11)
[c0219:26117] Signal code: Address not mapped (1)
[c0219:26117] Failing at address: 0x8
[c0218:26574] *** Process received signal ***
[c0214:14338] *** Process received signal ***
[c0218:26574] Signal: Segmentation fault (11)
[c0218:26574] Signal code: Address not mapped (1)
[c0218:26574] Failing at address: 0x8
[c0209:12115] *** Process received signal ***
[c0219:26117] [ 0] /lib64/libpthread.so.0 [0x2b64f8f594c0]
[c0219:26117] [ 1] /export/cuda/lib64/libcudart.so.2(cudaMemcpy+0x9e) [0x2b64f96da56e]
[c0219:26117] [ 2] /export/cuda/lib64/libcublas.so.2(cublasSetMatrix+0x14f) [0x2b64f5b457ff]
[c0219:26117] [ 3] ./gpu(cublas_set_matrix_+0x62) [0x51614e]

I ran the program on 8 nodes, and this error occurred on 4 of them.

That is almost certainly a memory management error in your MPI code. Your MPI runtime (it looks like OpenMPI from the form of the error messages) is telling you that your code is trying to access a memory address which is outside of its valid allocation. Your MPI processes are clearly getting an invalid address (the 0x8) from somewhere; that sort of value is clearly neither a host nor a GPU address.

Time to start looking more closely at your code.

But the program runs through fine in a single-node, single-GPU environment. Before I tested it on the cluster, everything was fine on my local CentOS 5.4 system.

In fact, I’m trying to replace the zgemm routine in the original code with cublas_zgemm, and the original version also ran well. My modifications to the code only involve simple routines such as cudaInit and cudaAlloc. The error message says that the call to cublas_set_matrix caused the error. To me it seems there is no problem with my code. And even if there were, since the routine is called by all 8 threads at the same time, it is strange that only 4 of them hit the error.

That’s why I’m so freaked out…

That doesn’t mean anything in the context of MPI programming. If you run on 1 node, your code is only ever running inside a single process’s memory space, and any “accidental” memory usage errors aren’t visible.

Right. And to call that routine you need to provide a host memory pointer and a GPU memory pointer. One of them is wrong. The MPI error trap shows that to be the case.
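
For what it’s worth, here is a minimal sketch of the kind of checking that helps narrow this down. It is not your code: the sizes and variable names are invented, and it assumes the legacy CUBLAS C API from cublas.h, which matches the libcublas.so.2 in your backtrace.

#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    int rows = 512, cols = 512;          /* illustrative sizes */
    double *hostA = NULL, *devA = NULL;
    cublasStatus stat;

    hostA = (double *)malloc((size_t)rows * cols * sizeof(*hostA));
    if (hostA == NULL) { fprintf(stderr, "host malloc failed\n"); return 1; }

    if (cublasInit() != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasInit failed\n"); return 1;
    }

    stat = cublasAlloc(rows * cols, sizeof(*devA), (void **)&devA);
    if (stat != CUBLAS_STATUS_SUCCESS || devA == NULL) {
        fprintf(stderr, "cublasAlloc failed: %d\n", (int)stat); return 1;
    }

    /* Both pointers must be valid here: hostA on the host, devA on the
       device. A bad value in either one can show up as a segfault inside
       the cudaMemcpy that cublasSetMatrix performs, exactly as in the
       backtrace. */
    stat = cublasSetMatrix(rows, cols, sizeof(*hostA), hostA, rows, devA, rows);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasSetMatrix failed: %d\n", (int)stat); return 1;
    }

    cublasFree(devA);
    cublasShutdown();
    free(hostA);
    return 0;
}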

Again, that doesn’t mean anything in the context of MPI. It might well be that you get to see 4 processes hit the error before the MPI job shuts itself down, but that doesn’t mean all 8 don’t have the problem. Who is to say that some of the 8 processes simply never reach the broken code before the first ones do, which then kill the whole MPI job before the rest hit the error?

CUBLAS and CUDA work perfectly with MPI - I have run my CUDA-accelerated version of MPI Linpack on 16 CUDA nodes without a problem, and my group has a lot of MPI/CUDA hybrid applications in daily use on our cluster. My experience tells me you are doing something wrong; you just don’t realize it.

I can confirm that the other 4 reached the same code and went through, because I had them print the return value of cublas_set_matrix. The value is ‘0’, which means the routine completed successfully. The remaining 4 threw the error message rather than a return value. In this way I can also confirm that the cublasAlloc step before cublas_set_matrix succeeded on all nodes.

My other question is: if the host memory pointer is wrong, why did my program with zgemm do everything right? And if the GPU memory pointer is wrong, that is also strange, because all I did with the GPU memory was allocate it, and I was just about to use it.

I appreciate your suggestions very much! But indeed the above questions have haunted me for a long, long time…

Are the nodes that fail always the same physical nodes? Can you run a reduced job on just the nodes you believe work (like only one or two)? Can you put together a much simpler case where each process in the communicator just opens a context on the GPU, allocates some memory, and copies to and from it?
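
Roughly, such a minimal case could look like the sketch below. The buffer size and the rank-to-GPU mapping are just placeholders, and it assumes the CUDA runtime API plus an MPI C compiler wrapper such as mpicc (linking against -lcudart).

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)   /* arbitrary buffer size for the test */

static float in[N], out[N];

int main(int argc, char **argv)
{
    int rank = 0, ndev = 0;
    float *dbuf = NULL;
    cudaError_t err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);
    if (ndev < 1) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Naive mapping for illustration only: spread ranks over the GPUs
       visible on the node. A real job should use each rank's local
       index on its node (one GPU per process); a sketch of that
       appears further down the thread. */
    cudaSetDevice(rank % ndev);

    memset(in, 1, sizeof(in));

    err = cudaMalloc((void **)&dbuf, sizeof(in));
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc: %s\n", rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Round trip host -> device -> host, then verify. */
    cudaMemcpy(dbuf, in, sizeof(in), cudaMemcpyHostToDevice);
    cudaMemcpy(out, dbuf, sizeof(out), cudaMemcpyDeviceToHost);
    printf("rank %d: copy %s\n", rank,
           memcmp(in, out, sizeof(in)) == 0 ? "ok" : "MISMATCH");

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}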

There could be lots of reasons why - for example, are you really sure that each process in the MPI communicator is actually connecting to a different GPU? You could either be hitting compute exclusivity problems or seeing resource contention if the GPUs aren’t actually all different devices.

Not the GPU pointer necessarily, but the GPU context. The underlying GPU context is tied to the thread that opened it and the device it was opened on. If you are running on cluster nodes with multiple GPUs, it is possible to wind up violating the context-device association, and that will make the program abort.
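
To make that concrete, here is a generic sketch (not specific to your code or cluster) of binding each MPI process to its own GPU: work out a node-local rank from the hostnames, then call cudaSetDevice before anything else touches CUDA or CUBLAS.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0, i, len = 0, local_rank = 0, ndev = 0;
    char name[MPI_MAX_PROCESSOR_NAME];
    char *allnames;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    memset(name, 0, sizeof(name));
    MPI_Get_processor_name(name, &len);

    /* Gather every rank's hostname and count how many lower-numbered
       ranks sit on the same node; that count is this rank's local rank. */
    allnames = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    if (allnames == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  allnames, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);
    for (i = 0; i < rank; i++)
        if (strcmp(allnames + (size_t)i * MPI_MAX_PROCESSOR_NAME, name) == 0)
            local_rank++;
    free(allnames);

    /* With 2 GPUs per node, local ranks 0 and 1 get devices 0 and 1.
       This must happen before the first CUDA/CUBLAS call so that the
       context is created on the intended device. */
    cudaGetDeviceCount(&ndev);
    if (ndev < 1) MPI_Abort(MPI_COMM_WORLD, 1);
    cudaSetDevice(local_rank % ndev);

    printf("rank %d on %s -> GPU %d of %d\n", rank, name, local_rank % ndev, ndev);

    MPI_Finalize();
    return 0;
}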

There are lots of ways that things can go wrong depending on your code and the topology of the cluster you are using.

Thanks very much for such a detailed explanation… I’ll follow your suggestions and try to fix the problem. I’ll keep this thread updated with any progress!