An error occurred when using MPI and OpenACC together

I am using MPI and OpenACC to parallelize across multiple GPU hosts. Before I used the cuBLAS library everything was relatively normal, but when I added cublasCgemm to the program, a problem appeared.
The code is below. It is a single function; I cannot provide the full program.

void CBF(float *pow, cuComplex *s_fft, cuComplex *a_many, cublasHandle_t handle,
         int start, int end, int M, int N, int NN, int level_Beam, int V_Beam)
{
    cuComplex alpha = {1, 0};
    cuComplex beta  = {0, 0};
    int uu, vv, l;
    float a_acc = 0;
    cuComplex *Temp  = (cuComplex*)malloc(NN * sizeof(cuComplex));
    cuComplex *a_one = (cuComplex*)malloc(M * N * sizeof(cuComplex));   /* M*N == 280 here */

    #pragma acc data create(Temp[0:NN], a_one[0:280])
    {
        for (uu = start; uu < end; uu++)
        for (vv = 0; vv < 11; vv++)
        {
            /* copy the (uu,vv) slice of a_many into a_one and push it to the device */
            for (l = 0; l < M * N; l++) a_one[l] = a_many[(uu * 11 + vv) * M * N + l];
            #pragma acc update device(a_one)

            /* hand the device pointers to cuBLAS: (240000 x 280) * (280 x 1) */
            #pragma acc host_data use_device(s_fft, a_one, Temp)
            {
                cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, 240000, 1, 280,
                            &alpha, s_fft, 240000, a_one, 1, &beta, Temp, 240000);
            }

            /* accumulate |Temp[l]| on the device */
            a_acc = 0;
            #pragma acc kernels loop present(Temp[0:NN]) reduction(+:a_acc)
            for (l = 0; l < NN; l++)
            {
                a_acc = a_acc + my_abs(Temp[l]);
            }

            pow[uu * 11 + vv] = a_acc;
        }
    }
    free(Temp);
    free(a_one);
}
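(my_abs is not shown in this fragment; presumably it returns the magnitude of a cuComplex value. A minimal sketch of such a helper, under that assumption — the acc routine directive lets it be called from the device loop:)

#include <math.h>
#include <cuComplex.h>

/* hypothetical helper, assumed to be the complex magnitude */
#pragma acc routine seq
static inline float my_abs(cuComplex c)
{
    return sqrtf(c.x * c.x + c.y * c.y);
}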

The above code runs correctly when compiled as a standalone OpenACC program. There is no problem running the MPI code on a single machine, but once I use MPI to run on two machines, the results are as follows:

orin@orin-desktop:~/Desktop/mpi$ mpiexec --hostfile host  -np 2 mpicode
--------------------------------------------------------------------------
[[65144,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: orin-desktop

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
[orin-1:501480] *** Process received signal ***
[orin-1:501480] Signal: Aborted (6)
[orin-1:501480] Signal code:  (-6)
[orin-1:501480] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, 
	 at ../../src-cupti/prof_cuda_cupti.c:338.
[orin-desktop:2499968] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[orin-desktop:2499968] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 501480 on node node2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

After I commented out cublasCgemm(), the result is as follows

orin@orin-desktop:~/Desktop/mpi$ mpiexec --hostfile host  -np 2 mpicode
--------------------------------------------------------------------------
[[62323,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: orin-1

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, 
	 at ../../src-cupti/prof_cuda_cupti.c:338.
 microseconds on OpenACC
591287 microseconds on OpenACC

So I think the problem lies in the cublasCgemm() call, but why would this function cause MPI multi-machine failures?

The OpenFabrics warning about missing network interfaces should be safe to ignore.

While I’m not sure why profiling is enabled, the CUPTI error (CUPTI is the low-level profiling library) is likely safe to ignore as well.

As to why you’re getting the sig 6 error, I’m not sure. It doesn’t make sense to me why the error would be with cublasCgemm since it doesn’t have a notion of multiple ranks.

There is no problem running the MPI code on a single machine, but once I use MPI to run on two machines, the results are as follows

Are you running multiple ranks on the single system? In other words, is this a multi-rank issue or a multi-node issue?

How are you doing the rank to device binding?
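(For reference, rank-to-device binding with OpenACC is usually done right after MPI_Init with acc_set_device_num; a minimal sketch, assuming a simple round-robin mapping of ranks to GPUs — on an AGX Orin there is only one GPU per node, so every rank would get device 0. The function name is mine:)

#include <mpi.h>
#include <openacc.h>

/* hypothetical sketch: pick a GPU for this rank */
static void bind_rank_to_gpu(void)
{
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0) {
        /* simple round-robin; a per-node local rank is more robust
           when nodes have different GPU counts */
        acc_set_device_num(rank % ngpus, acc_device_nvidia);
    }
}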

Perhaps what I said was not very clear, so I made a simple code example, shown below. It runs an MPI+OpenACC program on two NVIDIA AGX Orins, and on each machine the same sum is computed with an OpenACC reduction and with cublasScasum. The result is quite clear: on the host the cuBLAS result and timing are normal, but on the other node they are wrong.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <openacc.h>
//#include <accelmath.h>
#include <cublas_v2.h>
//#include "cublasXt.h"

#include "mpi.h"
#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*1000)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif


int main( int argc, char* argv[] )
{
    	int  myid, numprocs;	
    	MPI_Init(&argc,&argv);
    	MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    	MPI_Comm_rank(MPI_COMM_WORLD,&myid);	
    	MPI_Status status;
//    	char hname[32];
//    	gethostname(hname,32);
    int n;      /* size of the vector */
    int i;
    timestruct t1, t2, t3;
    long long t_acc, t_blas;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100005;
    if( n <= 0 ) n = 100005;
    cublasStatus_t stat = CUBLAS_STATUS_SUCCESS;
    cublasHandle_t handle;
    stat = cublasCreate(&handle);   /* create the handle once and check the status */
    if ( CUBLAS_STATUS_SUCCESS != stat ) {
        printf("CUBLAS initialization failed\n");
    }
        float a_acc={0} ;
        float a_blas={0};
        cuComplex* e_count = (cuComplex*)malloc(n*sizeof(cuComplex));
        const int incx=1;
    for( i = 0; i < n; ++i )
    {
     e_count[i].x = 1.3;
     e_count[i].y = 1.2;
    }
    #pragma acc  data  copyin(e_count[0:n],n)
   {
       gettime( &t1 );
    #pragma acc parallel reduction(+:a_acc)
    for( i = 0; i < n; ++i ){
        a_acc=a_acc+e_count[i].x+e_count[i].y;
    }
    gettime( &t2 );
    #pragma acc host_data use_device(e_count)
    {

        cublasScasum(handle,n,e_count,incx,&a_blas);
    }
    }
    gettime( &t3 );
     t_acc = usec(t1,t2);
     t_blas = usec(t2,t3);
    printf( "a_blas=%f \n", a_blas);
    printf( "a_acc=%f \n", a_acc);
    /* check the results */
    if(a_acc!=a_blas)printf( "Test FAILED\n");
//    printf("My name is %s\n",hname);
    printf( "%13d iterations completed\n", n );
    printf( "%13ld microseconds on OpenACC\n",t_acc );
    printf( "%13ld microseconds on cublas\n",t_blas );        
	MPI_Finalize();  
    return 0;
}

The result is as follows:

[[40454,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: orin-desktop

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
a_blas=0.000000 
a_acc=250012.500000 
Test FAILED
       100005 iterations completed
          683 microseconds on OpenACC
           35 microseconds on cublas
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, 
	 at ../../src-cupti/prof_cuda_cupti.c:338.
a_blas=250012.500000 
a_acc=250012.500000 
       100005 iterations completed
          279 microseconds on OpenACC
          176 microseconds on cublas
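(As an aside, the FAILED on the first rank comes from a_blas being 0.0, but note that the test compares floats with !=; since the OpenACC and cuBLAS reductions may round differently, a relative-tolerance check is more robust. A minimal sketch, with an assumed tolerance that is not part of the original program:)

#include <math.h>

/* hypothetical replacement for the exact != comparison */
static int nearly_equal(float a, float b)
{
    const float rel_tol = 1e-5f;   /* assumed tolerance */
    return fabsf(a - b) <= rel_tol * fmaxf(fabsf(a), fabsf(b));
}

/* usage: if (!nearly_equal(a_acc, a_blas)) printf("Test FAILED\n"); */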

I’m not sure I’m reproducing your error correctly, but I’ll walk through what I’m seeing.

I jumped onto an Orin system with CUDA 11.4 installed. I compiled your program using NVHPC v22.3, which ships with CUDA 11.8 (as well as 12.0). The cuBLAS library fails to initialize with error 3 (unable to allocate). If I switch to the local CUDA 11.4 install by setting the environment variable “LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64”, then the program works.

So in my case, the problem appears to be a mismatch in the CUDA versions being used: the CUDA 11.8 cuBLAS fails to initialize on a system with a CUDA 11.4 driver.

I don’t know if this is what’s happening on your system since I would suspect the same failure would occur on both systems, but you might give it a try.
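(One way to check this per node is to have every rank print the CUDA and cuBLAS versions it actually loads at run time; a minimal sketch, assuming a cuBLAS handle has already been created — the helper name is mine:)

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>

/* hypothetical per-rank diagnostic to spot a node-to-node version mismatch */
static void report_versions(cublasHandle_t handle)
{
    int rank = 0, drv = 0, rt = 0, blas = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaDriverGetVersion(&drv);       /* highest CUDA version the driver supports */
    cudaRuntimeGetVersion(&rt);       /* CUDA runtime linked into the binary */
    cublasGetVersion(handle, &blas);  /* cuBLAS library picked up at run time */
    printf("rank %d: driver %d, runtime %d, cuBLAS %d\n", rank, drv, rt, blas);
}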

-Mat

Hello Mat, I tried modifying my .bashrc file as follows:

export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH

export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/cuda-11.4/targets/aarch64-linux/include

MANPATH=$MANPATH:/opt/nvidia/hpc_sdk/Linux_aarch64/23.1/compilers/man; export MANPATH
PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/23.1/compilers/bin:$PATH; export PATH

export PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/23.1/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:/opt/nvidia/hpc_sdk/Linux_aarch64/23.1/comm_libs/mpi/man

The current error is the same as yours, indicating an initialization failure. I think it may be due to an incorrect compilation method on my part. Can you tell me how to compile MPI+OpenACC code?

CUBLAS initialization failed
a_blas=0.000000 
a_acc=250012.500000 
Test FAILED

I compile and run with:

mpicc -o out  -acc -gpu=cc87 -Mcudalib -Minfo=accel mpi_speed.c
mpiexec --hostfile host -np 2 out

The only thing I’m doing differently is the addition of the “-cuda” flag, as well as only linking with cuBLAS, i.e. “-cudalib=cublas”.

mpicc -o out -acc -gpu=cc87 -cuda -cudalib=cublas -Minfo=accel mpi_speed.c

I tried compiling with those flags, but it now fails with a new error at run time.

orin@orin-desktop:~/Desktop/new$ mpirun --hostfile host -np 2 out
--------------------------------------------------------------------------
[[61906,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: orin-1

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
CUBLAS initialization failed
[orin-1:187020] *** Process received signal ***
[orin-1:187020] Signal: Segmentation fault (11)
[orin-1:187020] Signal code: Address not mapped (1)
[orin-1:187020] Failing at address: 0x3e923e41be973e2b
[orin-1:187020] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, 
	 at ../../src-cupti/prof_cuda_cupti.c:338.
a_blas=250012.500000 
a_acc=250012.500000 
       100005 iterations completed
          312 microseconds on OpenACC
          167 microseconds on cublas
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 187020 on node node2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[orin-desktop:10764] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[orin-desktop:10764] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I think it may be due to the version mismatch you mentioned. I used the following commands to check a few things:

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:34:49_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
nvaccelinfo

CUDA Driver Version:           11040
NVRM version:                  NVIDIA UNIX Open Kernel Module for aarch64  35.1.0  Release Build  (buildbrain@mobile-u64-5273-d7000)  Wed Aug 10 20:32:39 PDT 2022

But I have already added the following command to my .bashrc:
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH

I am using HPC SDK 23.1, which is bundled with the newest plus two previous CUDA versions (12.0, 11.8, 11.0).

Is the LD_LIBRARY_PATH setting getting inherited by the second rank?

Typically this is done by wrapping the binary in a shell script and then launching the job using the script. Something like:

#!/bin/bash
EXE=$1
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH
$EXE

mpirun --hostfile host -np 2 bash run.sh out

Thank you very much. I’m going to learn how to use script files. I just tried creating a new run.sh file in the current folder, writing the command you provided into it, and running it with

mpirun --hostfile host -np 2 bash run.sh out

However, I get the following message:

run.sh: line 4: out: command not found

Try

mpirun --hostfile host -np 2 bash run.sh ./out

or

mpirun --hostfile host -np 2 bash run.sh /the/full/path/to/out

Of course, change this to be the actual path.

Oh, oh, oh, thank you very much! I had been trying for almost a week, and with your help I have finally solved this problem. The code now runs successfully on both Orins, and the times and results are correct.
