I am using MPI and OpenACC to parallelize a program across multiple GPU hosts. Everything worked normally before I used the cuBLAS library, but once I added cublasCgemm to the program, a problem appeared.
The relevant function is shown below; I cannot post the complete code.
void CBF(float *pow, cuComplex *s_fft, cuComplex *a_many, cublasHandle_t handle,
         int start, int end, int M, int N, int NN, int level_Beam, int V_Beam)
{
    cuComplex alpha = {1, 0};
    cuComplex beta  = {0, 0};
    int uu, vv, l;
    float a_acc = 0;
    cuComplex *Temp  = (cuComplex*)malloc(NN * sizeof(cuComplex));    /* per-beam GEMM result */
    cuComplex *a_one = (cuComplex*)malloc(M * N * sizeof(cuComplex)); /* steering vector for one beam */

    #pragma acc data create(Temp[0:NN], a_one[0:M*N])
    {
        for (uu = start; uu < end; uu++)
            for (vv = 0; vv < 11; vv++)
            {
                /* Copy the steering vector for beam (uu,vv) and push it to the device. */
                for (l = 0; l < M*N; l++) a_one[l] = a_many[(uu*11 + vv)*M*N + l];
                #pragma acc update device(a_one[0:M*N])

                /* Temp = s_fft (240000 x 280) * a_one^T (280 x 1), using device pointers. */
                #pragma acc host_data use_device(s_fft, a_one, Temp)
                {
                    cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                                240000, 1, 280, &alpha,
                                s_fft, 240000, a_one, 1, &beta, Temp, 240000);
                }

                /* Accumulate the beam power on the device. */
                a_acc = 0;
                #pragma acc kernels loop present(Temp[0:NN]) reduction(+:a_acc)
                for (l = 0; l < NN; l++)
                {
                    a_acc = a_acc + my_abs(Temp[l]);
                }
                pow[uu*11 + vv] = a_acc;
            }
    }
    free(Temp);
    free(a_one);
}
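For context, the cuBLAS handle and the GPU each rank uses are set up roughly like this before CBF() is called. This is only a simplified sketch of the driver, not my exact code; the beam-range split across ranks is shown as a comment and the array allocation is omitted.

/* driver_sketch.c -- simplified sketch of the per-rank setup (not my exact driver). */
#include <mpi.h>
#include <openacc.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Bind this rank to a GPU before any OpenACC region or cuBLAS call runs. */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);

    /* One cuBLAS handle per rank, created after the device has been selected. */
    cublasHandle_t handle;
    cublasStatus_t stat = cublasCreate(&handle);
    printf("rank %d of %d: cublasCreate returned %d\n", rank, size, (int)stat);

    /* Each rank then works on a contiguous slice of the beam range
       (allocation and data movement omitted here):
         int chunk = level_Beam / size;
         int start = rank * chunk;
         int end   = (rank == size - 1) ? level_Beam : start + chunk;
         CBF(pow, s_fft, a_many, handle, start, end, M, N, NN, level_Beam, V_Beam);
    */

    cublasDestroy(handle);
    MPI_Finalize();
    return 0;
}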
The CBF() function above runs without problems when executed standalone with OpenACC, and it also runs fine under MPI on a single machine. But as soon as I run it with MPI across two machines, I get the following output:
orin@orin-desktop:~/Desktop/mpi$ mpiexec --hostfile host -np 2 mpicode
--------------------------------------------------------------------------
[[65144,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: orin-desktop
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
[orin-1:501480] *** Process received signal ***
[orin-1:501480] Signal: Aborted (6)
[orin-1:501480] Signal code: (-6)
[orin-1:501480] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES,
at ../../src-cupti/prof_cuda_cupti.c:338.
[orin-desktop:2499968] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[orin-desktop:2499968] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 501480 on node node2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
After I commented out the cublasCgemm() call, the output is as follows:
orin@orin-desktop:~/Desktop/mpi$ mpiexec --hostfile host -np 2 mpicode
--------------------------------------------------------------------------
[[62323,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: orin-1
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES,
at ../../src-cupti/prof_cuda_cupti.c:338.
microseconds on OpenACC
591287 microseconds on OpenACC
So I suspect the problem lies in the cublasCgemm() call, but why would this function cause the MPI job to fail only when running across multiple machines?
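For reference, this is the kind of minimal test I plan to run on both nodes to check whether a single cublasCgemm call already fails under mpiexec, independent of the rest of my program. It is only a sketch; the tiny 4x4 problem and the OpenACC data region are just for the test.

/* cublas_mpi_test.c -- minimal sketch: one small cublasCgemm per MPI rank. */
#include <mpi.h>
#include <openacc.h>
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);

    cublasHandle_t handle;
    cublasStatus_t stat = cublasCreate(&handle);
    printf("rank %d: cublasCreate returned %d\n", rank, (int)stat);

    /* Tiny 4x4 complex matrices filled with (1,0); C should come back as (4,0). */
    const int n = 4;
    cuComplex *A = (cuComplex*)malloc(n * n * sizeof(cuComplex));
    cuComplex *B = (cuComplex*)malloc(n * n * sizeof(cuComplex));
    cuComplex *C = (cuComplex*)malloc(n * n * sizeof(cuComplex));
    for (int i = 0; i < n * n; i++) {
        A[i] = make_cuComplex(1.0f, 0.0f);
        B[i] = make_cuComplex(1.0f, 0.0f);
        C[i] = make_cuComplex(0.0f, 0.0f);
    }
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);

    #pragma acc data copyin(A[0:n*n], B[0:n*n]) copy(C[0:n*n])
    {
        /* Same pattern as in CBF(): hand the device pointers to cuBLAS. */
        #pragma acc host_data use_device(A, B, C)
        {
            stat = cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                               n, n, n, &alpha, A, n, B, n, &beta, C, n);
        }
        cudaDeviceSynchronize();   /* make sure the GEMM finished before C is copied back */
    }
    printf("rank %d: cublasCgemm returned %d, C[0] = (%f, %f)\n",
           rank, (int)stat, C[0].x, C[0].y);

    cublasDestroy(handle);
    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}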