Inverting a large matrix distributed across two A6000 GPUs with NVLink

Hello,

Starting from the examples in the MAGMA documentation, I am trying to write a simple program that inverts a large matrix using two RTX A6000 GPUs connected by NVLink at the same time. Ideally, I would like to combine the power of the two cards.

With a single GPU card, I can invert a matrix of size ~50,000 x 50,000.

For my code, I need to invert a matrix of size 120,000 x 120,000, so I wonder whether it is possible to use both cards simultaneously to carry out this inversion.

Here are my compilation flags:

CXX = nvcc -O3
LAPACK = /opt/intel/oneapi/mkl/latest/lib/intel64
MAGMA = /usr/local/magma
INCLUDE_CUDA=/usr/local/cuda/include
LIBCUDA=/usr/local/cuda/lib64
CXXFLAGS = -c -I${MAGMA}/include -I${INCLUDE_CUDA} -lpthread
LDFLAGS = -L${LAPACK} -lmkl_intel_lp64 -L${LIBCUDA} -lcuda -lcudart -lcublas -L${MAGMA}/lib -lmagma -lpthread
SOURCES = example_double_MAGMA_NVIDIA.cpp
EXECUTABLE = main_magma_double_example.exe

and here is the latest version of my attempt:

#include <stdio.h>
#include <cuda.h>
#include "magma_v2.h"
#include "magma_lapack.h"
#define min(a,b) (((a)<(b))?(a):(b))

int main( int argc, char** argv)
{
    magma_init();                              // initialize Magma
    int num_gpus = 2;
    magma_setdevice(0);
    magma_queue_t queues[num_gpus];
    for( int dev = 0; dev < num_gpus; ++dev ) {
        magma_queue_create( dev, &queues[dev] );
    }
    magma_int_t err;
    real_Double_t cpu_time, gpu_time;
    magma_int_t m = 8192, n = 8192;            // a,r - mxn matrices
    magma_int_t mm = m*n;
    magma_int_t nrhs = 100;                    // b - nxnrhs, c - mxnrhs matrices
    magma_int_t *ipiv;                         // array of indices of interchanged rows
    magma_int_t n2 = m*n;                      // size of a,r
    magma_int_t nnrhs = n*nrhs;                // size of b
    magma_int_t mnrhs = m*nrhs;                // size of c
    double *a, *r;                             // a,r - mxn matrices on the host
    double *b, *c;                             // b - nxnrhs, c - mxnrhs matrices on the host
    double *dwork;                             // dwork - workspace
    magmaDouble_ptr d_la[num_gpus];
    double alpha = 1.0, beta = 0.0;            // alpha=1, beta=0
    magma_int_t ldwork;                        // size of dwork
    ldwork = m * magma_get_dgetri_nb( m );     // optimal block size

    magma_int_t n_local;
    magma_int_t ione = 1, info;
    magma_int_t i, min_mn = min(m,n), nb;
    magma_int_t ldn_local;                     // mxldn_local - size of the part of a
    magma_int_t ISEED[4] = {0,0,0,1};          // on i-th device
    nb = magma_get_dgetrf_nb(m,n);             // optim. block size for dgetrf

    // allocate memory on cpu
    ipiv = (magma_int_t*) malloc(min_mn*sizeof(magma_int_t)); // host memory for ipiv
    err = magma_dmalloc_cpu(&a, n2);           // host memory for a
    err = magma_dmalloc_pinned(&r, n2);        // host memory for r
    err = magma_dmalloc_pinned(&b, nnrhs);     // host memory for b
    err = magma_dmalloc_pinned(&c, mnrhs);     // host memory for c

    // allocate device memory on num_gpus devices
    for(i = 0; i < num_gpus; i++){
        n_local = ((n/nb)/num_gpus)*nb;
        if (i < (n/nb)%num_gpus)
            n_local += nb;
        else if (i == (n/nb)%num_gpus)
            n_local += n%nb;
        ldn_local = ((n_local+31)/32)*32;
        magma_setdevice(i);
        err = magma_dmalloc(&d_la[i], m*ldn_local); // device memory on i-th device
    }
    magma_setdevice(0);

    lapackf77_dlarnv(&ione, ISEED, &mm, a);    // randomize a

    // copy the corresponding parts of the matrix a to num_gpus devices
    magma_dsetmatrix_1D_col_bcyclic( num_gpus, m, n, nb, a, m, d_la, m, queues );

    // LU decomposition on num_gpus devices with partial pivoting
    // and row interchanges; row i is interchanged with row ipiv(i)
    gpu_time = magma_sync_wtime(NULL);
    magma_dgetrf_mgpu( num_gpus, m, n, d_la, m, ipiv, &info);
    magma_dgetri_gpu( m, a, m, ipiv, dwork, ldwork, &info);
    gpu_time = magma_sync_wtime(NULL) - gpu_time;
    printf("magma_dgetrf_mgpu time: %7.5f sec.\n", gpu_time);

    // print part of the result
    printf("upper left corner of a^-1*a:\n");
    magma_dprint( 4, 4, a, m);

    free(ipiv);                                // free host memory
    free(a);                                   // free host memory
    magma_free_pinned(r);                      // free host memory
    magma_free_pinned(b);                      // free host memory
    magma_free_pinned(c);                      // free host memory
    for(i = 0; i < num_gpus; i++){
        magma_free(d_la[i]);                   // free device memory
    }
    for( int dev = 0; dev < num_gpus; ++dev ) {
        magma_queue_destroy( queues[dev] );
    }
    magma_finalize();
}

Everything compiles fine, but I get the following errors at execution:

CUBLAS error: memory mapping error (11) in magma_dtrtri_gpu at /home/henry/magma-2.6.1/src/dtrtri_gpu.cpp:162
CUBLAS error: memory mapping error (11) in magma_dtrtri_gpu at /home/henry/magma-2.6.1/src/dtrtri_gpu.cpp:172
CUBLAS error: memory mapping error (11) in magma_dtrtri_gpu at /home/henry/magma-2.6.1/src/dtrtri_gpu.cpp:162
CUBLAS error: memory mapping error (11) in magma_dtrtri_gpu at /home/henry/magma-2.6.1/src/dtrtri_gpu.cpp:172
CUDA runtime error: an illegal memory access was encountered (700) in magma_dtrtri_gpu at /home/henry/magma-2.6.1/src/dtrtri_gpu.cpp:173
CUBLAS error: memory mapping error (11) in magma_dtrtri_gpu at /home/henry/magma-2.6.1/src/dtrtri_gpu.cpp:162
CUDA runtime error: an illegal memory access was encountered (700) in magma_dtrtri_gpu at /home/henry/magma-2.6.1/src/dtrtri_gpu.cpp:163

However, I think I have initialized the variable d_la correctly, but it seems some coding errors remain.

EDIT :

I was told by an HPC engineer:

"The easiest way will be to use the Makefiles until we figure out how cmake can support that. If you do that, you can just replace LAPACKE_dgetrf by magma_dgetrf. MAGMA will use internally one GPU with an out-of-memory algorithm that will fully factor the matrix, even if it is large and does not fit into the memory of the GPU."

Does this mean that I have to find the appropriate Makefile flags to be able to use magma_dgetrf instead of LAPACKE_dgetrf?
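For reference, my understanding is that the Makefile build reads a make.inc file at the top of the MAGMA source tree, and the sources ship templates under make.inc-examples/ (e.g. one for MKL with GCC). A sketch of what I would try; the variable names come from those templates, and the paths are my assumptions to adapt:

```make
# make.inc sketch -- start from make.inc-examples/make.inc.mkl-gcc and adapt
BACKEND    = cuda                       # build the CUDA backend
FORT       = true                       # enable the Fortran pieces/testers
GPU_TARGET = Ampere                     # RTX A6000 is Ampere (sm_86)
CUDADIR    = /usr/local/cuda            # assumed CUDA install path
MKLROOT    = /opt/intel/oneapi/mkl/latest   # assumed MKL install path
```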

And regarding the second sentence, it is said that

"MAGMA will use internally one GPU with an out-of-memory algorithm that will fully factor the matrix"

Does this mean that if my matrix is over 48 GB, MAGMA will be able to spill the rest onto the second A6000, or into host RAM, and perform the inversion of the full matrix?

Please let me know which flags I should use to build MAGMA correctly for my case.

Currently, I do:

$ mkdir build && cd build
$ cmake -DUSE_FORTRAN=ON  \
-DGPU_TARGET=Ampere \
-DLAPACK_LIBRARIES="/opt/intel/oneapi/intelpython/latest/lib/liblapack.so" \
-DMAGMA_ENABLE_CUDA=ON ..
$ cmake --build . --config Release

If someone could help me fix this error at execution, that would be great.

Any help is welcome,

Regards

Hi christophe.petit09,

Since MAGMA is a third party library, my suggestion would be to post this question on their user forum since they should be better able to help. https://groups.google.com/a/icl.utk.edu/g/magma-user

Though, I see that this question has already been asked and responded to, though by someone with a different email address: https://groups.google.com/a/icl.utk.edu/g/magma-user/c/hIGzMpKVELM

-Mat