What I tried to do was simply apply cublasDgemm (matrix-matrix multiplication) to several matrices of "double" (8-byte) elements, all of which have one very large dimension. In my case, the matrices are 12755046 by 46. Put simply: A[46,12755046] * B_i[12755046,46] = C_i[46,46], where i = 1, 2, 3, …

The machine has 128 GB of host memory and two RTX 2080 Ti cards (11 GB of GPU memory each), so my original strategy was to distribute each B_i to its own GPU. However, I always get an INTERNAL ERROR when I execute my code on two GPUs.

So I worked around the problem by trying three things: 1. use one GPU only: no error. 2. shrink the matrices but keep using two GPUs: no error. 3. use cublasXt, which implicitly uses both GPUs: no error.
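For reference, the cublasXt variant (item 3) looked roughly like this. This is a minimal sketch of my working fallback, not the exact code; the dimensions and device list match the setup described above, and cublasXt splits the GEMM across the selected devices internally:

```cuda
#include <cublasXt.h>

// Sketch: C[46,46] = A[46,12755046] * B[12755046,46] across two GPUs.
void xt_gemm(const double *A, const double *B, double *C)
{
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[2] = {0, 1};                  // let cublasXt use both GPUs
    cublasXtDeviceSelect(handle, 2, devices);

    double alpha = 1.0, beta = 0.0;
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  46, 46, 12755046,
                  &alpha, A, 46, B, 12755046, &beta, C, 46);

    cublasXtDestroy(handle);                  // cublasXtDgemm is blocking, so C is ready here
}
```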

Though it is worked around, I am still interested in understanding why my original plan fails for large matrices. I am guessing this is due to some internal limitation of cuBLAS, or some configuration I missed?

I attached my simplified code here to illustrate my original plan:

double *A, *B[2], *C[2];
cudaMallocManaged(&A, 46*12755046*sizeof(double));
cudaMallocManaged(&B[0], 46*12755046*sizeof(double));
cudaMallocManaged(&B[1], 46*12755046*sizeof(double));
cudaMallocManaged(&C[0], 46*12755046*sizeof(double));
cudaMallocManaged(&C[1], 46*12755046*sizeof(double));

givevalueto(A);
givevalueto(B[0]);
givevalueto(B[1]);

double alpha = 1.0;
double beta = 0.0;

cublasHandle_t handle[nGPUs];
int iGPU;
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cublasCreate(&handle[iGPU]);
}
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaSetDevice(iGPU);
    cublasDgemm(handle[iGPU], CUBLAS_OP_N, CUBLAS_OP_N,
                46, 46, 12755046,
                &alpha, A, 46, B[iGPU], 12755046, &beta, C[iGPU], 46);
}
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaSetDevice(iGPU);
    cudaDeviceSynchronize();
}
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaFree(B[iGPU]);
}
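One configuration detail I am unsure about: in my plan all handles are created before any cudaSetDevice call, so both may be bound to device 0 (my understanding is that a cuBLAS handle is tied to whichever device is current at cublasCreate time). A variant that creates each handle on its own device would look like this sketch, which I have not verified fixes the error:

```cuda
// Sketch: make each device current *before* creating its handle,
// so handle[iGPU] is associated with GPU iGPU rather than GPU 0.
for(iGPU = 0; iGPU < nGPUs; iGPU++)
{
    cudaSetDevice(iGPU);
    cublasCreate(&handle[iGPU]);
}
```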