PGI Fortran large matrix

I’m trying to use cuSOLVER from Fortran, and I succeeded with a 3 by 3 matrix. However, when I try a larger, sparse matrix (348k by 348k), the program crashes with what I suspect is insufficient memory.

I tried with -stack=50000000

Would anyone please advise?

  real, allocatable :: A(:,:), ATest(:,:)
  real, allocatable :: B(:,:)
  real, allocatable :: X(:,:)
  integer, allocatable :: Ipiv(:)
  integer, target :: Lwork
  integer, allocatable :: IRN(:), JCN(:)   ! row/column indices should be integer, not real
  real, allocatable :: val(:)

......

  open(121, file='Input.dat')
    read(121, *) lda, ldb, nnz       ! <--- 348780, 348780, 4492548

  allocate(A(lda,ldb))
  allocate(ATest(lda,ldb))
  allocate(B(n,1))
  allocate(X(n,1))
  allocate(Ipiv(lda))
  allocate(IRN(nnz))
  allocate(JCN(nnz))
  allocate(val(nnz))

  A_size = SIZEOF(A)
  B_size = SIZEOF(B)
  X_size = SIZEOF(X)
  Ipiv_size = SIZEOF(Ipiv)
  devInfo_size = SIZEOF(devInfo)
  Lwork_size = SIZEOF(Lwork)

  do i=1, nnz
    read(121, *) IRN(i), JCN(i), val(i)
    col = IRN(i)
    row = JCN(i)
    A(row,col)  = val(i)
    print *, row," ",col, " ", val(i)     !<----- error happens when col > 8000
  enddo
  close(121)

Thank you.

Hi Ceeely,

When using allocatable arrays larger than 2GB, you need to compile with the flag “-Mlarge_arrays” or “-mcmodel=medium”. You may also need to declare your index variables as integer(8) so they can hold the larger index values.

You may still have problems, though: a 348780x348780 single-precision array is 348780 × 348780 × 4 bytes ≈ 487 GB, and you have two of them. Unless you have a really big system, you should consider distributing this code across multiple MPI ranks and multiple nodes.

  • Mat

Hmm, that doesn’t work; the program still crashed with:

0: ALLOCATE: 1258649152 bytes requested; not enough memory

I’m thinking:

Since it’s a sparse matrix and I intend to use cusolverSp, perhaps I can interface to, or convert, the C++ example from the NVIDIA sample pack to Fortran and run it with the PGI compiler? Do you have any experience with this?
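For reference, the cusolverSp routines take CSR input, so the COO triplets I’m reading would first need converting — something along these lines (my own C sketch with 0-based indices, so my 1-based Fortran indices would need shifting by one):

```c
#include <stdlib.h>
#include <string.h>

/* Convert a COO matrix (row/col/val triplets, 0-based) to CSR.
   n   = number of rows, nnz = number of nonzeros.
   Caller provides csr_row_ptr[n+1], csr_col_ind[nnz], csr_val[nnz]. */
void coo_to_csr(int n, int nnz,
                const int *row, const int *col, const float *val,
                int *csr_row_ptr, int *csr_col_ind, float *csr_val)
{
    memset(csr_row_ptr, 0, (n + 1) * sizeof(int));

    /* Count nonzeros per row. */
    for (int i = 0; i < nnz; i++)
        csr_row_ptr[row[i] + 1]++;

    /* Prefix sum gives the row start offsets. */
    for (int r = 0; r < n; r++)
        csr_row_ptr[r + 1] += csr_row_ptr[r];

    /* Scatter entries into place, using a scratch copy of the offsets. */
    int *next = malloc(n * sizeof(int));
    memcpy(next, csr_row_ptr, n * sizeof(int));
    for (int i = 0; i < nnz; i++) {
        int dst = next[row[i]]++;
        csr_col_ind[dst] = col[i];
        csr_val[dst]     = val[i];
    }
    free(next);
}
```

The CSR arrays are what would then be handed to the solver, and at 4.5M nonzeros they are tiny compared to the dense 487 GB matrix.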

Thanks, Mat.

This means that your device ran out of memory, which is a hard limit. You’ll need to either:

  • Only run smaller problems.

  • Get a new card with more memory.

  • Write the algorithm to block the compute so that it’s only working on a sub-set of the data at a time, where that data fits on the device.

  • Distribute the computation across multiple devices.

  • Mat

Thanks for the tip, Mat.

  • Write the algorithm to block the compute so that it’s only working on a sub-set of the data at a time where that data fits on the device

I’m no math genius, but how can we do that? Is there reference code I can refer to?

Here’s an example from my chapter in the “Parallel Programming with OpenACC” book.


Note that using the acc_map_data API is optional. You could also use data regions to copy the data to/from the device. Mostly, the example is there to show you what I mean by blocking.
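The gist of blocking, in C with OpenACC (an illustrative sketch of my own, not the book listing): process the matrix in panels of rows so only one panel is resident on the device at a time.

```c
#include <stdlib.h>

/* Blocked matrix-vector product y = A*x for an n x n row-major matrix.
   Only `bs` rows of A are on the device at a time, so device memory use
   is roughly bs*n + 2*n floats instead of n*n.  (Sketch only.) */
void blocked_matvec(int n, int bs, const float *A, const float *x, float *y)
{
    /* x and y stay resident for the whole computation. */
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    for (int i0 = 0; i0 < n; i0 += bs) {
        int rows = (i0 + bs > n) ? n - i0 : bs;

        /* Copy just this panel of A to the device, compute its rows of y. */
        #pragma acc parallel loop copyin(A[i0*n : rows*n])
        for (int i = 0; i < rows; i++) {
            float sum = 0.0f;
            #pragma acc loop reduction(+:sum)
            for (int j = 0; j < n; j++)
                sum += A[(size_t)(i0 + i) * n + j] * x[j];
            y[i0 + i] = sum;
        }
    }
}
```

The same panel-at-a-time pattern applies to a solver: each kernel launch only touches a sub-set of the data that fits in device memory, at the cost of re-transferring panels across iterations.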

  • Mat