CUDA samples: 6_Advanced/conjugateGradientMultiDeviceCG can't run with a large matrix?

Host machine GPU: DGX-1 (32GB*8)
OS: Ubuntu 20.04
CUDA version: 11.4

I tried the CUDA samples: 6_Advanced/conjugateGradientMultiDeviceCG in two docker environments:

  1. CUDA10.1_Samples in 10.1-devel-centos7,
  2. CUDA11.4_Samples in 11.4.2-devel-ubuntu20.04.

Both produced the same results:
I. The original code ran successfully.
II. After I changed the matrix dimension from (int N = 10485760 * 2) to (long long int N = 2,200,000,000 or larger), the error message “Segmentation fault (core dumped)” appeared.
(I made this modification because I want the matrix to be large enough to exceed a single GPU’s memory (32GB).)


  1. Why does situation II produce this error message?
  2. How can I show that the program runs when the matrix exceeds at least one GPU’s memory (32GB)?

If the only thing you did was change the type of N, that could not possibly work. Let’s look at the code:

int main(int argc, char **argv) {
  constexpr size_t kNumGpusRequired = 2;
  int N = 0, nz = 0, *I = NULL, *J = NULL;
  /* Generate a random tridiagonal symmetric matrix in CSR format */
  N = 10485760 * 2;
  nz = (N - 2) * 3 + 4;

I think if you study the above code, you may spot a problem if the only type you modify is for N.

A segmentation fault is always a host code issue, and it can always be localized to the single line of source code that caused it. Neither of these statements has anything to do with CUDA, and the fact that this is a CUDA code does not change them. You may wish to learn how to isolate a seg fault to the specific line of source code that caused it; I generally find that a useful first step in understanding its cause. Possible isolation methods include divide-and-conquer using printf, or the use of a (host code) debugger such as gdb. I’m sure there are other methods as well.

Thank you for the response.

I forgot to mention that I also changed int to long long int for nz.

In my experience, the segmentation fault always happens on the line with cudaMallocManaged().
I suppose the matrix is too large to store in memory.
Since host memory is usually at least twice as large as device memory,
I guess the error is caused by inadequate device memory.
But in this case, CPU memory is 512GB and GPU memory is 32GB * 8 = 256GB.
Why does this error message still appear?

In my experience, cudaMallocManaged never causes a seg fault. Running out of memory does not produce a seg fault on the cudaMallocManaged call itself.

But we don’t have to depend on either your experience or my experience. If you want to find out the reason for the seg fault, my suggestion is that you first localize the seg fault to a specific line of code. I’ve already given suggestions for that.

Thank you for the guidance.
The problem was solved after I changed almost every “int” to “long long int”,
and a test with N = 6,000,000,000 passed.

Sorry, my recollection of that experience was wrong.
I encountered the problem below:
I wrote my own program and ran it on a Dell G3 3579 (OS: CentOS 7, GPU: GTX 1050 4GB).


int N = 600*600;

542 int A_size = N * N;
543 int x_size = N;
544 int b_size = N;
545 cudaMallocManaged(&A, A_size * sizeof(double));
546 cudaMallocManaged(&x, x_size * sizeof(double));
547 cudaMallocManaged(&b, b_size * sizeof(double));
548 Initialization_double<<<BLOCKS, THREADS>>>(A_size, A);
549 checkCudaErr(cudaDeviceSynchronize());
550 Initialization_double<<<BLOCKS, THREADS>>>(x_size, x);
551 checkCudaErr(cudaDeviceSynchronize());
552 Initialization_double<<<BLOCKS, THREADS>>>(b_size, b);
553 checkCudaErr(cudaDeviceSynchronize());
555 Give_A_value_arbitrary_size<<<BLOCKS, THREADS>>>(Num, p, list_size, list, cal_method, A);
556 checkCudaErr(cudaDeviceSynchronize());

I got the below two messages:

  1. warning: Cuda API error detected: cudaMallocManaged returned (0x2) -------> from line 545
  2. Thread 1 received signal CUDA_EXCEPTION_14, Warp Illegal Address.
    0x0000000000fdcd30 in Give_A_value_arbitrary_size<<<(32,1,1),(256,1,1)>>> ---------> from the inside of function at line 555

But when N was only a few thousand, there was no problem at all.
So I thought the problem was caused by a large N exceeding GPU memory.
Could you help me with this problem?

Do you believe that is a good idea?

N = 600x600 = 360,000

what happens when you then do this:

int A_size = N * N;

will that number fit in an int quantity?

(hint: it will not)

This is an issue associated with C/C++ programming, not really related to CUDA at all.

By the way, that number is around 129 billion. Even if you “fix” the above issue by converting A_size to, e.g., a long long int, you will likely run into trouble here:

545 cudaMallocManaged(&A, A_size * sizeof(double));

sizeof(double) is 8, so you are effectively asking for a ~1 terabyte allocation. I really doubt that is going to be successful. You’re headed in the wrong direction here, and I won’t be able to help any further if you believe that the right thing to do is to try to allocate 1TB.


So the size of a matrix that can be fully stored is quite limited (compared to my needs).
Yes, I changed direction, and it has worked well so far.
Thank you very much for helping me so much.