cusolverDnSgetrf() fails on A100 (but not on A10) when called in a tight loop

I am new to CUDA programming. I am doing some experiments with the cuSOLVER Lapack APIs. One sample program calls cusolverDnSgetrf() and cusolverDnSgetrs() in an infinite loop. This code runs fine on A10. However, on A100 (A100-SXM4-40GB) it runs fine for a few iterations and then the cudaDeviceSynchronize() called after cusolverDnSgetrf() fails with error code 700 (cudaErrorIllegalAddress).

  • For matrix size 10000, the failure happens in the 72nd iteration.

  • For matrix size 20000, the failure happens in the 49th iteration.

  • For matrix size 30000, the failure happens in the 42nd iteration.

  • For matrix size 40000, the failure happens in the 39th iteration.

  • Ran matrix size 100 for ~50000 iterations. No failures.

  • Ran matrix size 1000 for ~5000 iterations. No failures.

  • Ran matrix size 5000 for ~1000 iterations. No failures.

Tried cuda-memcheck on the A100. There were 112 errors, all saying “Invalid global read of size 4” in getrf3_swap_alg1(). Some of the error messages are given below.

Since I use an infinite loop, nothing was reported in A10 where the program has no failures. I changed the code to execute for only 100 iterations. cuda-memcheck did not report any errors on A10.

The code looks something like this

    cuda_stat = cudaMallocManaged(&a, (n * n * sizeof(*a)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&b, (n * sizeof (*b)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&x, (n * sizeof (*x)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&pivot, (n * sizeof (*pivot)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&info, sizeof (*info));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cusolver_stat = cusolverDnCreate(&handle);

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER initialization failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

.  .  .
.  .  .
    cusolver_stat = cusolverDnSgetrf_bufferSize(handle, n, n, a, lda, &wspace_size);

    cuda_stat = cudaDeviceSynchronize();
    CUDA_CHECK(cuda_stat, "cudaDeviceSynchronize");

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER dense getrf_bufferSize failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

    cuda_stat = cudaMallocManaged(&wspace, (wspace_size * sizeof (*wspace)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

i = 1;

while (1) {
printf("loop=%d\n", i); fflush(NULL);i++;

    cusolver_stat = cusolverDnSgetrf(handle, n, n, a, lda, wspace, pivot, info);

    cuda_stat = cudaDeviceSynchronize();
    CUDA_CHECK(cuda_stat, "cudaDeviceSynchronize");

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER dense getrf failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

    if (*info != 0) {
        printf ("cuSOLVER dense getrf finished but set error info to %d\n", *info);
        exit(-1);
    }   

    cublasOperation_t trans = CUBLAS_OP_N;
    int nrhs = 1;
    int ldb = n;

    cusolver_stat = cusolverDnSgetrs(handle, trans, n, nrhs, a, lda, pivot, b, ldb, info);

    cuda_stat = cudaDeviceSynchronize();
    CUDA_CHECK(cuda_stat, "cudaDeviceSynchronize");

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER dense getrs failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

    if (*info != 0) {
        printf ("cuSOLVER dense getrs finished but set error info to %d\n", *info);
        exit(-1);
    }

} /* while (1) */

The input matrices are initialized only once, before the infinite loop. So after the first iteration, whatever is inside the matrices are passed as an input in the subsequent iterations. This is not a problem for my program as it does not do any data validations. I use this program to see its effect on the GPU power.

I thought that perhaps I needed cudaDeviceSynchronize() after the calls to cudaMallocManaged(). So I added it in two places : once after the first set of cudaMallocManaged() calls and again after the cudaMallocManaged() for wspace. No change in behaviour.

All 112 cuda-memcheck errors on A100 seems to happen in the same code location. Given below are 3 errors :

========= Invalid __global__ read of size 4
=========     at 0x000007f0 in void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) 
=========     by thread (15,0,0) in block (39,0,0)
=========     Address 0x7f9089d8203c is out of bounds
=========     Device Frame:void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) (void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) : 0x7f0)
=========     Saved host backtrace up to driver entry point at kernel launch time 
=========     Host Frame:/lib64/libcuda.so [0x20d6ea]
=========     Host Frame:./nvgpu_lapack [0x20af0b]
=========     Host Frame:./nvgpu_lapack [0x261548]
=========     Host Frame:./nvgpu_lapack [0x5f524]
=========     Host Frame:./nvgpu_lapack [0x611d1]
=========     Host Frame:./nvgpu_lapack [0x52fb3]
=========     Host Frame:./nvgpu_lapack [0x4bc30]
=========     Host Frame:./nvgpu_lapack [0x4d141]
=========     Host Frame:./nvgpu_lapack [0xd64f]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./nvgpu_lapack [0x4b2b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x000007f0 in void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) 
=========     by thread (14,0,0) in block (39,0,0)
=========     Address 0x7f9089d82038 is out of bounds
=========     Device Frame:void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) (void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) : 0x7f0)
=========     Saved host backtrace up to driver entry point at kernel launch time 
=========     Host Frame:/lib64/libcuda.so [0x20d6ea]
=========     Host Frame:./nvgpu_lapack [0x20af0b]
=========     Host Frame:./nvgpu_lapack [0x261548]
=========     Host Frame:./nvgpu_lapack [0x5f524]
=========     Host Frame:./nvgpu_lapack [0x611d1]
=========     Host Frame:./nvgpu_lapack [0x52fb3]
=========     Host Frame:./nvgpu_lapack [0x4bc30]
=========     Host Frame:./nvgpu_lapack [0x4d141]
=========     Host Frame:./nvgpu_lapack [0xd64f]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./nvgpu_lapack [0x4b2b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x000007f0 in void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int)
=========     by thread (13,0,0) in block (39,0,0)
=========     Address 0x7f9089d82034 is out of bounds
=========     Device Frame:void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) (void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) : 0x7f0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so [0x20d6ea]
=========     Host Frame:./nvgpu_lapack [0x20af0b]
=========     Host Frame:./nvgpu_lapack [0x261548]
=========     Host Frame:./nvgpu_lapack [0x5f524]
=========     Host Frame:./nvgpu_lapack [0x611d1]
=========     Host Frame:./nvgpu_lapack [0x52fb3]
=========     Host Frame:./nvgpu_lapack [0x4bc30]
=========     Host Frame:./nvgpu_lapack [0x4d141]
=========     Host Frame:./nvgpu_lapack [0xd64f]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./nvgpu_lapack [0x4b2b9]
=========

I would like to understand

  • what is causing the failure on A100 (A100-SXM4-40GB). The addresses
    passed to the APIs do not change between iterations
  • why this program runs fine on A10 but fails on A100
  • why it is able to run fine for some time on A100 before the failure
  • why it does not (seem to) fail on A100 for matrix sizes smaller
    than 10000.

Thanks

Attaching the full output of the cuda-memcheck tool on the A100.
cudamemchk_errs_1.txt (127.3 KB)

I am using CUDA 11.4 with gcc9.

Have you checked result in info after each step?

I had that check for the getrs call but had not made it a part of the infinite loop. I had forgotten this check for the getrf call. I have added it now and included both checks in the infinite loop. I have edited the code in the first post.

No change in behaviour : program runs fine on A10 but fails on A100.

Okay thanks for checking.
Can you file a bug and provide a complete reproducer?

Will file a bug. However, there is no file attach feature in the bug-filing page. So I am attaching the sample program here. Will provide this link in the bug description.

sp_lin_eq1.cu (3.8 KB)

The program takes one argument, the matrix size. If not provided, 10000 will be assumed. I have tested with 10000, 20000, 30000 and 40000.

The following is seen on A100 :

  • For matrix size 10000, the failure happens in the 75th iteration.
  • For matrix size 20000, the failure happens in the 51st iteration.
  • For matrix size 30000, the failure happens in the 43rd iteration.
  • For matrix size 40000, the failure happens in the 39th iteration.

No failures seen on A10. Have ran with matrix size 40000 on A10. It ran for more than 3.5 hours with no failures and completed over 3700 iterations.

The compilation command is

$ nvcc -m64 -ccbin <gcc_path> -I <cuda_include_path> -I <math_libs_include_path> sp_lin_eq1.cu -o splineq1 -L <math_libs_path> -L <cuda_libs_path> -Xlinker -Bstatic -lcusolver_static -lcublas_static -lculibos -lcudart_static -lcublasLt_static -Xlinker -Bdynamic -ldl -lpthread -lrt

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
$

Have filed the following bug

cudaErrorIllegalAddress reported after a call to cusolverDnSgetrf() on A100 (but not on A10) when executed in a tight loop
https://developer.nvidia.com/nvidia_bug/3496716

Thanks, in the meantime, do you mind trying to implement with our 64-bit API?

Does this mean that cusolverDnSgetrf() and cusolverDnDgetrf() contain 32-bit code?

What are the other differences between the legacy and the 64-bit APIs?

Thanks

The allow for much larger matrix sizes and are generic typed.

Similar behaviour with the 64-bit APIs too i.e failures on A100 only. However, there are two types of failures. One is the cudaErrorIllegalAddress. In the other case cusolverDnXgetrf() sets info to a non-zero value. The later happens more.

No failures seen so far on A10.

Given below are the error messages and the loop iteration in which the failure happens. The messages are from the program.

Note that the loop numbers are the same as the previous sample program.

CUSOLVER_ALG_1 (Legacy algo)
10000 : loop 75 : cusolverDnXgetrf() finished but set error info to 5884
20000 : loop 51 : sp_lin_eq2.cu:133 cudaDeviceSynchronize failed with error code 700; (cudaErrorIllegalAddress) an illegal memory access was encountered
30000 : loop 43 : cusolverDnXgetrf() finished but set error info to 25178
40000 : loop 39 : sp_lin_eq2.cu:133 cudaDeviceSynchronize failed with error code 700; (cudaErrorIllegalAddress) an illegal memory access was encountered

CUSOLVER_ALG_0 (New algo)
10000 : loop 75 : cusolverDnXgetrf() finished but set error info to 5887
20000 : loop 51 : cusolverDnXgetrf() finished but set error info to 15532
30000 : loop 43 : cusolverDnXgetrf() finished but set error info to 25010
40000 : loop 39 : cusolverDnXgetrf() finished but set error info to 35067

The new sample program is attached.
sp_lin_eq2.cu (4.9 KB)

The documentation says the following

If LU factorization failed, i.e. matrix A (U) is singular, The output parameter info=i indicates U(i,i) = 0. 

====================================================

The Xgetrf example in

CUDALibrarySamples/cusolver_Xgetrf_example.cu at master · NVIDIA/CUDALibrarySamples · GitHub

is not allocating space for the host workspace (h_work). It is passed as nullptr to Xgetrf. Is that ok? In my sample program I allocated space for it.

====================================================

You might want to ask your documentation team to take a closer look at the cuSOLVER doc.

  • In many places the word “below” is spelt “bellow”.
  • In section 2.4.4.3. cusolverDnXgetrf(), it says that workspaceInBytesOnDevice and workspaceInBytesOnHost are values returned by cusolverDnXpotrf_bufferSize instead of cusolverDnXgetrf_bufferSize.

Thanks

Thanks for checking. I’ll add this to bug report.

The Xgetrf example in

CUDALibrarySamples/cusolver_Xgetrf_example.cu at master · NVIDIA/CUDALibrarySamples · GitHub

is not allocating space for the host workspace (h_work). It is passed as nullptr to Xgetrf. Is that ok? In my sample program I allocated space for it.

I’ll double check and update the example.

I tried matrix size 50000 with Xgetrf. I see the following on A10 and A100.

# ./splineq2_legacy 50000
Matrix size = 50000
loop=1
cusolverDnXgetrf() finished but set error info to 1
# 
# ./splineq2_new 50000   
Matrix size = 50000
loop=1
cusolverDnXgetrf() finished but set error info to 1
# 

So, the program is able to allocate the required memory but sets info to 1.

With cuSolverDnSgetrf(), I had seen

# ./splineq1 50000
Matrix size = 50000
sp_lin_eq1.cu:45 cudaMallocManaged failed with error code 2; (cudaErrorMemoryAllocation) out of memory
# 

The memory allocation code is identical in both sample programs. So how come cudaMallocManaged() does not fail in the one using Xgetrf?

==============================================

Section 2.4.2.6. cusolverDngetrf() says

Remark: getrf uses fastest implementation with large workspace of size m*n. 
The user can choose the legacy implementation with minimal workspace by
Getrf and cusolverDnSetAdvOptions(params, CUSOLVERDN_GETRF, 
CUSOLVER_ALG_1). 

Section 2.4.4.3. cusolverDnXgetrf() says

Table of algorithms supported by cusolverDnXgetrf
CUSOLVER_ALG_0 or NULL 	Default algorithm. The fastest, requires a large workspace of m*n elements.
CUSOLVER_ALG_1 	        Legacy implementation

The default algorithm used in both APIs seems to be the same : a fast algo that uses a larger work space. Is that correct? If yes, is using cusolverDnSgetrf() the same as using cusolverDnXgetrf() with CUSOLVER_ALG_0 and CUDA_R_32F data type, in terms of performance and accuracy?

What is the legacy implementation mentioned in both sections?

Thanks

The error is telling you there’s an issue with the math. Such a large matrix may require double precision or maybe something is wrong with the input matrix???

The new algo manages memory differently.

I think it means that the diagonal element at (1,1) is 0. Why does this condition not arise in A10? The sample program uses a constant as the seed for the random numbers generated to be used as input. So the same random numbers would have been used on both A10 and A100. The sample program has been run on A10 for 3.5 hrs too. info was not set to a non-zero value in that time.

Thanks

The failure is seen on A40 too. Same sample program (sp_lin_eq1.cu). Failure symptoms are similar to A100.

The following is seen on A40 :

For matrix size 10000, the failure happens in the 75th iteration.
For matrix size 20000, the failure happens in the 51st iteration.
For matrix size 30000, the failure happens in the 43rd iteration.
For matrix size 40000, the failure happens in the 39th iteration.

The failing loops were the same on A100 too.

The failure is seen on A100 (Ampere 8.0) and A40 (Ampere 8.6) but not on A10 which is also Ampere 8.6.