cusolverDnSgetrf() fails on A100 (but not on A10) when called in a tight loop

karthikeyan_d · January 10, 2022, 4:06am

I am new to CUDA programming. I am doing some experiments with the cuSOLVER Lapack APIs. One sample program calls cusolverDnSgetrf() and cusolverDnSgetrs() in an infinite loop. This code runs fine on A10. However, on A100 (A100-SXM4-40GB) it runs fine for a few iterations and then the cudaDeviceSynchronize() called after cusolverDnSgetrf() fails with error code 700 (cudaErrorIllegalAddress).

For matrix size 10000, the failure happens in the 72nd iteration.
For matrix size 20000, the failure happens in the 49th iteration.
For matrix size 30000, the failure happens in the 42nd iteration.
For matrix size 40000, the failure happens in the 39th iteration.
Ran matrix size 100 for ~50000 iterations. No failures.
Ran matrix size 1000 for ~5000 iterations. No failures.
Ran matrix size 5000 for ~1000 iterations. No failures.

Tried cuda-memcheck on the A100. There were 112 errors, all saying “Invalid global read of size 4” in getrf3_swap_alg1(). Some of the error messages are given below.

Since I use an infinite loop, nothing was reported in A10 where the program has no failures. I changed the code to execute for only 100 iterations. cuda-memcheck did not report any errors on A10.

The code looks something like this

    cuda_stat = cudaMallocManaged(&a, (n * n * sizeof(*a)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&b, (n * sizeof (*b)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&x, (n * sizeof (*x)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&pivot, (n * sizeof (*pivot)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cuda_stat = cudaMallocManaged(&info, sizeof (*info));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

    cusolver_stat = cusolverDnCreate(&handle);

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER initialization failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

.  .  .
.  .  .
    cusolver_stat = cusolverDnSgetrf_bufferSize(handle, n, n, a, lda, &wspace_size);

    cuda_stat = cudaDeviceSynchronize();
    CUDA_CHECK(cuda_stat, "cudaDeviceSynchronize");

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER dense getrf_bufferSize failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

    cuda_stat = cudaMallocManaged(&wspace, (wspace_size * sizeof (*wspace)));
    CUDA_CHECK(cuda_stat, "cudaMallocManaged");

i = 1;

while (1) {
printf("loop=%d\n", i); fflush(NULL);i++;

    cusolver_stat = cusolverDnSgetrf(handle, n, n, a, lda, wspace, pivot, info);

    cuda_stat = cudaDeviceSynchronize();
    CUDA_CHECK(cuda_stat, "cudaDeviceSynchronize");

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER dense getrf failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

    if (*info != 0) {
        printf ("cuSOLVER dense getrf finished but set error info to %d\n", *info);
        exit(-1);
    }   

    cublasOperation_t trans = CUBLAS_OP_N;
    int nrhs = 1;
    int ldb = n;

    cusolver_stat = cusolverDnSgetrs(handle, trans, n, nrhs, a, lda, pivot, b, ldb, info);

    cuda_stat = cudaDeviceSynchronize();
    CUDA_CHECK(cuda_stat, "cudaDeviceSynchronize");

    if (cusolver_stat != CUSOLVER_STATUS_SUCCESS) {
        printf ("cuSOLVER dense getrs failed with error code %d\n", cusolver_stat);
        exit(-1);
    }

    if (*info != 0) {
        printf ("cuSOLVER dense getrs finished but set error info to %d\n", *info);
        exit(-1);
    }

} /* while (1) */

The input matrices are initialized only once, before the infinite loop. So after the first iteration, whatever is inside the matrices are passed as an input in the subsequent iterations. This is not a problem for my program as it does not do any data validations. I use this program to see its effect on the GPU power.

I thought that perhaps I needed cudaDeviceSynchronize() after the calls to cudaMallocManaged(). So I added it in two places : once after the first set of cudaMallocManaged() calls and again after the cudaMallocManaged() for wspace. No change in behaviour.

All 112 cuda-memcheck errors on A100 seems to happen in the same code location. Given below are 3 errors :

========= Invalid __global__ read of size 4
=========     at 0x000007f0 in void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) 
=========     by thread (15,0,0) in block (39,0,0)
=========     Address 0x7f9089d8203c is out of bounds
=========     Device Frame:void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) (void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) : 0x7f0)
=========     Saved host backtrace up to driver entry point at kernel launch time 
=========     Host Frame:/lib64/libcuda.so [0x20d6ea]
=========     Host Frame:./nvgpu_lapack [0x20af0b]
=========     Host Frame:./nvgpu_lapack [0x261548]
=========     Host Frame:./nvgpu_lapack [0x5f524]
=========     Host Frame:./nvgpu_lapack [0x611d1]
=========     Host Frame:./nvgpu_lapack [0x52fb3]
=========     Host Frame:./nvgpu_lapack [0x4bc30]
=========     Host Frame:./nvgpu_lapack [0x4d141]
=========     Host Frame:./nvgpu_lapack [0xd64f]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./nvgpu_lapack [0x4b2b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x000007f0 in void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) 
=========     by thread (14,0,0) in block (39,0,0)
=========     Address 0x7f9089d82038 is out of bounds
=========     Device Frame:void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) (void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) : 0x7f0)
=========     Saved host backtrace up to driver entry point at kernel launch time 
=========     Host Frame:/lib64/libcuda.so [0x20d6ea]
=========     Host Frame:./nvgpu_lapack [0x20af0b]
=========     Host Frame:./nvgpu_lapack [0x261548]
=========     Host Frame:./nvgpu_lapack [0x5f524]
=========     Host Frame:./nvgpu_lapack [0x611d1]
=========     Host Frame:./nvgpu_lapack [0x52fb3]
=========     Host Frame:./nvgpu_lapack [0x4bc30]
=========     Host Frame:./nvgpu_lapack [0x4d141]
=========     Host Frame:./nvgpu_lapack [0xd64f]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./nvgpu_lapack [0x4b2b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x000007f0 in void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int)
=========     by thread (13,0,0) in block (39,0,0)
=========     Address 0x7f9089d82034 is out of bounds
=========     Device Frame:void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) (void getrf3_swap_alg1<float, int=256>(int, int, int, float*, int, int const *, int, int) : 0x7f0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so [0x20d6ea]
=========     Host Frame:./nvgpu_lapack [0x20af0b]
=========     Host Frame:./nvgpu_lapack [0x261548]
=========     Host Frame:./nvgpu_lapack [0x5f524]
=========     Host Frame:./nvgpu_lapack [0x611d1]
=========     Host Frame:./nvgpu_lapack [0x52fb3]
=========     Host Frame:./nvgpu_lapack [0x4bc30]
=========     Host Frame:./nvgpu_lapack [0x4d141]
=========     Host Frame:./nvgpu_lapack [0xd64f]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./nvgpu_lapack [0x4b2b9]
=========

I would like to understand

what is causing the failure on A100 (A100-SXM4-40GB). The addresses
passed to the APIs do not change between iterations
why this program runs fine on A10 but fails on A100
why it is able to run fine for some time on A100 before the failure
why it does not (seem to) fail on A100 for matrix sizes smaller
than 10000.

Thanks

karthikeyan_d · January 10, 2022, 4:11am

Attaching the full output of the cuda-memcheck tool on the A100.
cudamemchk_errs_1.txt (127.3 KB)

I am using CUDA 11.4 with gcc9.

mnicely · January 10, 2022, 12:20pm

Have you checked result in info after each step?

github.com

NVIDIA/CUDALibrarySamples/blob/master/cuSOLVER/getrf/cusolver_getrf_example.cu

/*
 * Copyright 2020 NVIDIA Corporation.  All rights reserved.
 *
 * NOTICE TO LICENSEE:
 *
 * This source code and/or documentation ("Licensed Deliverables") are
 * subject to NVIDIA intellectual property rights under U.S. and
 * international Copyright laws.
 *
 * These Licensed Deliverables contained herein is PROPRIETARY and
 * CONFIDENTIAL to NVIDIA and is being provided under the terms and
 * conditions of a form of NVIDIA software license agreement by and
 * between NVIDIA and Licensee ("License Agreement") or electronically
 * accepted by Licensee.  Notwithstanding any terms or conditions to
 * the contrary in the License Agreement, reproduction or disclosure
 * of the Licensed Deliverables to any third party without the express
 * written consent of NVIDIA is prohibited.
 *
 * NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
 * LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE

This file has been truncated. show original

karthikeyan_d · January 10, 2022, 12:49pm

I had that check for the getrs call but had not made it a part of the infinite loop. I had forgotten this check for the getrf call. I have added it now and included both checks in the infinite loop. I have edited the code in the first post.

No change in behaviour : program runs fine on A10 but fails on A100.

mnicely · January 10, 2022, 3:04pm

Okay thanks for checking.
Can you file a bug and provide a complete reproducer?

karthikeyan_d · January 11, 2022, 7:16am

Will file a bug. However, there is no file attach feature in the bug-filing page. So I am attaching the sample program here. Will provide this link in the bug description.

sp_lin_eq1.cu (3.8 KB)

The program takes one argument, the matrix size. If not provided, 10000 will be assumed. I have tested with 10000, 20000, 30000 and 40000.

The following is seen on A100 :

For matrix size 10000, the failure happens in the 75th iteration.
For matrix size 20000, the failure happens in the 51st iteration.
For matrix size 30000, the failure happens in the 43rd iteration.
For matrix size 40000, the failure happens in the 39th iteration.

No failures seen on A10. Have ran with matrix size 40000 on A10. It ran for more than 3.5 hours with no failures and completed over 3700 iterations.

The compilation command is

$ nvcc -m64 -ccbin <gcc_path> -I <cuda_include_path> -I <math_libs_include_path> sp_lin_eq1.cu -o splineq1 -L <math_libs_path> -L <cuda_libs_path> -Xlinker -Bstatic -lcusolver_static -lcublas_static -lculibos -lcudart_static -lcublasLt_static -Xlinker -Bdynamic -ldl -lpthread -lrt

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
$

karthikeyan_d · January 11, 2022, 7:31am

Have filed the following bug

cudaErrorIllegalAddress reported after a call to cusolverDnSgetrf() on A100 (but not on A10) when executed in a tight loop
https://developer.nvidia.com/nvidia_bug/3496716

mnicely · January 11, 2022, 9:54pm

Thanks, in the meantime, do you mind trying to implement with our 64-bit API?

karthikeyan_d · January 12, 2022, 1:19am

Does this mean that cusolverDnSgetrf() and cusolverDnDgetrf() contain 32-bit code?

What are the other differences between the legacy and the 64-bit APIs?

Thanks

mnicely · January 12, 2022, 1:52am

The allow for much larger matrix sizes and are generic typed.

karthikeyan_d · January 12, 2022, 3:25am

Similar behaviour with the 64-bit APIs too i.e failures on A100 only. However, there are two types of failures. One is the cudaErrorIllegalAddress. In the other case cusolverDnXgetrf() sets info to a non-zero value. The later happens more.

No failures seen so far on A10.

Given below are the error messages and the loop iteration in which the failure happens. The messages are from the program.

Note that the loop numbers are the same as the previous sample program.

CUSOLVER_ALG_1 (Legacy algo)
10000 : loop 75 : cusolverDnXgetrf() finished but set error info to 5884
20000 : loop 51 : sp_lin_eq2.cu:133 cudaDeviceSynchronize failed with error code 700; (cudaErrorIllegalAddress) an illegal memory access was encountered
30000 : loop 43 : cusolverDnXgetrf() finished but set error info to 25178
40000 : loop 39 : sp_lin_eq2.cu:133 cudaDeviceSynchronize failed with error code 700; (cudaErrorIllegalAddress) an illegal memory access was encountered

CUSOLVER_ALG_0 (New algo)
10000 : loop 75 : cusolverDnXgetrf() finished but set error info to 5887
20000 : loop 51 : cusolverDnXgetrf() finished but set error info to 15532
30000 : loop 43 : cusolverDnXgetrf() finished but set error info to 25010
40000 : loop 39 : cusolverDnXgetrf() finished but set error info to 35067

The new sample program is attached.
sp_lin_eq2.cu (4.9 KB)

The documentation says the following

If LU factorization failed, i.e. matrix A (U) is singular, The output parameter info=i indicates U(i,i) = 0.

====================================================

The Xgetrf example in

CUDALibrarySamples/cusolver_Xgetrf_example.cu at master · NVIDIA/CUDALibrarySamples · GitHub

is not allocating space for the host workspace (h_work). It is passed as nullptr to Xgetrf. Is that ok? In my sample program I allocated space for it.

====================================================

You might want to ask your documentation team to take a closer look at the cuSOLVER doc.

In many places the word “below” is spelt “bellow”.
In section 2.4.4.3. cusolverDnXgetrf(), it says that workspaceInBytesOnDevice and workspaceInBytesOnHost are values returned by cusolverDnXpotrf_bufferSize instead of cusolverDnXgetrf_bufferSize.

Thanks

mnicely · January 12, 2022, 3:45am

Thanks for checking. I’ll add this to bug report.

The Xgetrf example in

CUDALibrarySamples/cusolver_Xgetrf_example.cu at master · NVIDIA/CUDALibrarySamples · GitHub

is not allocating space for the host workspace (h_work). It is passed as nullptr to Xgetrf. Is that ok? In my sample program I allocated space for it.

I’ll double check and update the example.

karthikeyan_d · January 12, 2022, 5:22am

I tried matrix size 50000 with Xgetrf. I see the following on A10 and A100.

# ./splineq2_legacy 50000
Matrix size = 50000
loop=1
cusolverDnXgetrf() finished but set error info to 1
# 
# ./splineq2_new 50000   
Matrix size = 50000
loop=1
cusolverDnXgetrf() finished but set error info to 1
#

So, the program is able to allocate the required memory but sets info to 1.

With cuSolverDnSgetrf(), I had seen

# ./splineq1 50000
Matrix size = 50000
sp_lin_eq1.cu:45 cudaMallocManaged failed with error code 2; (cudaErrorMemoryAllocation) out of memory
#

The memory allocation code is identical in both sample programs. So how come cudaMallocManaged() does not fail in the one using Xgetrf?

==============================================

Section 2.4.2.6. cusolverDngetrf() says

Remark: getrf uses fastest implementation with large workspace of size m*n. 
The user can choose the legacy implementation with minimal workspace by
Getrf and cusolverDnSetAdvOptions(params, CUSOLVERDN_GETRF, 
CUSOLVER_ALG_1).

Section 2.4.4.3. cusolverDnXgetrf() says

Table of algorithms supported by cusolverDnXgetrf
CUSOLVER_ALG_0 or NULL 	Default algorithm. The fastest, requires a large workspace of m*n elements.
CUSOLVER_ALG_1 	        Legacy implementation

The default algorithm used in both APIs seems to be the same : a fast algo that uses a larger work space. Is that correct? If yes, is using cusolverDnSgetrf() the same as using cusolverDnXgetrf() with CUSOLVER_ALG_0 and CUDA_R_32F data type, in terms of performance and accuracy?

What is the legacy implementation mentioned in both sections?

Thanks

mnicely · January 12, 2022, 3:53pm

The error is telling you there’s an issue with the math. Such a large matrix may require double precision or maybe something is wrong with the input matrix???

The new algo manages memory differently.

karthikeyan_d · January 28, 2022, 1:04am

I think it means that the diagonal element at (1,1) is 0. Why does this condition not arise in A10? The sample program uses a constant as the seed for the random numbers generated to be used as input. So the same random numbers would have been used on both A10 and A100. The sample program has been run on A10 for 3.5 hrs too. info was not set to a non-zero value in that time.

Thanks

karthikeyan_d · February 23, 2022, 5:49am

The failure is seen on A40 too. Same sample program (sp_lin_eq1.cu). Failure symptoms are similar to A100.

The following is seen on A40 :

For matrix size 10000, the failure happens in the 75th iteration.
For matrix size 20000, the failure happens in the 51st iteration.
For matrix size 30000, the failure happens in the 43rd iteration.
For matrix size 40000, the failure happens in the 39th iteration.

The failing loops were the same on A100 too.

The failure is seen on A100 (Ampere 8.0) and A40 (Ampere 8.6) but not on A10 which is also Ampere 8.6.

Topic		Replies	Views
Warning: Cuda API error detected: cuModuleLoadFatBinary returned (0xd1) CUDA-GDB	25	1100	July 30, 2024
Reading R8G8B8A8 texture using tex2D() causes strange result. CUDA Programming and Performance	27	2880	April 28, 2018
Questions about cuFFT for 3D matrix, arrayFire GPU-Accelerated Libraries	5	1664	October 12, 2021
Wrong output of Matrix Inversion using cuSOLVER GPU-Accelerated Libraries cuda	7	98	July 22, 2024
Program hit cudaErrorInvalidValue (error 1) due to "invalid argument" on CUDA API call to cudaMemsetAsync CUDA Programming and Performance	7	7598	January 11, 2020
NVProf error on samples CUDA Programming and Performance	28	20459	December 29, 2020
New Features in CUDA 7.5 Technical Blog	66	1089	August 10, 2016
Memory space used by cusolverDN_getrf GPU-Accelerated Libraries cusolver	9	1751	July 9, 2021
"A Unwinder should return gdb.UnwindInfo instance." error with NVIDIA Nsight Visual Studio Code Edition Nsight Visual Studio Code Edition nsight , vscode , cuda-gdb	11	998	July 5, 2023
CUDA very slow performance CUDA Programming and Performance	21	16743	March 6, 2020

cusolverDnSgetrf() fails on A100 (but not on A10) when called in a tight loop

Related topics