Nvfortran with the -C or -Mbounds flag does not check out-of-bounds memory accesses

Short description

When compiling a program with nvfortran using the -C -g -traceback or -Mbounds -g -traceback flags, the resulting executable does not stop at out-of-bounds memory accesses, leading to the expected weird memory bugs that are hard to track down. This happens in both CPU and GPU kernels, and no warning is printed.

When compiling the same code with ifort and the -C -g -traceback flags, the executable stops and prints the exact line where the out-of-bounds access was encountered.

How can I accomplish this with nvfortran? I thought it was a CUDA-specific problem; however, while preparing the examples for this question, I found out that nvfortran does the same thing on CPU code. Thus, it is not CUDA-specific.

Simple code examples

Below are two simple, crude test programs that try to access out-of-bounds memory in CPU and GPU code. I kept the GPU example separate so the CPU example can be tested with different compilers to compare their behavior.

testOutOfBounds.f90

module sizes

    integer, save :: size1
    integer, save :: size2

end module sizes

module arrays

    real, allocatable, save :: testArray1(:, :)
    real, allocatable, save :: testArray2(:, :)

end module arrays

subroutine testMemoryAccess
    use sizes
    use arrays

    implicit none

    real :: value

    ! deliberate out-of-bounds read: valid indices are 1:size1 and 1:size2
    value = testArray1(size1+1, size2+1)
    print *, 'value', value

end subroutine testMemoryAccess

Program testMemoryAccessOutOfBounds
    use sizes
    use arrays

    implicit none

    ! set sizes for the example
    size1 = 5000
    size2 = 2500

    allocate (testArray1(size1, size2))
    allocate (testArray2(size2, size1))
    testArray1 = 1.d0
    testArray2 = 2.d0

    call testMemoryAccess

end program testMemoryAccessOutOfBounds

testOutOfBoundsCuda.f90

module sizes

    integer, save :: size1
    integer, save :: size2

end module sizes

module sizesCuda

    integer, device, save :: size1
    integer, device, save :: size2

end module sizesCuda

module arrays

    real, allocatable, save :: testArray1(:, :)
    real, allocatable, save :: testArray2(:, :)

end module arrays

module arraysCuda

    real, allocatable, device, save :: testArray1(:, :)
    real, allocatable, device, save :: testArray2(:, :)

end module arraysCuda

module cudaKernels
    use cudafor
    use sizesCuda
    use arraysCuda

contains

    attributes(global) Subroutine testMemoryAccessCuda

        implicit none

        integer :: element

        real :: value

        element = (blockIdx%x - 1)*blockDim%x + threadIdx%x

        if (element.eq.1) then

            ! deliberate out-of-bounds read: valid indices are 1:size1 and 1:size2
            value = testArray1(size1+1, size2+1)
            print *, 'value', value

        end if

    end Subroutine testMemoryAccessCuda

end module cudaKernels

Program testMemoryAccessOutOfBounds
    use cudafor
    use cudaKernels
    use sizes
    use sizesCuda, size1_d => size1, size2_d => size2
    use arrays
    use arraysCuda, testArray1_d => testArray1, testArray2_d => testArray2

    implicit none

    integer :: istat

    ! set sizes for the example
    size1 = 5000
    size2 = 2500

    size1_d = size1
    size2_d = size2

    allocate (testArray1_d(size1, size2))
    allocate (testArray2_d(size2, size1))
    testArray1_d = 1.d0
    testArray2_d = 2.d0

    call testMemoryAccessCuda<<<64, 64>>>
    istat = cudadevicesynchronize()

end program testMemoryAccessOutOfBounds

Compilers used:

nvfortran

nvfortran 23.5-0 64-bit target on x86-64 Linux -tp zen2
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

ifort

ifort (IFORT) 2021.10.0 20230609
Copyright (C) 1985-2023 Intel Corporation.  All rights reserved.

Steps

nvfortran

I compile the examples as follows:

nvfortran -C -g -traceback -Mlarge_arrays -Mdclchk -cuda -gpu=cc86 testOutOfBounds.f90
nvfortran -C -g -traceback -Mlarge_arrays -Mdclchk -cuda -gpu=cc86 testOutOfBoundsCuda.f90

When running the CPU code, I get a value read from uninitialized memory:

value   1.5242136E-27

When running the GPU code, I get a zero value:

value    0.000000

ifort

I compile the CPU example as follows:

ifort -C -g -traceback testOutOfBounds.f90

and I get:

forrtl: severe (408): fort: (2): Subscript #2 of the array TESTARRAY1 has value 2501 which is greater than the upper bound of 2500

Image              PC                Routine            Line        Source
a.out              00000000004043D4  testmemoryaccess_          23  testOutOfBounds.f90
a.out              0000000000404FD6  MAIN__                     43  testOutOfBounds.f90
a.out              000000000040418D  Unknown               Unknown  Unknown
libc.so.6          00007F65A9229D90  Unknown               Unknown  Unknown
libc.so.6          00007F65A9229E40  __libc_start_main     Unknown  Unknown
a.out              00000000004040A5  Unknown               Unknown  Unknown

which is exactly the kind of output I expect nvfortran to produce as well.

Hi before_may,

Your first example works as expected when targeting the host. However, bounds checking is not supported in device code, and the -Mbounds/-C flags are disabled when GPU compiler flags are included, as indicated by the generated warning:

% nvfortran -C -g -traceback -Mlarge_arrays -Mdclchk -cuda -gpu=cc86 test_bounds1.f90
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds

Here’s the output from a host compilation:

% nvfortran -C -g -traceback -Mlarge_arrays -Mdclchk  test_bounds1.f90; a.out
0: Subscript out of range for array testarray1 (test_bounds1.f90: 23)
    subscript=5001, lower bound=1, upper bound=5000, dimension=1

Hope this clarifies things.

-Mat

Hi Mat,

I didn’t know that the CUDA flags also affect pure CPU code; apologies.
I am working on a big fluid dynamics solver that has both CPU and GPU kernels, so I use the same flags for all source files.
I guess I’ll have to use separate flags for different files (CPU-targeted and GPU-targeted) so I can at least leverage array bounds checking in the CPU kernels; a rough sketch of the split is below.
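
Something like this split should work (just a sketch; cpuPart.f90, gpuPart.f90, and app_name are placeholder names, and the flags are the ones from above):

# host-only sources: keep bounds checking enabled
nvfortran -c -C -g -traceback -Mlarge_arrays -Mdclchk cpuPart.f90
# CUDA Fortran sources: GPU flags (bounds checking is disabled for these anyway)
nvfortran -c -g -traceback -Mlarge_arrays -Mdclchk -cuda -gpu=cc86 gpuPart.f90
# link with the CUDA-aware driver
nvfortran -cuda -gpu=cc86 cpuPart.o gpuPart.o -o app_name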

However, my initial problem involves weird memory behavior on the GPU that leads to NaN errors, most probably due to out-of-bounds memory accesses in device code.
I have faced this problem in the past and was able to track the bugs down the old way, i.e. by commenting out code and seeing where the problem appears.
This debugging process is, unfortunately, very time-consuming. In addition, this time I cannot seem to find the out-of-bounds access that leads to the NaN errors.

Can I locate out-of-bounds accesses in a more efficient way?

My application uses both MPI and CUDA, so using the Nsight tools is not that straightforward.
(I use mpif90 rather than nvfortran to compile; the Nsight tools almost always assume non-MPI compilers, and I cannot seem to find a way to set them up with MPI.)

Try using the compute-sanitizer utility. It has the ability to check for memory issues.
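
For example, the default memcheck tool flags out-of-bounds device accesses; a minimal invocation would look like this (a.out stands in for your executable):

compute-sanitizer --tool memcheck ./a.out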

My application uses both MPI and CUDA, so using the Nsight tools is not that straightforward.

Which Nsight tool? Nsight-Systems can profile multiple ranks and can even trace MPI communication (via the “-t mpi” flag). It’s limited to a single node, but hopefully that’s still useful.
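
For instance, something like this traces both CUDA and MPI activity across the ranks (a sketch; the rank count and application name are placeholders):

nsys profile -t cuda,mpi mpirun -np 4 ./app_name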

For Nsight-Compute, it’s better to add a shell wrapper script around the mpirun launch and either have only one rank get profiled, or use separate file names for each rank.
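
For the single-profiled-rank case, something along these lines should work (just a sketch assuming OpenMPI; the script and report names are made up):

% cat ncu_wrapper.sh
#!/bin/bash
# profile only local rank 0 with Nsight-Compute; all other ranks run the application directly
if [ "$OMPI_COMM_WORLD_LOCAL_RANK" -eq 0 ]; then
    exec ncu -o report_rank0 "$@"
else
    exec "$@"
fi

and launch it as:

mpirun -np $numranks sh ncu_wrapper.sh app_name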

The problem is that in order to run my executable, I need to invoke mpirun or mpiexec, as in mpirun -np 1 app_name for a single-process run, since the main program starts with MPI_init.

I should clarify that I am not looking to debug in parallel mode; running with one process for debugging purposes is more than enough. I am not trying to trace MPI bugs; whatever is wrong with my code will show up even in single-process runs.

However, most debugging tools require the following syntax (take compute-sanitizer as an example):
compute-sanitizer [options] app_name [app_options]
Here app_name must be a single executable name, but I need to invoke mpirun -np 1 app_name.

Perhaps my question is silly, but I cannot find a way to use these tools with executables that have to be launched with mpirun or mpiexec.

Sorry, I’m not clear on what the issue is. I often run my applications using “mpirun -np 1 utility_name app_name” without issue.

Have you tried running “mpirun -np 1 compute-sanitizer app_name”?
Is there a particular error you’re getting?

The application will still get run as normal, including the MPI initialization.

I do this with cuda-gdb and Nsight-Compute (ncu) as well. For Nsight-Systems (nsys), it can be put before mpirun for multi-rank profiling, or just before the application for a profile per rank.
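
To make the per-rank placement concrete (a sketch; the executable name and rank count are placeholders, and the per-rank output naming assumes nsys’s %q{ENV_VAR} substitution together with OpenMPI’s rank variable):

# one nsys report per rank, written to separate files
mpirun -np 2 nsys profile -o report_rank%q{OMPI_COMM_WORLD_RANK} ./app_name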

Also, I’ll often use a wrapper script to set environment variables on a per-rank basis. For example, this one sets CUDA_VISIBLE_DEVICES to the OpenMPI local rank:

% cat wrapper.sh
#!/bin/bash
# bind each rank to its own GPU via the OpenMPI local-rank variable
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"

This would then be invoked as:

mpirun -np $numranks sh wrapper.sh app_name
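
The wrapper composes with the debugging tools as well, e.g. (names are placeholders):

mpirun -np 1 sh wrapper.sh compute-sanitizer app_name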

Well, you just solved a very old question of mine!

I always thought that the proper syntax for using external tools with MPI was:

compute-sanitizer mpirun -np 1 app_name

As expected, this does not work, because the tool then treats mpirun itself as the executable to inspect.

I never thought of trying

mpirun -np 1 compute-sanitizer app_name

as you suggested.

Thank you!

P.S.
I used compute-sanitizer with the default options and was able to locate the bug very easily!
Thanks a bunch!
