I have a piece of code that, in theory, performs an axpy using C++ stdpar. (I recognize that a sane person would not write it this way; I'm reproducing an issue I saw in an existing codebase I'm porting.)
#include <algorithm>
#include <cassert>
#include <execution>
#include <functional>
#include <ranges>
#include <vector>

struct data {
    int n;
    double *x, *y, *z;
};

void do_stuff(data* arrays) {
    auto groups = std::ranges::views::iota(0, arrays->n);
    std::for_each(std::execution::par_unseq, groups.begin(), groups.end(), [=](int g) {
        arrays->z[g] = arrays->x[g] + arrays->y[g];
    });
}

int main(int argc, char* argv[]) {
    using std::execution::par_unseq;
    using std::vector;

    data arrays;
    arrays.n = 100'000'000;
    arrays.x = new double[arrays.n];
    arrays.y = new double[arrays.n];
    arrays.z = new double[arrays.n];

    // Initialize x and y
    for(int i = 0; i < arrays.n; i++) {
        arrays.x[i] = i;
        arrays.y[i] = arrays.n - i;
    }

    // Compute sum of x and y
    //std::transform(par_unseq, x.cbegin(), x.cend(), y.cbegin(),
    //               z.begin(), std::plus<>{});
    do_stuff(&arrays);

    // Assert that the sum is correct
    for(int i = 0; i < arrays.n; i++) {
        assert(arrays.z[i] == arrays.n);
    }

    delete[] arrays.x;
    delete[] arrays.y;
    delete[] arrays.z;
    return 0;
}
I compile with nvc++ -stdpar -Minfo -std=c++20 main.cpp -o axpy, and the code crashes at runtime with:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x155527836140
Thread 1 "axpy" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (128,0,0), device 0, sm 0, warp 4, lane 0]
0x0000155527836180 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::for_each_f<std::ranges::iota_view<int, int>::_Iterator, thrust::detail::wrapped_function<do_stuff(data*)::{lambda(int)#1}, void> >, decltype ((std::ranges::__detail::__to_signed_like<int>)((declval<int>)()))>, thrust::cuda_cub::for_each_f<std::ranges::iota_view<int, int>::_Iterator, thrust::detail::wrapped_function<do_stuff(data*)::{lambda(int)#1}, void> >, decltype ((std::ranges::__detail::__to_signed_like<int>)((declval<int>)()))>(thrust::cuda_cub::for_each_f<std::ranges::iota_view<int, int>::_Iterator, thrust::detail::wrapped_function<do_stuff(data*)::{lambda(int)#1}, void> >, decltype ((std::ranges::__detail::__to_signed_like<int>)((declval<int>)())))<<<(195313,1,1),(256,1,1)>>> () at /home/klinalic/gpu-programming-models/axpy/main.cpp:16 in do_stuff(data*)::{lambda(int)#1}::operator() inlined from function.h:125
16 arrays->z[g] = arrays->x[g] + arrays->y[g];
Interestingly, if I change data arrays; to data* arrays = new data; (and propagate that change through the rest of the code), the code works fine. Is this because the second version allocates the struct on the heap? Is this intended behavior?
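For reference, here is roughly what I mean by "propagate that change": only main changes; the struct definition and do_stuff stay exactly as above, and only the allocation of the data object itself moves to the heap.

int main(int argc, char* argv[]) {
    // Allocate the struct itself on the heap instead of the stack
    data* arrays = new data;
    arrays->n = 100'000'000;
    arrays->x = new double[arrays->n];
    arrays->y = new double[arrays->n];
    arrays->z = new double[arrays->n];

    // Initialize x and y
    for(int i = 0; i < arrays->n; i++) {
        arrays->x[i] = i;
        arrays->y[i] = arrays->n - i;
    }

    do_stuff(arrays);  // pass the pointer directly

    // Assert that the sum is correct
    for(int i = 0; i < arrays->n; i++) {
        assert(arrays->z[i] == arrays->n);
    }

    delete[] arrays->x;
    delete[] arrays->y;
    delete[] arrays->z;
    delete arrays;
    return 0;
}

With this version, the same nvc++ -stdpar build runs to completion without the Warp Illegal Address exception.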