Stack overflow error with compute-sanitizer

m96 · September 20, 2023, 8:12pm

Hi everyone, I’m modifying a library to make it possible to use the GPU. I modified the code to ensure that various functions can be executed directly within the GPU when called from my kernel.

the kernel :

__global__ void SetTDState_rhoe_Kernel(CCoolProp * d_CoolProp, Helmholtz *d_HELMHOLTZ, su2double *d_DENSITY,  su2double *d_ENERGY,su2double *d_PRESSURE, su2double *d_TEMPERATURE,
						su2double *d_SOUNDSPEED2, su2double *d_DPDRHOE, su2double *d_DPDERHO, su2double *d_DTDERHO, su2double *d_DTDRHOE, su2double *d_ENTROPY,
						unsigned long nPoint){

	int id= blockIdx.x * blockDim.x  + threadIdx.x;


// VERSIONE BASE

	if(id<nPoint){

		su2double Density = d_DENSITY[id];
		su2double StaticEnergy = d_ENERGY[id];
		d_CoolProp->update(d_HELMHOLTZ[id], DmassUmass_INPUTS, Density,
				StaticEnergy);

		__syncthreads();

		d_PRESSURE[id] = d_HELMHOLTZ[id].p();
		d_TEMPERATURE[id] = d_HELMHOLTZ[id].T();
		d_ENTROPY[id] = d_HELMHOLTZ[id].smass();
		d_DPDRHOE[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iP, iDmass, iUmass);
		d_DPDERHO[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iP, iUmass, iDmass);
		d_DTDRHOE[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iT, iDmass, iUmass);
		d_DTDERHO[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iT, iUmass, iDmass);
		if (d_HELMHOLTZ[id]._phase == iphase_twophase) {
			// impose gas phase
			d_TEMPERATURE[id] = d_TEMPERATURE[id] + 0.1;
			d_CoolProp[id].CheckPressure1(d_PRESSURE[id]);
			d_CoolProp[id].CheckTemperature1(d_TEMPERATURE[id]);
			d_CoolProp[id].update(d_HELMHOLTZ[id], PT_INPUTS, d_PRESSURE[id],
					d_TEMPERATURE[id]);
			d_SOUNDSPEED2[id] = pow(
					d_CoolProp[id].calc_speed_sound(d_HELMHOLTZ[id]), 2);
		} else {
			d_SOUNDSPEED2[id] = pow(
					d_CoolProp[id].calc_speed_sound(d_HELMHOLTZ[id]), 2);
		}

	}

}

The update function:

__device__ void CCoolProp::update(Helmholtz& HEOS, input_pairs pairs, su2double value1, su2double value2){

    su2double ld_value1 = value1, ld_value2 = value2;
    pre_update(HEOS, pairs, ld_value1, ld_value2);
    value1 = ld_value1;
    value2 = ld_value2;


    switch (pairs) {
    case PT_INPUTS:
        HEOS._p = value1;
        HEOS._T = value2;
        PT_flash(HEOS);
        break;

    case DmolarT_INPUTS:

        HEOS._rhomolar = value1;
        HEOS._T = value2;
        DHSU_T_flash(HEOS, iDmolar);
        break;
    case SmolarT_INPUTS:
        HEOS._smolar = value1;
        HEOS._T = value2;
        DHSU_T_flash(HEOS, iSmolar);
        break;
    case DmolarUmolar_INPUTS:

        HEOS._rhomolar = value1;
        HEOS._umolar = value2;
        HSU_D_flash(HEOS, iUmolar);
        break;
    case QT_INPUTS:
        HEOS._Q = value1;
        HEOS._T = value2;
        if ((HEOS._Q < 0) || (HEOS._Q > 1)) printf("Input vapor quality [Q] must be between 0 and 1 \n");
        QT_flash(HEOS);
        break;
    }
}

This is only a small part of the code, but if I run the program with the compute-sanitizer debugger I get:

Exit_CompSan.txt (106.2 KB)

For example if you catch the first mistake reported in the log file:

========= COMPUTE-SANITIZER
========= Stack overflow
========= at 0x4a9580 in /home/marco/CFD/SU2_CUDA_HELMHOLTZ/SU2_CFD/src/fluid/CCoolProp.cu:1384:SetTDState_rhoe_Kernel(CCoolProp *, Helmholtz *, double *, double *, double *, double *, double *, double *, double *, double *, double *, double *, unsigned long)
========= by thread (0,0,0) in block (12,0,0)

Where the function at line 1384 of CCoolProp.cu is:

__device__ void CCoolProp::QT_flash(Helmholtz& HEOS) {

       // The maximum possible saturation temperature
       // Critical point for pure fluids, slightly different for pseudo-pure, very different for mixtures
       CoolPropDbl Tmax_sat = HEOS.calc_Tmax_sat() + 1e-13;

       // Check what the minimum limits for the equation of state are
       CoolPropDbl Tmin_sat;
//       calc_Tmin_sat(HEOS, Tmin_satL, Tmin_satV);
       Tmin_sat = max(HEOS.sat_min_liquid.T, HEOS.sat_min_vapor.T) - 1e-13;

//        // Get a reference to keep the code a bit cleaner
//        const CriticalRegionSplines& splines = HEOS.components[0].EOS().critical_region_splines;

           // Set some input options
           saturation_T_pure_Akasaka_options options(false);

           // Actually call the solver
           saturation_T_pure_Maxwell(HEOS, HEOS._T, options);

           HEOS._p = 0.5 * HEOS.SatV->p() + 0.5 * HEOS.SatL->p();
           HEOS._rhomolar = 1 / (HEOS._Q / HEOS.SatV->rhomolar() + (1 - HEOS._Q) / HEOS.SatL->rhomolar());
       // Load the outputs
       HEOS._phase = iphase_twophase;

}

My question is : what kind of error is this :

========= COMPUTE-SANITIZER
========= Stack overflow
========= at 0x4a9580 in /home/marco/CFD/SU2_CUDA_HELMHOLTZ/SU2_CFD/src/fluid/CCoolProp.cu:1384:SetTDState_rhoe_Kernel(CCoolProp *, Helmholtz *, double *, double *, double *, double *, double *, double *, double *, double *, double *, double *, unsigned long)
========= by thread (0,0,0) in block (12,0,0)

Robert_Crovella · September 20, 2023, 8:30pm

Just like a CPU, each CUDA thread maintains a stack. The stack is of fixed size. If your code behaves in such a way that it uses more stack space than that fixed size, it will result in a stack overflow. The machine is not ordinarily guaranteed to automatically catch such an error the moment it happens (although the resultant corrupted program behavior may eventually trigger a machine fault of some kind), but the additional instrumentation in compute-sanitizer can detect it. here is a general description, not GPU-specific.

Along with the above general description, I would pay close attention to any compiler warning messages that are emitted when compiling your code. Sometimes compiler warning messages take the form of “… stack size … cannot be statically determined …” Any such messages may provide additional clues. I’m not suggesting that the compiler would always emit such a warning message, or that it is doing so in your case. However if there are any such warning messages, they may provide additional clues.

Here is an example program that triggers that stack overflow error in compute-sanitizer:

# cat t48a.cu
#include <cstdio>
#include <iostream>
#include <cstdlib>

__device__ int a(int m, int n);

__global__ void k(int x, int y){
  printf("%d\n", a(x,y));
}

int main(int argc, char *argv[]){

  int val = 9;  // 3 or higher seems to be enough to trigger the fault
  if (argc > 1) val = atoi(argv[1]);
  k<<<1,1>>>(val,val);
  cudaError_t err = cudaDeviceSynchronize();
  std::cout << cudaGetErrorString(err) << std::endl;
}
# cat t48b.cu
__device__ int a(int m, int n){
    if (m == 0)
        return n + 1;
    if ((m > 0) && (n == 0))
        return a(m-1,1);
    else
        return a(m-1,a(m,n-1));
}
# nvcc -rdc=true t48a.cu t48b.cu -o t48 -lineinfo
nvlink warning : Stack size for entry function '_Z1kii' cannot be statically determined
# compute-sanitizer ./t48
========= COMPUTE-SANITIZER
========= Stack overflow
=========     at 0x10 in /root/bobc/t48b.cu:1:a(int, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x390]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x390]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x310]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48a.cu:6:k(int, int) [0x80]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e950]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:libcudart_static_4d8b33a106dceb3c07a56e26de61f2d53bb62a68 [0x1093e]
=========                in /root/bobc/./t48
=========     Host Frame:cudaLaunchKernel [0x70b4e]
=========                in /root/bobc/./t48
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0xb14e]
=========                in /root/bobc/./t48
=========     Host Frame:__device_stub__Z1kii(int, int) [0xb024]
=========                in /root/bobc/./t48
=========     Host Frame:k(int, int) [0xb05f]
=========                in /root/bobc/./t48
=========     Host Frame:main [0xae53]
=========                in /root/bobc/./t48
=========     Host Frame: [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0xab05]
=========                in /root/bobc/./t48
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x47e786]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaDeviceSynchronize [0x48a64]
=========                in /root/bobc/./t48
=========     Host Frame:main [0xae58]
=========                in /root/bobc/./t48
=========     Host Frame: [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0xab05]
=========                in /root/bobc/./t48
=========
unspecified launch failure
========= ERROR SUMMARY: 2 errors
#

(CUDA 12.2)
NOTES:

I lifted the recursive function definition from here (also see here).
The compiler seems to be quite good at converting my naive attempts at recursion into a partially unrolled loop.
In the above example, the error message printout at the end (“unspecified launch failure”) may change to some other message (e.g. “an illegal memory access was encountered”) if the code is not run under compute-sanitizer. This is an indication that compute-sanitizer instruments the code and machine in such a way that the fault is detected differently. But in the general case I know of no guarantees that a runtime fault will be triggered in the event of stack overflow, without the use of a tool like compute-sanitizer. YMMV

m96 · September 21, 2023, 1:05pm

Hi Robert, first of all thank you for the quick and very detailed response.
During the compilation I get this warning, that is the same you obtain in the above example.

ptxas warning : Stack size for entry function ‘_Z22SetTDState_rhoe_KernelP9CCoolPropP9HelmholtzPdS3_S3_S3_S3_S3_S3_S3_S3_S3_m’ cannot be statically determined

For you, what could be a path to follow to eliminate this problem, if it can be eliminated?

Robert_Crovella · September 21, 2023, 1:16pm

The first thing I would do is demangle that function name and confirm it matches the kernel you have shown (I think it does).

The next thing I would do is look, in the scope of that function/kernel, to see if there is any evidence of recursion. You have enough function calls and method calls in that kernel that I don’t think that can be completely done using the code you have shown here, but based on the code you have shown here, I didn’t spot any obvious recursion.

If there isn’t any recursion, it becomes a head scratcher for me. I cannot always spell out the complete debugging path when I don’t actually have the code to work with. It might be useful at that point to dust off cuda-gdb and see if I can glean any information by watching the code behavior while executing a test case that leads up to a failure. Unit 12 here may help you to get started with debugging. You would want to enable the memory checking facility in cuda-gdb, and if you wanted to you could start by just letting it run to fault, and then studying the backtrace to see what code sequence immediately preceded the fault. If you find cuda-gdb to be clumsy, you could also possibly use the backtrace info to sprinkle printf in your kernel code to see data as it is being modified up to the fault point, however this can be tricky in the presence of a machine fault like this, so the first thing you might have to do in that case is properly guard against the fault - i.e. use boolean code to detect when the fault is about to occur and prevent execution (e.g. return) at that point.

If there is recursion, then I would study the use of that recursion to see if I could:

understand its depth
replace it with an equivalent iterative (i.e. loop-driven) solution

In any case, as you have already discovered, you can try adjusting the stack size per thread using cudaDeviceSetLimit(cudaLimitStackSize, ...);

m96 · September 25, 2023, 3:30pm

Hi Robert,
if you want I can share the program with you but I don’t know if I will be able to due to the large size of the program, about 1 GB.
In the meantime, however, I did some tests.
In the first message you can see the kernel calculating the thermodynamic variables on a grid of points of size nPoint.

Going for a test launch with changing the stack size:

    cudaError_t error;
    error = cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    if (error != cudaSuccess) {
        printf("cudaDeviceSetLimit failed with %d, line(%d)\n", error, __LINE__);
        exit(EXIT_FAILURE);
    }

    size_t p_val;
    error = cudaDeviceGetLimit(&p_val, cudaLimitStackSize);
    if (error != cudaSuccess) {
        printf("cudaDeviceGetLimit failed with %d, line(%d)\n", error, __LINE__);
        exit(EXIT_FAILURE);
    }

    printf("stack size limit: %ld\n", p_val);

and using 1 thread per block and 1block for grid:

	SetTDState_rhoe_Kernel
				<<<1,1>>>(
				h_CoolProp, h_HELMHOLTZ, kDensity, kEnergy, kPressure, kTemperature,
				kSoundSpeed2, kDpDrho_e, kDpDe_rho, kDtDe_rhoe, kDtDrho_e, kEntropy,nPoint);

the code works.But as you can imagine, by doing this he calculates the thermodynamic properties of a single point. if I try to increase the number of threads :
` dim3 dimBlock(32,1);
dim3 dimGrid((nPoint + dimBlock.x - 1) / dimBlock.x, 1);

SetTDState_rhoe_Kernel
			<<<dimGrid,dimBlock>>>(
			h_CoolProp, h_HELMHOLTZ, kDensity, kEnergy, kPressure, kTemperature,
			kSoundSpeed2, kDpDrho_e, kDpDe_rho, kDtDe_rhoe, kDtDrho_e, kEntropy,nPoint);

`

it starts giving errors which however are related to the code. If I try to launch the program with cuda compute-sanitizer it gives me zero errors. Since I have checked the code many times and it is correct, could it still be a problem related to too much GPU memory usage?

Robert_Crovella · September 25, 2023, 3:38pm

In my experience, such a problem would normally be flagged by compute-sanitizer. Since compute-sanitizer reports no errors with the increase in stack size, I would assume that the stack size problem is resolved, and the issues you have now may be due to coding bugs. I think ordinary debugging is probably in order. I cannot state such a thing categorically concerning code that I have not seen, it’s just mentally how I would approach such a problem. I do not assume or look for “unknown problems” until I have exhausted my methods for finding (and resolving) problems that I know how to find.

m96 · October 4, 2023, 1:32pm

Following your advice I checked the code more thoroughly and the error was in a parameter. Big thanks for the help!!

Topic		Replies	Views
Error cp.async.ca.shared.global CUDA Programming and Performance	12	1502	August 17, 2022
Unspecified launch failure error CUDA Programming and Performance	10	18606	January 6, 2018
Random failures in quick-sort. Compute-sanitizer says there's no error CUDA Programming and Performance	4	24	October 5, 2024
How to choose how many threads/blocks to have? CUDA Programming and Performance	43	52493	June 7, 2022
Unknow error when calling device subroutine Legacy PGI Compilers	18	10480	September 1, 2015
[SOLVED]memory leak issues CUDA Programming and Performance	12	3642	October 12, 2015
cufftPlan1D initialisation hides subsequence memory access errors CUDA Programming and Performance	8	656	November 24, 2020
Upgrading to CUDA 12.4 broke down the application GPU-Accelerated Libraries cublas , cusparse	13	1177	July 21, 2024
CUDA thrust compilation error CUDA Programming and Performance	85	396	April 4, 2025
Passing dynamically allocated memory in kernel to sub kernel via dynamic parallelism CUDA Programming and Performance	6	693	May 3, 2019

Stack overflow error with compute-sanitizer

Related topics