Stack overflow error with compute-sanitizer

Hi everyone, I’m modifying a library to make it possible to use the GPU. I modified the code to ensure that various functions can be executed directly within the GPU when called from my kernel.

  • the kernel :
__global__ void SetTDState_rhoe_Kernel(CCoolProp * d_CoolProp, Helmholtz *d_HELMHOLTZ, su2double *d_DENSITY,  su2double *d_ENERGY,su2double *d_PRESSURE, su2double *d_TEMPERATURE,
						su2double *d_SOUNDSPEED2, su2double *d_DPDRHOE, su2double *d_DPDERHO, su2double *d_DTDERHO, su2double *d_DTDRHOE, su2double *d_ENTROPY,
						unsigned long nPoint){

	int id= blockIdx.x * blockDim.x  + threadIdx.x;


// VERSIONE BASE

	if(id<nPoint){

		su2double Density = d_DENSITY[id];
		su2double StaticEnergy = d_ENERGY[id];
		d_CoolProp->update(d_HELMHOLTZ[id], DmassUmass_INPUTS, Density,
				StaticEnergy);

		__syncthreads();

		d_PRESSURE[id] = d_HELMHOLTZ[id].p();
		d_TEMPERATURE[id] = d_HELMHOLTZ[id].T();
		d_ENTROPY[id] = d_HELMHOLTZ[id].smass();
		d_DPDRHOE[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iP, iDmass, iUmass);
		d_DPDERHO[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iP, iUmass, iDmass);
		d_DTDRHOE[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iT, iDmass, iUmass);
		d_DTDERHO[id] = d_CoolProp[id].calc_first_partial_deriv(d_HELMHOLTZ[id],
				iT, iUmass, iDmass);
		if (d_HELMHOLTZ[id]._phase == iphase_twophase) {
			// impose gas phase
			d_TEMPERATURE[id] = d_TEMPERATURE[id] + 0.1;
			d_CoolProp[id].CheckPressure1(d_PRESSURE[id]);
			d_CoolProp[id].CheckTemperature1(d_TEMPERATURE[id]);
			d_CoolProp[id].update(d_HELMHOLTZ[id], PT_INPUTS, d_PRESSURE[id],
					d_TEMPERATURE[id]);
			d_SOUNDSPEED2[id] = pow(
					d_CoolProp[id].calc_speed_sound(d_HELMHOLTZ[id]), 2);
		} else {
			d_SOUNDSPEED2[id] = pow(
					d_CoolProp[id].calc_speed_sound(d_HELMHOLTZ[id]), 2);
		}

	}

}
  • The update function:
__device__ void CCoolProp::update(Helmholtz& HEOS, input_pairs pairs, su2double value1, su2double value2){

    su2double ld_value1 = value1, ld_value2 = value2;
    pre_update(HEOS, pairs, ld_value1, ld_value2);
    value1 = ld_value1;
    value2 = ld_value2;


    switch (pairs) {
    case PT_INPUTS:
        HEOS._p = value1;
        HEOS._T = value2;
        PT_flash(HEOS);
        break;

    case DmolarT_INPUTS:

        HEOS._rhomolar = value1;
        HEOS._T = value2;
        DHSU_T_flash(HEOS, iDmolar);
        break;
    case SmolarT_INPUTS:
        HEOS._smolar = value1;
        HEOS._T = value2;
        DHSU_T_flash(HEOS, iSmolar);
        break;
    case DmolarUmolar_INPUTS:

        HEOS._rhomolar = value1;
        HEOS._umolar = value2;
        HSU_D_flash(HEOS, iUmolar);
        break;
    case QT_INPUTS:
        HEOS._Q = value1;
        HEOS._T = value2;
        if ((HEOS._Q < 0) || (HEOS._Q > 1)) printf("Input vapor quality [Q] must be between 0 and 1 \n");
        QT_flash(HEOS);
        break;
    }
}

This is only a small part of the code, but if I run the program with the compute-sanitizer debugger I get:

Exit_CompSan.txt (106.2 KB)

For example if you catch the first mistake reported in the log file:

========= COMPUTE-SANITIZER
========= Stack overflow
========= at 0x4a9580 in /home/marco/CFD/SU2_CUDA_HELMHOLTZ/SU2_CFD/src/fluid/CCoolProp.cu:1384:SetTDState_rhoe_Kernel(CCoolProp *, Helmholtz *, double *, double *, double *, double *, double *, double *, double *, double *, double *, double *, unsigned long)
========= by thread (0,0,0) in block (12,0,0)

Where the function at line 1384 of CCoolProp.cu is:

__device__ void CCoolProp::QT_flash(Helmholtz& HEOS) {

       // The maximum possible saturation temperature
       // Critical point for pure fluids, slightly different for pseudo-pure, very different for mixtures
       CoolPropDbl Tmax_sat = HEOS.calc_Tmax_sat() + 1e-13;

       // Check what the minimum limits for the equation of state are
       CoolPropDbl Tmin_sat;
//       calc_Tmin_sat(HEOS, Tmin_satL, Tmin_satV);
       Tmin_sat = max(HEOS.sat_min_liquid.T, HEOS.sat_min_vapor.T) - 1e-13;

//        // Get a reference to keep the code a bit cleaner
//        const CriticalRegionSplines& splines = HEOS.components[0].EOS().critical_region_splines;

           // Set some input options
           saturation_T_pure_Akasaka_options options(false);

           // Actually call the solver
           saturation_T_pure_Maxwell(HEOS, HEOS._T, options);

           HEOS._p = 0.5 * HEOS.SatV->p() + 0.5 * HEOS.SatL->p();
           HEOS._rhomolar = 1 / (HEOS._Q / HEOS.SatV->rhomolar() + (1 - HEOS._Q) / HEOS.SatL->rhomolar());
       // Load the outputs
       HEOS._phase = iphase_twophase;

}

My question is : what kind of error is this :

========= COMPUTE-SANITIZER
========= Stack overflow
========= at 0x4a9580 in /home/marco/CFD/SU2_CUDA_HELMHOLTZ/SU2_CFD/src/fluid/CCoolProp.cu:1384:SetTDState_rhoe_Kernel(CCoolProp *, Helmholtz *, double *, double *, double *, double *, double *, double *, double *, double *, double *, double *, unsigned long)
========= by thread (0,0,0) in block (12,0,0)

Just like a CPU, each CUDA thread maintains a stack. The stack is of fixed size. If your code behaves in such a way that it uses more stack space than that fixed size, it will result in a stack overflow. The machine is not ordinarily guaranteed to automatically catch such an error the moment it happens (although the resultant corrupted program behavior may eventually trigger a machine fault of some kind), but the additional instrumentation in compute-sanitizer can detect it. here is a general description, not GPU-specific.

Along with the above general description, I would pay close attention to any compiler warning messages that are emitted when compiling your code. Sometimes compiler warning messages take the form of “… stack size … cannot be statically determined …” Any such messages may provide additional clues. I’m not suggesting that the compiler would always emit such a warning message, or that it is doing so in your case. However if there are any such warning messages, they may provide additional clues.

Here is an example program that triggers that stack overflow error in compute-sanitizer:

# cat t48a.cu
#include <cstdio>
#include <iostream>
#include <cstdlib>

__device__ int a(int m, int n);

__global__ void k(int x, int y){
  printf("%d\n", a(x,y));
}

int main(int argc, char *argv[]){

  int val = 9;  // 3 or higher seems to be enough to trigger the fault
  if (argc > 1) val = atoi(argv[1]);
  k<<<1,1>>>(val,val);
  cudaError_t err = cudaDeviceSynchronize();
  std::cout << cudaGetErrorString(err) << std::endl;
}
# cat t48b.cu
__device__ int a(int m, int n){
    if (m == 0)
        return n + 1;
    if ((m > 0) && (n == 0))
        return a(m-1,1);
    else
        return a(m-1,a(m,n-1));
}
# nvcc -rdc=true t48a.cu t48b.cu -o t48 -lineinfo
nvlink warning : Stack size for entry function '_Z1kii' cannot be statically determined
# compute-sanitizer ./t48
========= COMPUTE-SANITIZER
========= Stack overflow
=========     at 0x10 in /root/bobc/t48b.cu:1:a(int, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x390]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x390]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x310]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
=========     Device Frame:/root/bobc/t48a.cu:6:k(int, int) [0x80]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e950]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:libcudart_static_4d8b33a106dceb3c07a56e26de61f2d53bb62a68 [0x1093e]
=========                in /root/bobc/./t48
=========     Host Frame:cudaLaunchKernel [0x70b4e]
=========                in /root/bobc/./t48
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0xb14e]
=========                in /root/bobc/./t48
=========     Host Frame:__device_stub__Z1kii(int, int) [0xb024]
=========                in /root/bobc/./t48
=========     Host Frame:k(int, int) [0xb05f]
=========                in /root/bobc/./t48
=========     Host Frame:main [0xae53]
=========                in /root/bobc/./t48
=========     Host Frame: [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0xab05]
=========                in /root/bobc/./t48
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x47e786]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaDeviceSynchronize [0x48a64]
=========                in /root/bobc/./t48
=========     Host Frame:main [0xae58]
=========                in /root/bobc/./t48
=========     Host Frame: [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0xab05]
=========                in /root/bobc/./t48
=========
unspecified launch failure
========= ERROR SUMMARY: 2 errors
#

(CUDA 12.2)
NOTES:

  • I lifted the recursive function definition from here (also see here).

  • The compiler seems to be quite good at converting my naive attempts at recursion into a partially unrolled loop.

  • In the above example, the error message printout at the end (“unspecified launch failure”) may change to some other message (e.g. “an illegal memory access was encountered”) if the code is not run under compute-sanitizer. This is an indication that compute-sanitizer instruments the code and machine in such a way that the fault is detected differently. But in the general case I know of no guarantees that a runtime fault will be triggered in the event of stack overflow, without the use of a tool like compute-sanitizer. YMMV

1 Like

Hi Robert, first of all thank you for the quick and very detailed response.
During the compilation I get this warning, that is the same you obtain in the above example.

ptxas warning : Stack size for entry function ‘_Z22SetTDState_rhoe_KernelP9CCoolPropP9HelmholtzPdS3_S3_S3_S3_S3_S3_S3_S3_S3_m’ cannot be statically determined

For you, what could be a path to follow to eliminate this problem, if it can be eliminated?

The first thing I would do is demangle that function name and confirm it matches the kernel you have shown (I think it does).

The next thing I would do is look, in the scope of that function/kernel, to see if there is any evidence of recursion. You have enough function calls and method calls in that kernel that I don’t think that can be completely done using the code you have shown here, but based on the code you have shown here, I didn’t spot any obvious recursion.

If there isn’t any recursion, it becomes a head scratcher for me. I cannot always spell out the complete debugging path when I don’t actually have the code to work with. It might be useful at that point to dust off cuda-gdb and see if I can glean any information by watching the code behavior while executing a test case that leads up to a failure. Unit 12 here may help you to get started with debugging. You would want to enable the memory checking facility in cuda-gdb, and if you wanted to you could start by just letting it run to fault, and then studying the backtrace to see what code sequence immediately preceded the fault. If you find cuda-gdb to be clumsy, you could also possibly use the backtrace info to sprinkle printf in your kernel code to see data as it is being modified up to the fault point, however this can be tricky in the presence of a machine fault like this, so the first thing you might have to do in that case is properly guard against the fault - i.e. use boolean code to detect when the fault is about to occur and prevent execution (e.g. return) at that point.

If there is recursion, then I would study the use of that recursion to see if I could:

  1. understand its depth
  2. replace it with an equivalent iterative (i.e. loop-driven) solution

In any case, as you have already discovered, you can try adjusting the stack size per thread using cudaDeviceSetLimit(cudaLimitStackSize, ...);

Hi Robert,
if you want I can share the program with you but I don’t know if I will be able to due to the large size of the program, about 1 GB.
In the meantime, however, I did some tests.
In the first message you can see the kernel calculating the thermodynamic variables on a grid of points of size nPoint.

Going for a test launch with changing the stack size:

    cudaError_t error;
    error = cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    if (error != cudaSuccess) {
        printf("cudaDeviceSetLimit failed with %d, line(%d)\n", error, __LINE__);
        exit(EXIT_FAILURE);
    }

    size_t p_val;
    error = cudaDeviceGetLimit(&p_val, cudaLimitStackSize);
    if (error != cudaSuccess) {
        printf("cudaDeviceGetLimit failed with %d, line(%d)\n", error, __LINE__);
        exit(EXIT_FAILURE);
    }

    printf("stack size limit: %ld\n", p_val);

and using 1 thread per block and 1block for grid:

	SetTDState_rhoe_Kernel
				<<<1,1>>>(
				h_CoolProp, h_HELMHOLTZ, kDensity, kEnergy, kPressure, kTemperature,
				kSoundSpeed2, kDpDrho_e, kDpDe_rho, kDtDe_rhoe, kDtDrho_e, kEntropy,nPoint);

the code works.But as you can imagine, by doing this he calculates the thermodynamic properties of a single point. if I try to increase the number of threads :
` dim3 dimBlock(32,1);
dim3 dimGrid((nPoint + dimBlock.x - 1) / dimBlock.x, 1);

SetTDState_rhoe_Kernel
			<<<dimGrid,dimBlock>>>(
			h_CoolProp, h_HELMHOLTZ, kDensity, kEnergy, kPressure, kTemperature,
			kSoundSpeed2, kDpDrho_e, kDpDe_rho, kDtDe_rhoe, kDtDrho_e, kEntropy,nPoint);

`

it starts giving errors which however are related to the code. If I try to launch the program with cuda compute-sanitizer it gives me zero errors. Since I have checked the code many times and it is correct, could it still be a problem related to too much GPU memory usage?

In my experience, such a problem would normally be flagged by compute-sanitizer. Since compute-sanitizer reports no errors with the increase in stack size, I would assume that the stack size problem is resolved, and the issues you have now may be due to coding bugs. I think ordinary debugging is probably in order. I cannot state such a thing categorically concerning code that I have not seen, it’s just mentally how I would approach such a problem. I do not assume or look for “unknown problems” until I have exhausted my methods for finding (and resolving) problems that I know how to find.

Following your advice I checked the code more thoroughly and the error was in a parameter. Big thanks for the help!!