nvcc Segmentation Fault (addressing issue with stack vars?) local to global gather

AnthonyYau · October 13, 2009, 4:23pm

Compiling the following proof-of-concept generates a segfault in NVCC: (after the insert, I have two fixes that do not crash the compiler)

[codebox]include <stdio.h>

define TMAX 32

global void cuKernel(float *d_flt, int dlen)

{

int nTiles = dlen/TMAX;

float lcl[nTiles]; /* This *should* be put in (slow) local memory */

int tid = threadIdx.x;

for (int i=0; i<nTiles; i++) lcl[i] = -1.e0;

//lcl[0] = -1.e0; /* COMMENT THIS LINE TO SEGFAULT THE COMPILER */

for (int i=0; i<nTiles; i++) d_flt[tid+i*TMAX] = lcl[0];

return;

}

define DLEN 1024

int main(int argc, char *argv)

{

float *h_flt, *d_flt;

float flt[DLEN];

int nBytes;

h_flt = &flt[0];

for (int i=0; i<DLEN; i++) h_flt[i] = i*1.e0;

nBytes = sizeof(float)*DLEN;

cudaMalloc((void**)&d_flt,nBytes);

cudaMemcpy(d_flt,h_flt,nBytes,cudaMemcpyHostToDevice);

{

    dim3 dimGrd(1,1,1);

    dim3 dimBlk(TMAX,1,1);

    cuKernel<<<dimGrd,dimBlk>>>(d_flt,DLEN);

}

printf("The old ends are %f %f ... %f %f\n",

       h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);

cudaMemcpy(h_flt,d_flt,nBytes,cudaMemcpyDeviceToHost);

printf("The new ends are %f %f ... %f %f\n",

       h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);

cudaFree(d_flt);

return 0;

}

[/codebox]

Fix 1: (Just to get the code to compile)

Uncomment the hardcoded lcl[0] = -1.e0 on line 12.

Fix 2: Declare a maximum local stack size.

Replace “nTiles” on lines 9 and 11 with a number like 64 or 128.

Info:

I have an algorithm that needs a runtime-defined amount of temporary local memory per thread block. Since I have no idea how many TBs are running at once, I chose to attempt to distribute the stack across the threads in local memory. If I try to allocate all temporary arrays from the host into global memory, then I will exhaust the memory on the card.

I could set a hardcoded stack limit (like Fix 2 above) or I could allocate a handful of global arrays and let the active TBs attempt to grab them with mutex locks and atomics. Both of which I am prepared to do.

I just thought that I would post this issue in case it reveals a bug in the optimizer or something.

Thanks, All.

Anthony

The error message:

nvcc main.cu

Signal: Segmentation fault in Code_Expansion phase.

(0): Error: Signal Segmentation fault in phase Code_Expansion – processing aborted

*** Internal stack backtrace:

/opt/cuda/cuda/open64/lib//be [0x6c09bf]

/opt/cuda/cuda/open64/lib//be [0x6c1609]

/opt/cuda/cuda/open64/lib//be [0x6c0d5d]

/opt/cuda/cuda/open64/lib//be [0x6c1fa6]

/lib64/libc.so.6 [0x3bfd2301b0]

/opt/cuda/cuda/open64/lib//be [0x544449]

/opt/cuda/cuda/open64/lib//be [0x56fcec]

/opt/cuda/cuda/open64/lib//be [0x56c7bc]

/opt/cuda/cuda/open64/lib//be [0x56dc31]

/opt/cuda/cuda/open64/lib//be [0x56c7a8]

/opt/cuda/cuda/open64/lib//be [0x57132b]

/opt/cuda/cuda/open64/lib//be [0x571e46]

/opt/cuda/cuda/open64/lib//be [0x5723fc]

/opt/cuda/cuda/open64/lib//be [0x54e980]

/opt/cuda/cuda/open64/lib//be [0x405443]

/opt/cuda/cuda/open64/lib//be [0x4061f1]

/opt/cuda/cuda/open64/lib//be [0x40751d]

/lib64/libc.so.6(__libc_start_main+0xf4) [0x3bfd21d8b4]

/opt/cuda/cuda/open64/lib//be [0x4038da]

nvopencc INTERNAL ERROR: /opt/cuda/cuda/open64/lib//be died due to signal 4

My environment:

nvcc -V

nvcc: NVIDIA Â® Cuda compiler driver

Built on Thu_Jul_30_09:24:36_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

dmesg | head -1

Linux version 2.6.18-92.1.6.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Wed Jun 25 12:38:37 EDT 2008

gcc -v

Using built-in specs.

Target: x86_64-redhat-linux

Configured with: …/configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux

Thread model: posix

gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Other stuff:

GeForce 9600 GT, Pentium D dual-core, blah, blah, blah. I don’t think the machine is affecting the compiler.

scotch · January 23, 2010, 9:16am

Hi. I had the same error compiling my program.

The error dissapears compiling with option -deviceemu.

Do you solve it with another solution?

Thks.

Compiling the following proof-of-concept generates a segfault in NVCC: (after the insert, I have two fixes that do not crash the compiler)

[codebox]include <stdio.h>

define TMAX 32

global void cuKernel(float *d_flt, int dlen)

{
int nTiles = dlen/TMAX;

float lcl[nTiles]; /* This *should* be put in (slow) local memory */

int tid = threadIdx.x;

for (int i=0; i<nTiles; i++) lcl[i] = -1.e0;

//lcl[0] = -1.e0; /* COMMENT THIS LINE TO SEGFAULT THE COMPILER */

for (int i=0; i<nTiles; i++) d_flt[tid+i*TMAX] = lcl[0];

return;
}

define DLEN 1024

int main(int argc, char *argv)

{
float *h_flt, *d_flt;

float flt[DLEN];

int nBytes;

h_flt = &flt[0];

for (int i=0; i<DLEN; i++) h_flt[i] = i*1.e0;

nBytes = sizeof(float)*DLEN;

cudaMalloc((void**)&d_flt,nBytes);

cudaMemcpy(d_flt,h_flt,nBytes,cudaMemcpyHostToDevice);

{

    dim3 dimGrd(1,1,1);

    dim3 dimBlk(TMAX,1,1);

    cuKernel<<<dimGrd,dimBlk>>>(d_flt,DLEN);

}

printf("The old ends are %f %f ... %f %f\n",

       h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);

cudaMemcpy(h_flt,d_flt,nBytes,cudaMemcpyDeviceToHost);

printf("The new ends are %f %f ... %f %f\n",

       h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);

cudaFree(d_flt);

return 0;
}

[/codebox]

Fix 1: (Just to get the code to compile)

Uncomment the hardcoded lcl[0] = -1.e0 on line 12.

Fix 2: Declare a maximum local stack size.

Replace “nTiles” on lines 9 and 11 with a number like 64 or 128.

Info:

I have an algorithm that needs a runtime-defined amount of temporary local memory per thread block. Since I have no idea how many TBs are running at once, I chose to attempt to distribute the stack across the threads in local memory. If I try to allocate all temporary arrays from the host into global memory, then I will exhaust the memory on the card.

I could set a hardcoded stack limit (like Fix 2 above) or I could allocate a handful of global arrays and let the active TBs attempt to grab them with mutex locks and atomics. Both of which I am prepared to do.

I just thought that I would post this issue in case it reveals a bug in the optimizer or something.

Thanks, All.

Anthony

The error message:

nvcc main.cu

Signal: Segmentation fault in Code_Expansion phase.

(0): Error: Signal Segmentation fault in phase Code_Expansion – processing aborted

*** Internal stack backtrace:
/opt/cuda/cuda/open64/lib//be [0x6c09bf]

/opt/cuda/cuda/open64/lib//be [0x6c1609]

/opt/cuda/cuda/open64/lib//be [0x6c0d5d]

/opt/cuda/cuda/open64/lib//be [0x6c1fa6]

/lib64/libc.so.6 [0x3bfd2301b0]

/opt/cuda/cuda/open64/lib//be [0x544449]

/opt/cuda/cuda/open64/lib//be [0x56fcec]

/opt/cuda/cuda/open64/lib//be [0x56c7bc]

/opt/cuda/cuda/open64/lib//be [0x56dc31]

/opt/cuda/cuda/open64/lib//be [0x56c7a8]

/opt/cuda/cuda/open64/lib//be [0x57132b]

/opt/cuda/cuda/open64/lib//be [0x571e46]

/opt/cuda/cuda/open64/lib//be [0x5723fc]

/opt/cuda/cuda/open64/lib//be [0x54e980]

/opt/cuda/cuda/open64/lib//be [0x405443]

/opt/cuda/cuda/open64/lib//be [0x4061f1]

/opt/cuda/cuda/open64/lib//be [0x40751d]

/lib64/libc.so.6(__libc_start_main+0xf4) [0x3bfd21d8b4]

/opt/cuda/cuda/open64/lib//be [0x4038da]
nvopencc INTERNAL ERROR: /opt/cuda/cuda/open64/lib//be died due to signal 4

My environment:

nvcc -V

nvcc: NVIDIA Â® Cuda compiler driver

Copyright Â© 2005-2009 NVIDIA Corporation

Built on Thu_Jul_30_09:24:36_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

dmesg | head -1

Linux version 2.6.18-92.1.6.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Wed Jun 25 12:38:37 EDT 2008

gcc -v

Using built-in specs.

Target: x86_64-redhat-linux

Configured with: …/configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux

Thread model: posix

gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Other stuff:

GeForce 9600 GT, Pentium D dual-core, blah, blah, blah. I don’t think the machine is affecting the compiler.

avidday · January 23, 2010, 10:16am

While I am surprised it causes the compiler to fail, the underlying problem is actually illegal syntax in the CUDA code. This:

int nTiles = dlen/TMAX;

float lcl[nTiles]; /* This *should* be put in (slow) local memory */

make lcl a dynamically allocated array, because dlen isn’t known at compile time. That sort of dynamic declaration in C90 is not permitted. In C99 it is, but CUDA doesn’t/shouldn’t support those features - at the moment device side dynamic memory allocation inside kernels is not allowed. My guess is that nvopencc is trying to compile it, but winds up in some dark and dusty piece of parser/generator code that shouldn’t be active and blows up as a result. It shouldn’t fail like that, but the syntax it is trying to compile is the underlying reason.

Fix your code so that it complies with the C90 and CUDA language requirements, and the problem will go away.

scotch · January 27, 2010, 8:53pm

I met a similar error some month ago, but the compiler told me something like ‘VLA (variable length array) not supported’

and so I understood the problem.

If I need to use an array witch dimension depends from an extern parameter, what could I do?

Thanks again.

While I am surprised it causes the compiler to fail, the underlying problem is actually illegal syntax in the CUDA code. This:
int nTiles = dlen/TMAX;

float lcl[nTiles]; /* This *should* be put in (slow) local memory */
make lcl a dynamically allocated array, because dlen isn’t known at compile time. That sort of dynamic declaration in C90 is not permitted. In C99 it is, but CUDA doesn’t/shouldn’t support those features - at the moment device side dynamic memory allocation inside kernels is not allowed. My guess is that nvopencc is trying to compile it, but winds up in some dark and dusty piece of parser/generator code that shouldn’t be active and blows up as a result. It shouldn’t fail like that, but the syntax it is trying to compile is the underlying reason.

Fix your code so that it complies with the C90 and CUDA language requirements, and the problem will go away.

avidday · January 27, 2010, 9:05pm

For a thread local array, you can’t do it. CUDA doesn’t supported (in fact C90 doesn’t support it). You only options are use shared memory (which you can dynamically allocate as part of the kernel call), use global memory with an indexing scheme that gives each thread a chunk of it, or declare a fixed length array which will be big enough to hold every case, and live with the wasted memory in cases where it is too large.

scotch · January 30, 2010, 1:51pm

Yes. I thought at the same thing! Ths

JaredHoberock · January 31, 2010, 12:40pm

This seems to be the minimal program which reproduces the compiler crash:

[codebox]

global void foo(int *x, int N)

{

int y[N];

*x = y[0];

return;

}

int main(void)

{

return 0;

}

[/codebox]

I’ll file a bug.

sn3rd · August 15, 2010, 4:49pm

Did you manage to find a way to allocate a per-thread array like this?

sn3rd · August 15, 2010, 4:49pm

Did you manage to find a way to allocate a per-thread array like this?

Topic		Replies	Views
nvcc Segfault CUDA Programming and Performance	6	11442	October 14, 2010
Defining Array in Kernel CUDA Programming and Performance	26	3492	June 29, 2009
Compilation error CUDA Programming and Performance	4	966	April 17, 2010
nvcc segfault in Code_Expansion phase CUDA Programming and Performance	0	1595	June 24, 2010
nvcc may crash with partial array initializers CUDA Programming and Performance	1	1880	May 29, 2008
NVCC bug report: a runtime error CUDA Programming and Performance	7	6483	March 19, 2009
Shared Memory initialization CUDA Programming and Performance	19	45330	March 26, 2007
Simple code crashes nvopencc be.exe crashes with stack overflow or out of memory CUDA Programming and Performance	3	7232	April 9, 2011
Unable to Allocate on Memory and Stack Overflow CUDA Programming and Performance	11	2588	September 6, 2011
Declaring local arrays in device code Legacy PGI Compilers	16	9116	June 8, 2012

nvcc Segmentation Fault (addressing issue with stack vars?) local to global gather

Related topics