nvcc Segmentation Fault (addressing issue with stack vars?) local to global gather

Compiling the following proof-of-concept generates a segfault in NVCC: (after the insert, I have two fixes that do not crash the compiler)

[codebox]include <stdio.h>

define TMAX 32

global void cuKernel(float *d_flt, int dlen)

{

int nTiles = dlen/TMAX;

float lcl[nTiles]; /* This *should* be put in (slow) local memory */

int tid = threadIdx.x;

for (int i=0; i<nTiles; i++) lcl[i] = -1.e0;

//lcl[0] = -1.e0; /* COMMENT THIS LINE TO SEGFAULT THE COMPILER */

for (int i=0; i<nTiles; i++) d_flt[tid+i*TMAX] = lcl[0];

return;

}

define DLEN 1024

int main(int argc, char *argv)

{

float *h_flt, *d_flt;

float flt[DLEN];

int nBytes;

h_flt = &flt[0];

for (int i=0; i<DLEN; i++) h_flt[i] = i*1.e0;

nBytes = sizeof(float)*DLEN;

cudaMalloc((void**)&d_flt,nBytes);

cudaMemcpy(d_flt,h_flt,nBytes,cudaMemcpyHostToDevice);

{

    dim3 dimGrd(1,1,1);

    dim3 dimBlk(TMAX,1,1);

    cuKernel<<<dimGrd,dimBlk>>>(d_flt,DLEN);

}

printf("The old ends are %f %f ... %f %f\n",

       h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);

cudaMemcpy(h_flt,d_flt,nBytes,cudaMemcpyDeviceToHost);

printf("The new ends are %f %f ... %f %f\n",

       h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);

cudaFree(d_flt);

return 0;

}

[/codebox]

Fix 1: (Just to get the code to compile)

Uncomment the hardcoded lcl[0] = -1.e0 on line 12.

Fix 2: Declare a maximum local stack size.

Replace “nTiles” on lines 9 and 11 with a number like 64 or 128.

Info:

I have an algorithm that needs a runtime-defined amount of temporary local memory per thread block. Since I have no idea how many TBs are running at once, I chose to attempt to distribute the stack across the threads in local memory. If I try to allocate all temporary arrays from the host into global memory, then I will exhaust the memory on the card.

I could set a hardcoded stack limit (like Fix 2 above) or I could allocate a handful of global arrays and let the active TBs attempt to grab them with mutex locks and atomics. Both of which I am prepared to do.

I just thought that I would post this issue in case it reveals a bug in the optimizer or something.

Thanks, All.

Anthony

The error message:

nvcc main.cu

Signal: Segmentation fault in Code_Expansion phase.

(0): Error: Signal Segmentation fault in phase Code_Expansion – processing aborted

*** Internal stack backtrace:

/opt/cuda/cuda/open64/lib//be [0x6c09bf]

/opt/cuda/cuda/open64/lib//be [0x6c1609]

/opt/cuda/cuda/open64/lib//be [0x6c0d5d]

/opt/cuda/cuda/open64/lib//be [0x6c1fa6]

/lib64/libc.so.6 [0x3bfd2301b0]

/opt/cuda/cuda/open64/lib//be [0x544449]

/opt/cuda/cuda/open64/lib//be [0x56fcec]

/opt/cuda/cuda/open64/lib//be [0x56c7bc]

/opt/cuda/cuda/open64/lib//be [0x56dc31]

/opt/cuda/cuda/open64/lib//be [0x56c7a8]

/opt/cuda/cuda/open64/lib//be [0x57132b]

/opt/cuda/cuda/open64/lib//be [0x571e46]

/opt/cuda/cuda/open64/lib//be [0x5723fc]

/opt/cuda/cuda/open64/lib//be [0x54e980]

/opt/cuda/cuda/open64/lib//be [0x405443]

/opt/cuda/cuda/open64/lib//be [0x4061f1]

/opt/cuda/cuda/open64/lib//be [0x40751d]

/lib64/libc.so.6(__libc_start_main+0xf4) [0x3bfd21d8b4]

/opt/cuda/cuda/open64/lib//be [0x4038da]

nvopencc INTERNAL ERROR: /opt/cuda/cuda/open64/lib//be died due to signal 4

My environment:

nvcc -V

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2009 NVIDIA Corporation

Built on Thu_Jul_30_09:24:36_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

dmesg | head -1

Linux version 2.6.18-92.1.6.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Wed Jun 25 12:38:37 EDT 2008

gcc -v

Using built-in specs.

Target: x86_64-redhat-linux

Configured with: …/configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux

Thread model: posix

gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Other stuff:

GeForce 9600 GT, Pentium D dual-core, blah, blah, blah. I don’t think the machine is affecting the compiler.

Hi. I had the same error compiling my program.

The error dissapears compiling with option -deviceemu.

Do you solve it with another solution?

Thks.

While I am surprised it causes the compiler to fail, the underlying problem is actually illegal syntax in the CUDA code. This:

int nTiles = dlen/TMAX;

float lcl[nTiles]; /* This *should* be put in (slow) local memory */

make lcl a dynamically allocated array, because dlen isn’t known at compile time. That sort of dynamic declaration in C90 is not permitted. In C99 it is, but CUDA doesn’t/shouldn’t support those features - at the moment device side dynamic memory allocation inside kernels is not allowed. My guess is that nvopencc is trying to compile it, but winds up in some dark and dusty piece of parser/generator code that shouldn’t be active and blows up as a result. It shouldn’t fail like that, but the syntax it is trying to compile is the underlying reason.

Fix your code so that it complies with the C90 and CUDA language requirements, and the problem will go away.

I met a similar error some month ago, but the compiler told me something like ‘VLA (variable length array) not supported’

and so I understood the problem.

If I need to use an array witch dimension depends from an extern parameter, what could I do?

Thanks again.

For a thread local array, you can’t do it. CUDA doesn’t supported (in fact C90 doesn’t support it). You only options are use shared memory (which you can dynamically allocate as part of the kernel call), use global memory with an indexing scheme that gives each thread a chunk of it, or declare a fixed length array which will be big enough to hold every case, and live with the wasted memory in cases where it is too large.

Yes. I thought at the same thing! Ths

This seems to be the minimal program which reproduces the compiler crash:

[codebox]

global void foo(int *x, int N)

{

int y[N];

*x = y[0];

return;

}

int main(void)

{

return 0;

}

[/codebox]

I’ll file a bug.

Did you manage to find a way to allocate a per-thread array like this?

Did you manage to find a way to allocate a per-thread array like this?