Compiling the following proof-of-concept generates a segfault in NVCC: (after the insert, I have two fixes that do not crash the compiler)
[codebox]include <stdio.h>
define TMAX 32
global void cuKernel(float *d_flt, int dlen)
{
int nTiles = dlen/TMAX;
float lcl[nTiles]; /* This *should* be put in (slow) local memory */
int tid = threadIdx.x;
for (int i=0; i<nTiles; i++) lcl[i] = -1.e0;
//lcl[0] = -1.e0; /* COMMENT THIS LINE TO SEGFAULT THE COMPILER */
for (int i=0; i<nTiles; i++) d_flt[tid+i*TMAX] = lcl[0];
return;
}
define DLEN 1024
int main(int argc, char *argv)
{
float *h_flt, *d_flt;
float flt[DLEN];
int nBytes;
h_flt = &flt[0];
for (int i=0; i<DLEN; i++) h_flt[i] = i*1.e0;
nBytes = sizeof(float)*DLEN;
cudaMalloc((void**)&d_flt,nBytes);
cudaMemcpy(d_flt,h_flt,nBytes,cudaMemcpyHostToDevice);
{
dim3 dimGrd(1,1,1);
dim3 dimBlk(TMAX,1,1);
cuKernel<<<dimGrd,dimBlk>>>(d_flt,DLEN);
}
printf("The old ends are %f %f ... %f %f\n",
h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);
cudaMemcpy(h_flt,d_flt,nBytes,cudaMemcpyDeviceToHost);
printf("The new ends are %f %f ... %f %f\n",
h_flt[0],h_flt[1],h_flt[DLEN-2],h_flt[DLEN-1]);
cudaFree(d_flt);
return 0;
}
[/codebox]
Fix 1: (Just to get the code to compile)
Uncomment the hardcoded lcl[0] = -1.e0 on line 12.
Fix 2: Declare a maximum local stack size.
Replace “nTiles” on lines 9 and 11 with a number like 64 or 128.
Info:
I have an algorithm that needs a runtime-defined amount of temporary local memory per thread block. Since I have no idea how many TBs are running at once, I chose to attempt to distribute the stack across the threads in local memory. If I try to allocate all temporary arrays from the host into global memory, then I will exhaust the memory on the card.
I could set a hardcoded stack limit (like Fix 2 above) or I could allocate a handful of global arrays and let the active TBs attempt to grab them with mutex locks and atomics. Both of which I am prepared to do.
I just thought that I would post this issue in case it reveals a bug in the optimizer or something.
Thanks, All.
Anthony
The error message:
nvcc main.cu
Signal: Segmentation fault in Code_Expansion phase.
(0): Error: Signal Segmentation fault in phase Code_Expansion – processing aborted
*** Internal stack backtrace:
/opt/cuda/cuda/open64/lib//be [0x6c09bf]
/opt/cuda/cuda/open64/lib//be [0x6c1609]
/opt/cuda/cuda/open64/lib//be [0x6c0d5d]
/opt/cuda/cuda/open64/lib//be [0x6c1fa6]
/lib64/libc.so.6 [0x3bfd2301b0]
/opt/cuda/cuda/open64/lib//be [0x544449]
/opt/cuda/cuda/open64/lib//be [0x56fcec]
/opt/cuda/cuda/open64/lib//be [0x56c7bc]
/opt/cuda/cuda/open64/lib//be [0x56dc31]
/opt/cuda/cuda/open64/lib//be [0x56c7a8]
/opt/cuda/cuda/open64/lib//be [0x57132b]
/opt/cuda/cuda/open64/lib//be [0x571e46]
/opt/cuda/cuda/open64/lib//be [0x5723fc]
/opt/cuda/cuda/open64/lib//be [0x54e980]
/opt/cuda/cuda/open64/lib//be [0x405443]
/opt/cuda/cuda/open64/lib//be [0x4061f1]
/opt/cuda/cuda/open64/lib//be [0x40751d]
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3bfd21d8b4]
/opt/cuda/cuda/open64/lib//be [0x4038da]
nvopencc INTERNAL ERROR: /opt/cuda/cuda/open64/lib//be died due to signal 4
My environment:
nvcc -V
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2009 NVIDIA Corporation
Built on Thu_Jul_30_09:24:36_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221
dmesg | head -1
Linux version 2.6.18-92.1.6.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Wed Jun 25 12:38:37 EDT 2008
gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: …/configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
Other stuff:
GeForce 9600 GT, Pentium D dual-core, blah, blah, blah. I don’t think the machine is affecting the compiler.