kernel call issue

Hi, I'm developing a C++ project that calls a CUDA kernel, and now I'm getting a segmentation fault at the beginning of the kernel call.
Here's the backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00c939f2 in ?? () from /lib/tls/i686/cmov/
(gdb) bt
#0 0x00c939f2 in ?? () from /lib/tls/i686/cmov/
#1 0x00c95009 in calloc () from /lib/tls/i686/cmov/
#2 0x00a5a9ac in _dl_allocate_tls () from /lib/
#3 0x009b3103 in pthread_create@@GLIBC_2.1 () from /lib/tls/i686/cmov/
#4 0x00891027 in ?? () from /usr/local/cuda/lib/
#5 0x0087ce38 in ?? () from /usr/local/cuda/lib/
#6 0x00874ced in cudaLaunch () from /usr/local/cuda/lib/
#7 0x0806688a in cudaLaunch (entry=0x8066ced "U\211\345\203\354\b\215E\b\211\004$\350!\353\376\377\311\303U\211\345\203\354\030\307D$\b\b")
at /usr/local/cuda/bin/…/include/cuda_runtime.h:713
#8 0x0805581d in __device_stub__Z22LayerCalculationKernelILj512EL10VectorType
0ELS0_0EEvP12struct_Layer (__par0=0x8084100)
at /tmp/tmpxft_00000a80_00000000-1_paralelLayer.cudafe1.stub.c:308
#9 0x08055832 in __wrapper__device_stub_LayerCalculationKernel<512u, FLOAT, FLOAT> (__cuda_0=@0xbffff2b0)
at /tmp/tmpxft_00000a80_00000000-1_paralelLayer.cudafe1.stub.c:312
#10 0x08066cfe in LayerCalculationKernel__entry<512u, (VectorType)0, (VectorType)0> (layer=0x8084100) at
#11 0x08066e79 in LayerCalculation<(VectorType)0, (VectorType)0> (d_layer=0x8084100, threads=512) at
#12 0x08059eb9 in LayerCalculation (d_layer=0x8084100, threads=512, inputType=FLOAT, outputType=FLOAT) at
#13 0x0806a204 in CudaLayer::calculateOutput (this=0x80794e0) at cudaLayer.cpp:74
#14 0x0806da37 in CudaNeuralNet::calculateOutput (this=0x8079498) at cudaNeuralNet.cpp:99
#15 0x0806f08e in main (argc=, argv=) at main.cpp:77

I'm compiling in emulation mode. I've attached my code for anyone who's interested.
Any help would be great.
Any suggestions about the code or possible optimizations would be welcome too.
The code tries to implement a neural network (without learning, because I'll do that with genetic algorithms). A layer can point to other layers, which keeps the network "modular" while each layer can still be processed in parallel. I also wanted to use bits for the binary and bipolar step functions (and use smaller weights).
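To give an idea of the layout without the full attachment, the layer struct is shaped roughly like the sketch below. Only struct_Layer, VectorType, and FLOAT match names visible in the backtrace; every other member name here is a simplified placeholder, and BIT/SIGN stand for the bit-based types I still want to add.

```cpp
#include <cassert>
#include <cstddef>

// FLOAT == 0 matches the (VectorType)0 seen in the backtrace;
// BIT and SIGN are placeholder names for the planned bit-based
// binary and bipolar step types.
enum VectorType { FLOAT = 0, BIT = 1, SIGN = 2 };

// Simplified sketch of struct_Layer: a layer owns its weights and
// output vector, and points at the layers that feed it, so the
// network stays modular while a single layer is processed in parallel.
struct struct_Layer {
    VectorType inputType;
    VectorType outputType;
    std::size_t numInputs;      // total size of the concatenated inputs
    std::size_t numNeurons;     // outputs of this layer
    float* weights;             // numNeurons * numInputs (FLOAT case)
    float* output;              // numNeurons
    struct_Layer** inputs;      // layers this one reads from
    std::size_t numInputLayers;
};
```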

Thanks.

OK, I've added the necessary error-handling code, and now I get the following error:

Invalid device pointer.

I suppose I'm doing something wrong with the pointers, because I have a lot of them.
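For reference, the error handling I added is just a macro wrapped around every runtime call, along the lines of the sketch below. The CUDA types are replaced by a stand-in Status enum here so the snippet is self-contained; in the real code the macro checks cudaError_t against cudaSuccess and prints cudaGetErrorString.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Stand-in for cudaError_t so this sketch compiles without the CUDA
// headers; the value of the failure code is arbitrary.
enum Status { STATUS_OK = 0, STATUS_INVALID_DEVICE_POINTER = 1 };

// Stand-in for cudaGetErrorString.
static const char* statusString(Status s) {
    return s == STATUS_OK ? "no error" : "invalid device pointer";
}

// Wrap every runtime call: report where it failed and stop immediately,
// instead of segfaulting later with no context.
#define CHECK(call)                                                      \
    do {                                                                 \
        Status err__ = (call);                                           \
        if (err__ != STATUS_OK) {                                        \
            std::fprintf(stderr, "%s:%d: %s\n",                          \
                         __FILE__, __LINE__, statusString(err__));       \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)
```

It was one of these checks (around a memory call, not the launch itself) that surfaced the "invalid device pointer" message above.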