kernel only executes successfully once, then cudaMemcpy segfaults

Hi all-

Looking for some advice from the gurus…I have a host function that repeatedly calls a kernel. The kernel is executed once at the start of the program, then iteratively in a for-loop. The basic flow is as follows:

/////////////////////////////////////
// initial call
cudaMalloc( (void**)&in, sizeof(in) );
cudaMemcpy( in_d, in, sizeof(in), cudaMemcpyHostToDevice );
cudaMalloc( (void**)&out, sizeof(out) );
cudaMemcpy( out_d, out, sizeof(out), cudaMemcpyHostToDevice );
dim3 dimBlock( BLOCKSIZE );
dim3 dimGrid( GRIDSIZE );
my_kernel<<<dimGrid,dimBlock>>>( in_d, out_d );
cudaMemcpy( out, out_d, sizeof(out), cudaMemcpyDeviceToHost );
cudaFree(in_d);
cudaFree(out_d);

// do something…
result[0] = func(out);

// iterative call
for ( i=0; i<n; i++ ) {
    cudaMalloc( (void**)&in, sizeof(in) );
    cudaMemcpy( in_d, in, sizeof(in), cudaMemcpyHostToDevice ); // **** segfaults here ****
    cudaMalloc( (void**)&out, sizeof(out) );
    cudaMemcpy( out_d, out, sizeof(out), cudaMemcpyHostToDevice );
    dim3 dimBlock( BLOCKSIZE );
    dim3 dimGrid( GRIDSIZE );
    my_kernel<<<dimGrid,dimBlock>>>( in_d, out_d );
    cudaMemcpy( out, out_d, sizeof(out), cudaMemcpyDeviceToHost );
    cudaFree(in_d);
    cudaFree(out_d);
    // do something…
    result[i] = func(out);
}
/////////////////////////////////////

Under gdb I get the following backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x000000000080b490 in cudaMemcpy () from /usr/local/cuda/lib/libcudart.so
Missing separate debuginfos, use: debuginfo-install gcc.x86_64 glibc.x86_64 zlib.x86_64
(gdb) where
#0 0x000000000080b490 in cudaMemcpy () from /usr/local/cuda/lib/libcudart.so
#1 0x000000000042060f in ga_host (obs=0x1001100, qs=0x1001000, wv=0x1000f00, xopt=0x7fffc7869b90, fopt=0x7fffc7869d54) at ga_host.cu:278
#2 0x0000000000414064 in invert (proc_id=1, scanline=278, pvec=0x7fffc7869de0) at ga_main.cu:494
#3 0x0000000000414503 in main () at ga_main.cu:70

I can’t seem to figure out why it would crap out only on the first time through the loop, especially since I explicitly free all the device memory I allocate and copy through the host. I’d appreciate any input whatsoever…I’m so close to getting my first CUDA code up and running I can taste it. Thanks!

Sorry, I am not a guru. ;)
First, you call cudaMalloc on the pointer “in”, but in_d seems to be your device pointer. Where did you allocate it?
Besides, cudaMalloc only allocates device memory; you should use new or malloc for host memory.

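Also, before the second call you have already deallocated the memory with cudaFree(), and inside the loop cudaMalloc is again called on “in” rather than in_d, so in_d is never re-allocated.
I think this is the problem.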
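
To make that concrete, here is a rough sketch of how one allocate/copy/launch/copy-back/free cycle could look. It rests on several assumptions that are not in the original post: the buffers hold n floats, the kernel body is a placeholder (the real my_kernel is not shown), the block size of 256 stands in for BLOCKSIZE/GRIDSIZE, and run_once is a made-up helper name. The key points are that cudaMalloc is given the address of the device pointer (in_d, out_d), the host buffers (in, out) come from malloc or new and are left alone, and the size is the byte count of the data rather than sizeof of a pointer.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder kernel: the real my_kernel body is not shown in the thread.
__global__ void my_kernel(const float *in_d, float *out_d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out_d[idx] = 2.0f * in_d[idx];
}

// Hypothetical helper: one full allocate/copy/launch/copy-back/free cycle.
// `in` and `out` are HOST buffers of n floats (n > 0), allocated by the caller.
cudaError_t run_once(const float *in, float *out, int n)
{
    float *in_d = NULL, *out_d = NULL;
    size_t bytes = n * sizeof(float);   // size of the data, not sizeof(pointer)
    cudaError_t err;

    // Allocate the DEVICE pointers; the host pointers are never passed to cudaMalloc.
    err = cudaMalloc((void **)&in_d, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&out_d, bytes);
    if (err != cudaSuccess) { cudaFree(in_d); return err; }

    err = cudaMemcpy(in_d, in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;                          // assumed block size
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
        my_kernel<<<blocks, threads>>>(in_d, out_d, n);
        err = cudaGetLastError();                   // catches launch failures
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(out, out_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(in_d);
    cudaFree(out_d);
    return err;
}
```

With something like this, both the initial call and each loop iteration could call run_once(in, out, n) and then do result[i] = func(out). If the sizes do not change between iterations, allocating in_d and out_d once before the loop and freeing them once afterwards would avoid the repeated cudaMalloc/cudaFree.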

You have several problems. Some of your CUDA calls are probably returning error codes; check that first.
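
One common way to check is to wrap every runtime call in a small macro. The sketch below is only an illustration: CUDA_CHECK and checked_roundtrip are made-up names, not part of the CUDA toolkit, and the round trip copies a single float just to show the wrapping.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro (the name is illustrative): print the failing
// call with its file and line, then exit, if a runtime call does not
// return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s failed: %s\n",                 \
                    __FILE__, __LINE__, #call,                        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example use: copy one float to the device and back, checking every call.
// Returns 1 if the value survived the round trip, 0 otherwise.
int checked_roundtrip(void)
{
    float h = 3.0f, back = 0.0f;
    float *d = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d, sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d, &h, sizeof(float), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(&back, d, sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d));
    return back == h;
}
```

A kernel launch does not return an error code itself, so after my_kernel<<<...>>>(...) you would typically add CUDA_CHECK(cudaGetLastError()); to pick up launch failures.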

Does the first time (outside the loop) give you correct results?