Segmentation fault after upgrading from CUDA 1.0 to 2.0 (SUSE Enterprise packages)

I am currently running CUDA on an 8800 GTS under SLAMD64 (a 64-bit Slackware distribution), using the SUSE Linux Enterprise Desktop driver and Toolkit. When I went from using the driver

NVIDIA-Linux-x86_64-100.14.11-pkg2.run to
NVIDIA-Linux-x86_64-177.73-pkg2.run

and the toolkit:

NVIDIA_CUDA_Toolkit_1.0_sled_x86_64.run to
NVIDIA_CUDA_Toolkit_2.0_sled10sp1_x86_64.run

a CUDA routine that previously ran successfully now fails with a segmentation fault. The routine uses close to 97% of the GPU's available memory, and when I run it under valgrind I get “Conditional jump or move depends on uninitialised value” warnings in cudaMemset and cudaFree that did not occur before. Was there a change to the driver or Toolkit between 1.0 and 2.0 that might cause this error? Also, is there anything further I can do to diagnose the problem?

Thanks.

Are you checking your error codes?

How do I do that?

Uhh, look at the reference manual: all (or at least the vast majority) of the cuda* functions return a cudaError_t.
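For illustration, here is a minimal sketch of what checking those return codes could look like. The macro name CHECK_CUDA and the helper allocZeroed are hypothetical (they are not part of the toolkit); only cudaMalloc, cudaMemset, cudaGetErrorString and cudaSuccess come from the runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA toolkit): evaluate a
// runtime API call once and abort with a readable message if it failed.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, \
                    #call, cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Hypothetical example function: allocate a device buffer of n floats and
// zero it, checking the status of every runtime call along the way.
float *allocZeroed(size_t n)
{
    float *d_buf = NULL;
    CHECK_CUDA(cudaMalloc((void **)&d_buf, n * sizeof(float)));
    CHECK_CUDA(cudaMemset(d_buf, 0, n * sizeof(float)));
    return d_buf;
}
```

A caller would write something like float *d = allocZeroed(1024); and later release it with CHECK_CUDA(cudaFree(d));. Aborting on the first failure is a design choice for debugging; production code may prefer to return the error to the caller.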

I get an “unspecified launch failure” within this kernel:


#define BLOCK_SIZE_CMPT 256

__global__ void compactScan(float *cmpt, float *shft,
                            float *scan, float *data, int *iaddr)
{
    // Block index
    int b_idx = blockIdx.x;

    // Thread index
    int t_idx = threadIdx.x;

    int i = BLOCK_SIZE_CMPT*b_idx + t_idx;

    if (iaddr[i] != 0) {
        cmpt[iaddr[i]-1] = scan[i] + data[i];
        shft[iaddr[i]] = scan[i] + data[i];
    }
}

The screen also seems to flash quickly to black and back again when the error occurs.
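The kernel as posted trusts both the grid size and the values stored in iaddr. As a debugging aid, a guarded variant along the following lines might help narrow things down. This is only a sketch under assumptions: the parameters n, cmptLen, shftLen and badFlag are hypothetical additions, the kernel name compactScanChecked is made up, and the actual buffer sizes in the original program are not known from the post.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE_CMPT 256

// Guarded variant of compactScan (hypothetical, for debugging only).
//   n       : number of elements in scan, data and iaddr   (assumed)
//   cmptLen : number of floats allocated for cmpt          (assumed)
//   shftLen : number of floats allocated for shft          (assumed)
//   badFlag : device int, set to 1 if any index is out of range
__global__ void compactScanChecked(float *cmpt, float *shft,
                                   const float *scan, const float *data,
                                   const int *iaddr,
                                   int n, int cmptLen, int shftLen,
                                   int *badFlag)
{
    int i = BLOCK_SIZE_CMPT * blockIdx.x + threadIdx.x;

    // The grid may contain more threads than there are input elements.
    if (i >= n) return;

    int a = iaddr[i];
    if (a == 0) return;  // same "skip" condition as the original kernel

    // cmpt is written at a-1 and shft at a, so both must be in range.
    if (a < 1 || a > cmptLen || a >= shftLen) {
        // Plain store instead of an atomic: every offending thread writes
        // the same value, and this also works on devices without atomics.
        *badFlag = 1;
        return;
    }

    float v = scan[i] + data[i];
    cmpt[a - 1] = v;
    shft[a] = v;
}
```

If badFlag comes back as 1 after the launch (with the flag zeroed beforehand), the input contains an index that the original kernel would have written out of bounds. Such writes may go unnoticed in some runs and fault in others, depending on how the buffers happen to be laid out in memory.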

An unspecified launch failure is, more often than not, a segfault (an out-of-bounds memory access inside the kernel). Check via the error codes that your memory was actually allocated, and if that doesn’t help, compile with -deviceemu and run under valgrind.
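Since kernel launches do not return a cudaError_t themselves, checking them takes two steps. The sketch below shows one way to do it; the kernel fill and the wrapper launchFillChecked are hypothetical names used only for illustration. It assumes a toolkit that provides cudaDeviceSynchronize; with the 2.0 toolkit discussed here the equivalent call would be cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to illustrate the checking pattern.
__global__ void fill(float *out, int n, float value)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launch the kernel and report both kinds of failure. Returns cudaSuccess
// only if the launch was accepted and the kernel ran to completion.
// Assumes n > 0 so that the grid has at least one block.
cudaError_t launchFillChecked(float *d_out, int n, float value)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fill<<<blocks, threads>>>(d_out, n, value);

    // Step 1: errors detected at launch time (e.g. a bad configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Step 2: errors raised while the kernel executes (e.g. an
    // out-of-bounds access) surface only once the host waits for the device.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

Synchronizing after every launch serializes host and device, so this pattern is probably best kept to debug builds.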

All my CUDA calls are wrapped in the CUDA_SAFE_CALL() macro, and I have received no error codes during memory allocation. I compiled with -deviceemu and ran the code under valgrind; it executed successfully. The problem seems to be sensitive to how much GPU memory is in use. When I ran a case that used ~470 out of 670 MB of GPU memory, it ran successfully. With a slightly different case that used ~485 MB, it segfaulted. What should I try next? I can send the code and the input dataset if that would be helpful.
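Because the failure appears to track memory usage, one thing that might be worth logging is how much device memory the runtime itself reports as free before and after the allocations. The sketch below is an assumption-laden illustration: reportDeviceMemory is a hypothetical helper, and it relies on cudaMemGetInfo being available in the runtime API, which may not be the case for older toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print free and total device memory with a label.
// Returns the status reported by the runtime.
cudaError_t reportDeviceMemory(const char *label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: cudaMemGetInfo failed: %s\n",
                label, cudaGetErrorString(err));
        return err;
    }
    const double mb = 1024.0 * 1024.0;
    printf("%s: %.1f MB free of %.1f MB\n",
           label, freeBytes / mb, totalBytes / mb);
    return cudaSuccess;
}
```

Calling reportDeviceMemory("before alloc") and reportDeviceMemory("after alloc") around the allocation phase would show how close each test case comes to the limit. On a card that also drives the display, part of the memory is held by the display driver, so the free figure is usually lower than the nominal total.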