Segmentation Fault -- Possibly Linked to Optimisation Level?

I’m developing a relatively large OpenCL code, and have hit a very weird problem.

My code runs to completion on a CPU (using AMD’s Stream SDK) and also on a Tesla C1060. However, when run on a Tesla C2050 (Fermi) I get a segmentation fault, and it never seems to occur in the same place twice. My first thought was that I was doing something I shouldn’t with memory, but if that were the case I would have expected the error to show up on the other two platforms as well.

Weirder still, if I set “-cl-nv-opt-level=1” in my compiler flags, the error disappears. To reiterate, the code segfaults only on the Tesla C2050, and only when the optimisation level is set to 2 or higher.
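For anyone wanting to try the same thing, here is a minimal sketch of how a build options string for a given optimisation level could be put together before calling clBuildProgram. The helper name make_build_options is made up for illustration and is not from my code, and the flag spelling (leading dash, “=” before the level) is an assumption that should be checked against NVIDIA’s compiler options documentation.

```c
#include <stdio.h>

/* Hypothetical helper (illustration only): writes the build options string
 * for the requested optimisation level into buf.
 * Returns 0 on success, -1 if the level is negative or buf is too small. */
int make_build_options(char *buf, size_t len, int opt_level)
{
    if (opt_level < 0)
        return -1;

    /* Assumed flag spelling: leading dash and "=" before the level. */
    int n = snprintf(buf, len, "-cl-nv-opt-level=%d", opt_level);

    /* snprintf reports the length it wanted; anything >= len was truncated. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Sketch of the call site (program and device are created elsewhere):
 *
 *   char opts[64];
 *   if (make_build_options(opts, sizeof opts, 1) == 0)
 *       err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
 */
```

Building the string in one place makes it easy to rebuild the same program at levels 1, 2 and 3 and see at which level the fault first appears.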

I’ve had a look at the PTX code for my kernels (using clGetProgramInfo), and the PTX for optimisation level 1 is identical to that for level 3. This suggests that the optimisation flag is affecting the PTX → GPU assembly step rather than the OpenCL → PTX compilation.
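The comparison of the two PTX dumps can be sketched as below. This assumes each dump has already been fetched (for example with clGetProgramInfo using CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES, which on NVIDIA’s implementation typically returns PTX text) and copied into a NUL-terminated buffer. The function name first_diff_line is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): compares two NUL-terminated text
 * dumps and returns the 1-based number of the first line on which they
 * differ, or 0 if the two dumps are identical. */
size_t first_diff_line(const char *a, const char *b)
{
    size_t line = 1;

    while (*a != '\0' && *b != '\0') {
        if (*a != *b)
            return line;      /* mismatch inside the current line */
        if (*a == '\n')
            line++;           /* both dumps move on to the next line */
        a++;
        b++;
    }

    /* One dump ended; they are identical only if both ended together. */
    return (*a == *b) ? 0 : line;
}
```

A result of 0 for the level 1 and level 3 dumps is what would point at the later PTX → GPU assembly step; a non-zero result gives the line at which to start reading.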

I’m working on reproducing the problem with a simpler case, but I was hoping that in the meantime somebody might have some suggestions about where the problem might lie or how I might go about debugging it. I’m running on Linux, so Nsight is a no-go.

Any help would be much appreciated; I’ve been sitting here scratching my head for most of the day!