CUDA 2.3 bug? Strange compilation issue

I haven’t been able to create a simple reproduction of this, but I just spent about 2 hours fixing my code due to a bug that I think is optimization related (i.e. it happened with -O3 but not -O0). In one function, I had two for loops in succession that launched different kernels:

    dim3 blockSize(i_max, 1, 1);
    dim3 gridSize(j_max, 1);

    for (int k = 0; k < k_max; k++) {
        solve1<<<gridSize, blockSize>>>(a, b, c, k);
    }

    for (int k = k_max - 1; k >= 0; k--) {
        solve2<<<gridSize, blockSize>>>(d, e, k, k_max);
    }

When compiling with -O3 on x86-64 SUSE Linux with CUDA 2.3, I would get “invalid argument” errors from one or both of these kernel launches (which one depended on whether the data type was float or double). The variables a, b, c, d and e are arrays of floats or doubles; i_max, j_max, k, and k_max are integers. In the end, I made the error go away by declaring k once, outside the loops, instead of in each for statement:

    int k;
    dim3 blockSize(i_max, 1, 1);
    dim3 gridSize(j_max, 1);

    for (k = 0; k < k_max; k++) {
        solve1<<<gridSize, blockSize>>>(a, b, c, k);
    }

    for (k = k_max - 1; k >= 0; k--) {
        solve2<<<gridSize, blockSize>>>(d, e, k, k_max);
    }

EDIT: In a further twist, the next day the same code, after a clean rebuild, gave the invalid argument error again, and this time switching from -O3 to -O2 made it go away. I am inclined to say that I have a subtle, well-hidden memory bug somewhere in my code, but frustratingly it still looks like a compiler issue.
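One way to narrow this down would be to check every launch and print the launch dimensions whenever it fails. If host memory corruption is clobbering i_max or j_max, the bad value should show up there. The following is only a sketch under assumptions: dummyKernel is a hypothetical stand-in for solve1/solve2 (which are not shown here), and launchDimsValid, checkLaunch and launchAllChecked are helper names made up for illustration. Only cudaGetLastError, cudaGetErrorString and cudaGetDeviceProperties are real runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for solve1/solve2; the real kernels are not shown in the post.
__global__ void dummyKernel(float *a, int k) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] += (float)k;
}

// Pure host-side check: are the 1D block and grid sizes positive and within the limits?
bool launchDimsValid(int threads, int blocks, int maxThreads, int maxBlocks) {
    return threads > 0 && threads <= maxThreads &&
           blocks > 0 && blocks <= maxBlocks;
}

// Call right after a kernel launch; reports the error and the dimensions that were used.
bool checkLaunch(const char *name, dim3 grid, dim3 block, int k) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed at k=%d: %s (grid=%u,%u block=%u,%u,%u)\n",
                name, k, cudaGetErrorString(err),
                grid.x, grid.y, block.x, block.y, block.z);
        return false;
    }
    return true;
}

// Runs the forward loop from the post with checks; returns 0 on success.
// 'a' is assumed to be a device array with at least i_max * j_max elements.
int launchAllChecked(float *a, int i_max, int j_max, int k_max) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }
    if (!launchDimsValid(i_max, j_max, prop.maxThreadsPerBlock, prop.maxGridSize[0])) {
        fprintf(stderr, "bad launch dimensions: i_max=%d j_max=%d\n", i_max, j_max);
        return 1;
    }

    dim3 blockSize(i_max, 1, 1);
    dim3 gridSize(j_max, 1);
    for (int k = 0; k < k_max; k++) {
        dummyKernel<<<gridSize, blockSize>>>(a, k);
        if (!checkLaunch("dummyKernel", gridSize, blockSize, k)) {
            return 1;
        }
    }
    return 0;
}
```

If the printed dimensions look sane and the launch still fails only at -O3, that would point more towards the compiler; if they come out as garbage, it is much more likely something in my host code is overwriting them.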