invalid argument with kernel execution

Hi all,

In the code below I have a very simple kernel that I'd like to execute many times in a for loop. When nsweeps = 9, everything works and the program exits with no error. If I set nsweeps to anything above 9 (say, 10), the program exits with the error: invalid argument.

If I hardcode 10 kernel calls instead of looping, there is no error. I get the same behavior on a 9500GT and a GTX 275 (with -arch=sm_13). Does anybody know what's causing the invalid argument error? The FAQ says invalid argument has to do with exceeding the 256 bytes allowed for kernel arguments, but that obviously doesn't apply here, since the kernel only takes one pointer and two ints.

#include <iostream>
#include <cuda_runtime.h>

#define L 128

int width = 1024, height = 1024;
const int nsweeps = 10;
int *h_data, *d_data;

__global__ void kernel(int* data, int width, int height)
{
   int tid = blockIdx.x*blockDim.x + threadIdx.x;
   if(tid < width*height) data[tid] += 1;
}

int main()
{
   h_data = new int[width*height];
   cudaMalloc(&d_data, sizeof(int)*width*height);
   cudaMemset(d_data, 0, sizeof(int)*width*height);

   for(int i = 0; i < nsweeps; i++)
   {
      kernel<<<width*height/L + (width*height % L ? 1 : 0), L>>>(d_data, width, height);
   }

   cudaThreadSynchronize();
   std::cout << "Cleaning up with: " << cudaGetErrorString(cudaGetLastError()) << "\n";

   delete[] h_data;
   cudaFree(d_data);
}
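The listing above only queries the error state once, after the whole loop, so it can't tell which call actually failed. A rough sketch of how the failing call could be narrowed down is below. The helper name `failed` is made up for illustration, and the synchronize inside the loop is there only for diagnosis (it serializes the sweeps), so treat this as a debugging aid under those assumptions rather than a fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define L 128   // threads per block, as in the listing above

__global__ void kernel(int* data, int width, int height)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < width * height) data[tid] += 1;
}

// Hypothetical helper (not part of the CUDA API): prints the error
// string and returns true when err is anything other than cudaSuccess.
static bool failed(cudaError_t err, const char* what, int sweep)
{
    if (err == cudaSuccess) return false;
    std::printf("sweep %d, %s: %s\n", sweep, what, cudaGetErrorString(err));
    return true;
}

int main()
{
    const int width = 1024, height = 1024, nsweeps = 10;
    const int n = width * height;
    const int blocks = (n + L - 1) / L;   // same count as the ternary form above

    int* d_data = 0;
    // sweep = -1 marks calls made before the loop starts
    if (failed(cudaMalloc((void**)&d_data, sizeof(int) * n), "cudaMalloc", -1)) return 1;
    if (failed(cudaMemset(d_data, 0, sizeof(int) * n), "cudaMemset", -1)) return 1;

    for (int i = 0; i < nsweeps; i++)
    {
        kernel<<<blocks, L>>>(d_data, width, height);

        // Problems with the launch itself are typically reported
        // immediately by cudaGetLastError().
        if (failed(cudaGetLastError(), "launch", i)) break;

        // Errors raised while the kernel runs show up at the next
        // synchronization. Newer toolkits name this call
        // cudaDeviceSynchronize(); cudaThreadSynchronize() matches the
        // listing above.
        if (failed(cudaThreadSynchronize(), "execution", i)) break;
    }

    cudaFree(d_data);
    return 0;
}
```

If the message appears with sweep -1, the problem is in the allocation or memset rather than in the kernel launches; otherwise the sweep index shows which iteration first reports the error.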

I think I figured out the problem, but I don’t fully understand it.

I was getting this strange error when compiling with -O3 optimization. When I switch to -O2 or below, I have no problems.

Does anyone have an idea what -O3 could be doing to generate this error? Is it generally unsafe to compile with -O3?