Strange error when reading global memory


I’m using GTX 280 (Compute Capability 1.3) on Windows Vista 64 bit.

I have a strange error that I cannot understand. I would appreciate any comments.

// in host function
// ... allocated about 150 MB global memory
float *check_array;    // for debug
int size_check = 500 * sizeof(float);
cudaMalloc((void**)&check_array, size_check);
dim3 dimBlock(256, 1);
myKernel<<<1, dimBlock>>>(check_array, param);    // param is a parameter object.
// check the result passed back through "check_array"

So far, this is fine.

However, my problem occurs at:

__global__ void myKernel(float *check_array, Parameter *param) {    // param is a set of pointers that point to global memory
    int na = 100;
    int nd = 80;
    int i, j;
    int pi = 1;
    int n;
    float dis1;
    float dis2;
    float minDis;
    float mcovar[3][3];
    float v[3][3];
    float x[3], y[3], z[3];
    // The above are the only local declarations I have.

    // ... some computations using "param" and above variables

    /********** Error location **********/
    if (threadIdx.x == 0) {
        check_array[0] = mcovar[0][0];
        check_array[1] = mcovar[0][1];
        check_array[2] = mcovar[0][2];
        check_array[3] = mcovar[1][0];
        check_array[4] = mcovar[1][1];
        check_array[5] = mcovar[1][2];
        check_array[6] = mcovar[2][0];
        check_array[7] = mcovar[2][1];
        check_array[8] = mcovar[2][2];
    }
    __syncthreads();
}

In the computation part, I have no “return” statement. So I believe every thread will reach “__syncthreads()”.

If I COMMENT OUT the assignments of “check_array”, it works fine.

Even if I assign other values (e.g., 0 or 1) to “check_array”, it works fine.

But with the above code (with mcovar), the kernel does NOT run. I say this because there is actually another assignment to check_array (for debugging), and in this case the array does not contain that debug value.

Does anybody know what the problem is?

Please help me… I have spent about 2 days on this problem.


How do you know the kernel does not run?

Are you checking for errors after the kernel launch?
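
A minimal sketch of what such a check could look like. The kernel `demoKernel` and the helper `launchAndCheck` are hypothetical stand-ins, since the full `myKernel` is not shown. `cudaDeviceSynchronize()` assumes CUDA 4.0 or newer; on older toolkits the equivalent call was `cudaThreadSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: writes a debug marker from thread 0.
__global__ void demoKernel(float *check_array) {
    if (threadIdx.x == 0) {
        check_array[10] = 12345.0f;  // debug marker
    }
}

// Launches the kernel and reports launch-time and run-time errors.
cudaError_t launchAndCheck(float *d_check) {
    demoKernel<<<1, 256>>>(d_check);

    // Catches launch failures such as an invalid configuration.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Waits for the kernel to finish; catches faults raised while it ran.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

If `launchAndCheck` returns anything other than `cudaSuccess`, the contents of the debug array copied back afterwards should not be trusted.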

I noticed you are having thread 0 read mcovar. Is mcovar set by other threads? If this is the case, you may need a __syncthreads() before the threadIdx.x == 0 conditional to ensure that the data from other threads is committed.
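
To illustrate the barrier placement, here is a small sketch. It assumes the matrix is shared by the block (declared `__shared__`), which is not what the posted kernel does; the kernel name `sharedThenRead` is made up for the example, and the block is assumed to have at least 9 threads.

```cuda
#include <cuda_runtime.h>

// Nine threads each write one entry of a block-shared 3x3 matrix;
// thread 0 then copies the whole matrix to global memory.
__global__ void sharedThenRead(float *check_array) {
    __shared__ float m[3][3];

    if (threadIdx.x < 9) {
        m[threadIdx.x / 3][threadIdx.x % 3] = (float)threadIdx.x;
    }

    // Every thread in the block reaches this barrier. After it, the
    // writes made to m by the other threads are visible to thread 0.
    __syncthreads();

    if (threadIdx.x == 0) {
        for (int k = 0; k < 9; ++k) {
            check_array[k] = m[k / 3][k % 3];
        }
    }
}
```

Note that the barrier sits outside both conditionals, so no thread can skip it.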

After the kernel launch, I copied check_array (in global memory) back to host memory. Then I printed out its contents.

What I first expected from the code is that “mcovar” is allocated for each thread. In other words, each thread has its own “mcovar”. Am I wrong?

But the problem is that even when I allocated mcovar as “__device__ float mcovar[BLOCK_SIZE][3][3]”, I couldn’t get its values in the kernel.

BTW, whether “mcovar” is set by other threads or not, the code gets past the part that writes to it.

The only problem, I guess, is the section that reads “mcovar”, as I pointed out in the first post. If I comment out the section that writes “mcovar” into check_array, then the kernel works fine. For example, if I assign “check_array[10] = 12345” (a debug value) without that section, I can see the debug value 12345 after the kernel launch.

Note that the index is “10” for the debug value. If the writes of “mcovar” to indices 0 through 8 are included, then I don’t see the value 12345 at index 10.

This is really strange…

P.S. The website you pointed me to shows a debugging method very similar to mine.

Sorry, I missed the fact that mcovar was a local variable (I guess it was too early in the morning for me). Each thread should in fact have its own ‘mcovar’.
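
A small sketch of that per-thread behaviour, for illustration only. The kernel name `perThreadArray` and the fill pattern are made up, and `out` is assumed to hold at least `blockDim.x` floats.

```cuda
#include <cuda_runtime.h>

// Each thread fills its own private 3x3 array and writes one element out.
__global__ void perThreadArray(float *out) {
    float m[3][3];  // private to each thread, like mcovar in the posted kernel

    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            m[r][c] = threadIdx.x * 10.0f + r * 3 + c;
        }
    }

    // No __syncthreads() is needed here: no thread reads another thread's m.
    out[threadIdx.x] = m[2][2];
}
```

Launched with one block, thread `t` should write `t * 10 + 8` to `out[t]`, which shows that the values never mix between threads.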

There might be something strange in the “//some computations” part that is messing with your code.
Do you think it would be possible to post that part of the code?

If not, try moving the if (threadIdx.x==0) part further up in your code and placing a ‘return’ after it to disable the rest until you get meaningful results. That can help you locate the problem.
Remember, however, that the compiler will eliminate any parts of the code that ultimately produce no results, e.g. assignments to local variables that are never read afterwards.
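
One way to sketch this approach, as a variation that uses a runtime flag instead of editing the source each time. The kernel `stagedKernel`, the `input` array (at least 9 floats), the `stopAfterStage1` flag, and both stages are hypothetical stand-ins for the real computation; `check_array` is assumed to hold at least 12 floats.

```cuda
#include <cuda_runtime.h>

__global__ void stagedKernel(float *check_array, const float *input,
                             bool stopAfterStage1) {
    float mcovar[3][3];

    // Stage 1: stand-in for the first part of the real computation.
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            mcovar[r][c] = 2.0f * input[r * 3 + c];
        }
    }

    // Debug checkpoint, moved up to sit right after stage 1. Because
    // mcovar is read here, stage 1 contributes to an observable result.
    if (threadIdx.x == 0) {
        for (int k = 0; k < 9; ++k) {
            check_array[k] = mcovar[k / 3][k % 3];
        }
        check_array[10] = 12345.0f;  // marker: the kernel got this far
    }

    // The flag has the same value in every thread, so either all threads
    // return here or none do; the barrier below is never reached by only
    // part of the block.
    if (stopAfterStage1) return;

    // Stage 2: stand-in for the rest of the kernel.
    __syncthreads();
    if (threadIdx.x == 0) check_array[11] = 1.0f;
}
```

If the marker and the stage 1 values come back correctly with the flag set, the checkpoint can be moved further down and the test repeated until the values stop making sense.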