Having a problem with simple CUDA code (code debug)

I wrote a simple program involving a few arithmetic calculations to check CUDA performance on a GPU.

I run this code on a Tesla cluster. The details of the cluster are as follows:
Each Tesla S1070 1U chassis contains four (4) nVidia Tesla (C1060) GPUs and 16 GB of available RAM (4 GB per GPU).

My code is attached. You may compile it on your machine.

The CUDA kernel basically sums nine weights (wt_d[9]); the sum must equal 1.0.
The sum is saved in three variables: rho[idx], ux[idx] and uy[idx]. All of these values are written in 3 columns to data.dat.
The domain size is given by LX and LY in the host code, which you can change to see that the code works fine for a smaller domain.
I was hoping to use more threads for a larger domain size, but the code doesn't give correct answers when I increase the block size above 6.
f, rho, ux and uy are 1-D arrays. I close the file and free all allocated memory at the end of the code.

The result is fine (equal to 1.0), as expected, when the block size (line 69 in test.cu) is less than or equal to 6. When I change the block size to 64 (or any value greater than 6) to cut down on computation time, the result is not always equal to 1.0.

Also, the result changes every time you run the executable with a block size > 6.

I would appreciate any help or suggestions in this regard.

test.cu (2.5 KB)

I tested your code on my Tesla C1060 and GTX 295 under CUDA 2.3, driver 190.38.

It works fine for block size = 64.

What is your platform, and how do you compile your code?

Can you specify which locations of rho, ux and uy are wrong?

Sheng Chien,

Thanks for running my code on your machine. There is no specific location (node #) where the error occurs. Also, the magnitude and location of the error change if you run the same executable several times. Please run the executable 5-6 times, or change LX and LY in the host code, and scroll through the data.dat file to look for any value not equal to 1.0. Let me know how it goes.
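Instead of scrolling through data.dat by hand, the check can be automated. The following is a minimal sketch, assuming the file holds only whitespace-separated numbers in three columns (rho, ux, uy) with no header line. The function names and the tolerance of 1e-5 are assumptions chosen for illustration; they are not taken from test.cu.

```cpp
#include <cassert>
#include <cmath>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// True when v is not within tol of the expected value 1.0.
bool isOff(double v, double tol) {
    return !(std::fabs(v - 1.0) <= tol);
}

// Reads rows of three numbers (rho, ux, uy) and returns the 0-based
// numbers of the rows in which at least one column is off.
// Reading stops at the first token that is not a number.
std::vector<long> findBadRows(std::istream& in, double tol = 1e-5) {
    assert(tol >= 0.0);
    std::vector<long> bad;
    double rho, ux, uy;
    for (long row = 0; in >> rho >> ux >> uy; ++row) {
        if (isOff(rho, tol) || isOff(ux, tol) || isOff(uy, tol)) {
            bad.push_back(row);
        }
    }
    return bad;
}

// Convenience wrapper for a file on disk, e.g. "data.dat".
std::vector<long> findBadRowsInFile(const std::string& path) {
    std::ifstream in(path.c_str());
    return findBadRows(in);
}

// Convenience wrapper for in-memory text, handy for quick checks.
std::vector<long> findBadRowsInText(const std::string& text) {
    std::istringstream in(text);
    return findBadRows(in);
}
```

Calling `findBadRowsInFile("data.dat")` after each run returns the rows that need a closer look; an empty result means every value was within the tolerance of 1.0.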

I am working on a Linux cluster, and I have attached the Makefile I use to compile.


Makefile.txt (201 Bytes)

Dear Shadab:

I noticed a mismatch in your code.

The prototype of the kernel function is

__global__ void incrementArrayOnDevice(float *f, float *rho, float *ux, float *uy,
                                       int N, int LX, int LY)

and the order of the arguments in the call is

incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, rho_d, ux_d, uy_d,
                                                   LX, LY, N);

I think you should change the call to

incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, rho_d, ux_d, uy_d,
                                                   N, LX, LY);

I have no idea why the program works on my card.

Please check this point.
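The mix-up above compiles without any warning because N, LX and LY are all of type int, so the compiler cannot tell them apart by position. The following is a minimal host-only sketch of the effect, written in plain C++ so that it runs without a GPU. `fillPositional` is a hypothetical stand-in for the kernel, not the code in test.cu, and the `Domain` struct with `makeDomain` is one possible guard rather than the only fix. The relation N = LX * LY is also an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side stand-in for the kernel (not the code in test.cu).
// The trailing parameters follow the order of the prototype quoted in the
// thread: N, LX, LY. It sets the first N cells to 1.0f and returns how many
// cells it wrote.
int fillPositional(std::vector<float>& rho, int N, int LX, int LY) {
    (void)LX;  // unused in this sketch
    (void)LY;
    int written = 0;
    for (int idx = 0; idx < N && idx < static_cast<int>(rho.size()); ++idx) {
        rho[idx] = 1.0f;
        ++written;
    }
    return written;
}

// Illustrative guard (an assumption, not from the thread): bundle the sizes
// in a struct so each value is passed by name instead of by position.
struct Domain {
    int N;
    int LX;
    int LY;
};

// Builds a Domain, assuming N = LX * LY.
Domain makeDomain(int LX, int LY) {
    assert(LX > 0 && LY > 0);
    Domain d;
    d.LX = LX;
    d.LY = LY;
    d.N = LX * LY;
    return d;
}

// Same work as fillPositional, but the sizes cannot be swapped at the call.
int fillNamed(std::vector<float>& rho, const Domain& d) {
    return fillPositional(rho, d.N, d.LX, d.LY);
}
```

With LX = 8 and LY = 4, the swapped call `fillPositional(rho, LX, LY, N)` compiles cleanly but writes only 8 of the 32 cells, while `fillNamed(rho, makeDomain(LX, LY))` writes all 32. A plain struct like `Domain` can also be passed by value to a `__global__` kernel. How the real kernel misbehaves depends on how it uses N, LX and LY, which the thread does not show.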

Thanks, Sheng Chien, for noticing this error. My code has been working correctly since I fixed the error you pointed out.

I appreciate your help.