I wrote a simple program involving a few arithmetic calculations to check CUDA performance on the GPU.
I run this code on a Tesla cluster. Here are the details of the cluster:
Each Tesla S1070 1U chassis contains four (4) nVidia Tesla (C1060) GPUs and 16 GB of available RAM (4 GB per GPU).
My code is attached; you can compile it on your machine.
The CUDA kernel basically sums up nine weights (wt_d[9]); the sum must equal 1.0 (see the sketch below).
The sum is stored in three arrays, rho[idx], ux[idx], and uy[idx], which are written as three columns to data.dat.
The domain size is set by LX and LY in the host code; you can change them to verify that the code works fine for a smaller domain.
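In case the attachment does not come through, the kernel does essentially the following. This is a simplified sketch, not the exact attached code; wt_d and the output array names match mine, but the kernel name and signature are assumptions:

// Simplified sketch of the kernel (assumed name/signature; wt_d holds the nine weights).
__global__ void check_kernel(float *rho, float *ux, float *uy,
                             const float *wt_d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {               // skip threads beyond the LX*LY domain
        float sum = 0.0f;
        for (int k = 0; k < 9; k++)
            sum += wt_d[k];      // weights are chosen so the sum is 1.0
        rho[idx] = sum;
        ux[idx]  = sum;
        uy[idx]  = sum;
    }
}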
I was hoping to use more threads for larger domain sizes, but the code does not give correct answers when I increase the block size beyond 6.
f, rho, ux, and uy are 1-D arrays. I close the file and free all allocated memory at the end of the code.
The result is correct (equal to 1.0), as expected, when the block size (line 69 in test.cu) is 6 or less. When I change the block size to 64 (or any value greater than 6) to cut computation time, the result is not always equal to 1.0.
Also, for block size > 6 the result changes every time you run the executable.
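For reference, the launch in my host code works roughly like this. Again a simplified sketch: LX, LY, and the block size are from my code, but the device pointer names and the rounded-up grid size show how I intend it to work, not a verbatim copy:

int n = LX * LY;                 // total number of domain nodes
int blockSize = 64;              // line 69 in test.cu; correct results only when <= 6
int nBlocks = (n + blockSize - 1) / blockSize;   // round up so every node gets a thread
check_kernel<<<nBlocks, blockSize>>>(rho_d, ux_d, uy_d, wt_d, n);
cudaThreadSynchronize();         // wait for the kernel before copying results back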
I would appreciate any help or suggestions.
Thanks for running my code on your machine. There is no specific location (node #) where the error occurs; the magnitude and location of the error also change when you run the same executable several times. Please run the executable 5-6 times, or change LX and LY in the host code, and scroll through the data.dat file to look for any value not equal to 1.0 (or use a small checker like the one below). Let me know how it goes.
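If scrolling by hand is tedious, a small checker along these lines (plain C, a hypothetical helper, not part of the attached code) prints only the rows of data.dat that deviate from 1.0:

/* Hypothetical checker: report any row of data.dat whose values are not 1.0. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    FILE *fp = fopen("data.dat", "r");
    double rho, ux, uy;
    long row = 0;
    if (!fp) { perror("data.dat"); return 1; }
    while (fscanf(fp, "%lf %lf %lf", &rho, &ux, &uy) == 3) {
        if (fabs(rho - 1.0) > 1e-6 || fabs(ux - 1.0) > 1e-6 || fabs(uy - 1.0) > 1e-6)
            printf("row %ld: %g %g %g\n", row, rho, ux, uy);
        row++;
    }
    fclose(fp);
    return 0;
}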
I am working on a Linux cluster and have attached the Makefile used to compile.
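In case the attachment is missing, the Makefile is essentially equivalent to something like this (a minimal sketch, not the attached file; the flags are assumptions, except that the Tesla C1060 is compute capability 1.3):

# Minimal sketch of a Makefile for test.cu (assumed flags, not the attached file).
NVCC      = nvcc
NVCCFLAGS = -O2 -arch=sm_13   # Tesla C1060 is compute capability 1.3

test: test.cu
	$(NVCC) $(NVCCFLAGS) -o test test.cu

clean:
	rm -f test data.dat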