Hi everyone, I have a GeForce 9500 GPU (I know it’s old; my new laptop arrives within a week :D) running openSUSE 11.4 64-bit, and I have written a simple CUDA program to solve the Poisson equation on a 2D grid.
Here is the beginning of the device code:
__global__ void CudaPoissonSolve(float* rho, float* out, int numPoints, float delta)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = x + y * blockDim.x * gridDim.x;
And here’s how I call this code:
extern "C" __host__ void CudaCaller(float* rho, float* out, int numPoints, float delta)
{
    // Producing wrong results
    dim3 grids((numPoints+15)/16, (numPoints+15)/16);
    dim3 threads(16, 16);
    // Producing correct result, but very slow
    /* dim3 grids(numPoints, numPoints);
    dim3 threads(1,1);*/
    float *dev_rho, *dev_out;
    CUCHECK(cudaMalloc((void**)&dev_rho, numPoints * numPoints * sizeof(float)));
    CUCHECK(cudaMalloc((void**)&dev_out, numPoints * numPoints * sizeof(float)));
    CUCHECK(cudaMemcpy(dev_rho, rho, numPoints * numPoints * sizeof(float), cudaMemcpyHostToDevice));
    CudaPoissonSolve<<<grids,threads>>>(dev_rho, dev_out, numPoints, delta);
    CUCHECK(cudaMemcpy(out, dev_out, numPoints * numPoints * sizeof(float), cudaMemcpyDeviceToHost));
    CUCHECK(cudaFree(dev_rho));
    CUCHECK(cudaFree(dev_out));
}
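To make the two launch configurations concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) of the geometry each one produces. The struct and function names are hypothetical and not part of my program, and it assumes square blocks, as in the code above.

```cpp
// Hypothetical helper (not part of the original program): describes the
// launch geometry produced by a rounded-up grid of square blocks.
struct LaunchGeometry {
    int blocksPerSide;   // (numPoints + block - 1) / block, as in the dim3 grids(...) line
    int threadsPerSide;  // blocksPerSide * block
    int rowStride;       // blockDim.x * gridDim.x, the stride the kernel uses for idx
    int extraThreads;    // launched threads that fall outside the numPoints x numPoints grid
};

LaunchGeometry describeLaunch(int numPoints, int block) {
    LaunchGeometry g;
    g.blocksPerSide = (numPoints + block - 1) / block;  // integer division rounds the grid up
    g.threadsPerSide = g.blocksPerSide * block;
    g.rowStride = g.threadsPerSide;
    g.extraThreads = g.threadsPerSide * g.threadsPerSide - numPoints * numPoints;
    return g;
}
```

For an example size of 100 (chosen only for illustration), describeLaunch(100, 16) gives 7 blocks per side, 112 threads per side, a row stride of 112, and 2544 threads outside the 100x100 grid, while describeLaunch(100, 1) gives a row stride of 100 and no extra threads.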
I know it’s not optimal or well written; it’s just part of my learning process! Anyway, the problem is this: if I comment out the definitions of grids and threads and use the commented-out ones instead (one thread per block), the code works correctly but slowly.
With the version shown above (16x16 threads per block), it produces incorrect results. Using cuda-gdb, I figured out that the problem is in this line:
int y = threadIdx.y + blockIdx.y * blockDim.y;
It calculates the wrong y. For example, with threadIdx.y = 3, blockIdx.y = 1, and blockDim.y = 16, the result should be 3 + 16*1 = 19, but it returns only 16 (it computes the wrong y for other threads too).
By the way, the compute capability is 1.1 and the CUDA version is 3.2.
I was wondering what I’m doing wrong, since I can’t believe that the compiler produces incorrect code in such a simple case.
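For reference, the arithmetic I expect can be written as a small host-side sketch in plain C++. The function names are hypothetical stand-ins that take the CUDA built-in values as ordinary integers; this only mirrors the two expressions from the kernel and is not the kernel itself.

```cpp
// Hypothetical host-side mirror of the kernel's per-axis expression:
//   threadIdx + blockIdx * blockDim
int globalCoord(int threadIdxVal, int blockIdxVal, int blockDimVal) {
    return threadIdxVal + blockIdxVal * blockDimVal;
}

// Hypothetical host-side mirror of the kernel's flattened index:
//   x + y * blockDim.x * gridDim.x
int flatIndex(int x, int y, int blockDimX, int gridDimX) {
    return x + y * blockDimX * gridDimX;
}
```

With the values from my cuda-gdb session, globalCoord(3, 1, 16) evaluates to 19, which is the value I expected to see for y.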