crazy finite difference

Hi

I need to port a finite difference function from ANSI C to CUDA. I cannot even get the same code to run on the device without parallelization, and I have no idea why. Can anyone help? In some cases the result of the finite difference came out as zero (something like a sum of (0.5 * 0.9) terms evaluating to 0.0), which does not make any sense.

This is the code of the C function, which works perfectly:

void DiffX_CPU(float* U, float* Ux, int N, int alpha, float* stencils, int rank)
// CPU routine to compute spatial derivative in one space dimension.
{
    // index in output vector Ux to be computed
    int row;
    for (row = 0; row < N; ++row)
    {
        float value = 0.0f;
        // Compute dot-product between FDM stencil weights and input vector U
        int diff = 0; // diff is used for automatically taking one-sided differences near the boundaries
        if (row < alpha)
            diff = alpha - row;
        else if (row > N-1-alpha)   // row > Nx-3, Nx-2, Nx-1
            diff = N-1-alpha-row;
        int tmp  = (alpha-diff)*rank + alpha;
        int tmp2 = row + diff;
        int i;
        for (i = -alpha; i < alpha+1; ++i)
            value += U[tmp2+i] * stencils[tmp+i];
        // Store computed approximation
        Ux[row] = value;
    }
}
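For reference, here is a small self-contained sketch of how this routine can be called on the host. The sizes and weights below are just an example I put together for testing (alpha = 1, rank = 3, second-order first-derivative weights, U(x) = 2x), not my real setup; the only point is the layout of stencils as 2*alpha+1 rows of rank weights, which is how the (alpha-diff)*rank + alpha + i indexing reads it:

#include <stdio.h>

void DiffX_CPU(float* U, float* Ux, int N, int alpha, float* stencils, int rank);

int main(void)
{
    // Example sizes only: 3-point stencil (alpha = 1, rank = 3) on 8 grid points.
    const int N = 8, alpha = 1, rank = 3;
    const float h = 0.1f;   // grid spacing

    // One row of 'rank' weights per boundary offset, 2*alpha+1 rows in total:
    // row 0 = one-sided forward (used at row 0), row 1 = centered (interior rows),
    // row 2 = one-sided backward (used at row N-1). Second-order d/dx weights.
    float stencils[9] = {
        -1.5f/h,  2.0f/h, -0.5f/h,
        -0.5f/h,  0.0f,    0.5f/h,
         0.5f/h, -2.0f/h,  1.5f/h
    };

    float U[8], Ux[8];
    for (int n = 0; n < N; ++n)
        U[n] = 2.0f * n * h;                 // U(x) = 2x, so dU/dx should be 2 everywhere

    DiffX_CPU(U, Ux, N, alpha, stencils, rank);

    for (int n = 0; n < N; ++n)
        printf("Ux[%d] = %f\n", n, Ux[n]);   // expect 2.000000 for every entry
    return 0;
}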

And this is the CUDA code, still without any parallelization:

__global__ void DiffX_GPU(float* d_U, float* d_Ux, int N, int alpha, float* d_stencils, int rank)
{
    // indices
    // const int b_i = threadIdx.x;
    // const int b_j = blockIdx.y*blockDim.y + threadIdx.y;
    // const int n   = b_i * N + b_j;
    int row;
    for (row = 0; row < N; ++row)
    {
        float value = 0.0f;
        // Compute dot-product between FDM stencil weights and input vector U
        int diff = 0; // diff is used for automatically taking one-sided differences near the boundaries
        if (row < alpha)
            diff = alpha - row;
        else if (row > N-1-alpha)   // row > Nx-3, Nx-2, Nx-1
            diff = N-1-alpha-row;
        int tmp  = (alpha-diff)*rank + alpha;
        int tmp2 = row + diff;
        int i;
        for (i = -alpha; i < alpha+1; ++i)
            value += d_U[tmp2+i] * d_stencils[tmp+i];
        // Store computed approximation
        d_Ux[row] = value;
    }
}

int main......
...
  // Allocate space on device
  float *d_U, *d_Ux, *d_stencils;
  cudaMalloc((void**) &d_U,        Nx*sizeof(float));    // TODO: Error checking
  cudaMalloc((void**) &d_Ux,       Nx*sizeof(float));    // TODO: Error checking
  cudaMalloc((void**) &d_stencils, rank*sizeof(float));  // TODO: Error checking

  // Copy data to device
  cudaMemcpy(d_U, U, Nx*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_stencils, stencils, rank*sizeof(float), cudaMemcpyHostToDevice);

  // Blocks and grid galore
  dim3 dimThreadBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
  dim3 dimBlockGrid(Nx/BLOCK_SIZE_X, 1);

  DiffX_GPU<<<dimBlockGrid, dimThreadBlock>>>(d_U, d_Ux, Nx, alpha, d_stencils, rank);
  cudaThreadSynchronize();

  // Copy result to host
  cudaMemcpy(Ux, d_Ux, Nx*sizeof(float), cudaMemcpyDeviceToHost);
...

Launch with <<<1, 1>>> to see if the CUDA code works…
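In the main() fragment above that just means changing the launch configuration, for example (the cudaGetLastError line is not in your original code, it is just a cheap way to see whether the launch itself fails):

DiffX_GPU<<<1, 1>>>(d_U, d_Ux, Nx, alpha, d_stencils, rank);
cudaThreadSynchronize();
printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));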

Thanks for your reply. Unfortunately no: the result is always zero, no matter what the parameters are… it is clearly some sort of overflow, but I have no idea where.

Could the problem be that I am using cudaMalloc only for the arrays and not for the integers I am passing to the device function?

The first thing you should do is already on your TODO list: check the error code returned by every CUDA function. At a minimum, use the macro trick defined in cutil:

#  define CUDA_SAFE_CALL( call) {                                            \
    cudaError err = call;                                                    \
    if( cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \
                __FILE__, __LINE__, cudaGetErrorString( err) );              \
        exit(EXIT_FAILURE);                                                  \
    } }

Then wrap each of your CUDA function calls (not the kernel launch itself) in CUDA_SAFE_CALL(). This will at least make it clearer where the problem occurs.
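Applied to the code from your first post it would look roughly like this; the cudaGetLastError line after the launch is an extra, commonly used check, since the macro cannot wrap a kernel launch:

CUDA_SAFE_CALL( cudaMalloc((void**) &d_U,        Nx*sizeof(float)) );
CUDA_SAFE_CALL( cudaMalloc((void**) &d_Ux,       Nx*sizeof(float)) );
CUDA_SAFE_CALL( cudaMalloc((void**) &d_stencils, rank*sizeof(float)) );

CUDA_SAFE_CALL( cudaMemcpy(d_U, U, Nx*sizeof(float), cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL( cudaMemcpy(d_stencils, stencils, rank*sizeof(float), cudaMemcpyHostToDevice) );

DiffX_GPU<<<dimBlockGrid, dimThreadBlock>>>(d_U, d_Ux, Nx, alpha, d_stencils, rank);
CUDA_SAFE_CALL( cudaGetLastError() );       // launch/configuration errors
CUDA_SAFE_CALL( cudaThreadSynchronize() );  // errors from the kernel itself

CUDA_SAFE_CALL( cudaMemcpy(Ux, d_Ux, Nx*sizeof(float), cudaMemcpyDeviceToHost) );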

No, parameters passed by value do not require cudaMalloc, so the problem has to be elsewhere.
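A tiny made-up example to illustrate the point (the kernel and names are mine, not from your code): only the array needs cudaMalloc, the scalars go straight into the launch.

#include <cstdio>

// Toy kernel: the int 'scale' arrives by value, no cudaMalloc needed for it.
__global__ void scale_kernel(float* d_data, int n, int scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= scale;
}

int main()
{
    const int n = 4;
    float h_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};

    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));   // only the array lives on the device
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_kernel<<<1, n>>>(d_data, n, 10);            // scalars passed directly by value
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%f\n", h_data[i]);                    // expect 10, 20, 30, 40
    cudaFree(d_data);
    return 0;
}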