Dealing with 2D data: understanding basic concepts

I’m building a CUDA kernel to compute the numerical N×N Jacobian of a function using finite differences; in the example provided, the function is the square function (each entry of the vector is squared). The host code allocates the matrix in linear memory, while the kernel uses 2-dimensional indexing.

The question is: how do I add a value to the diagonal of the matrix? The question refers to this kind of kernel call:

#define N 8
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N*N + BLOCK_SIZE - 1)/ BLOCK_SIZE)
...
dim3 dimGrid(NUM_BLOCKS, NUM_BLOCKS); 
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 
jacobian_kernel <<< dimGrid, dimBlock >>> (...);
...

This way I split the computation into blocks, instead of calling

jacobian_kernel <<< N, N >>> ();

because N is going to be large.
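For context, with a two-dimensional launch like the one above, each thread's global matrix coordinates come from combining both index components. A minimal host-side sketch of that arithmetic, with hypothetical helper names (inside a kernel this would use `blockIdx`, `blockDim` and `threadIdx` directly):

```cpp
// Host-side model of the index arithmetic of a 2D CUDA launch.
// In a kernel the equivalent would be:
//   unsigned row = blockIdx.y * blockDim.y + threadIdx.y;
//   unsigned col = blockIdx.x * blockDim.x + threadIdx.x;
// A thread lies on the matrix diagonal when row == col.
unsigned global_row(unsigned blockIdxY, unsigned blockDimY, unsigned threadIdxY)
{
    return blockIdxY * blockDimY + threadIdxY;
}

unsigned global_col(unsigned blockIdxX, unsigned blockDimX, unsigned threadIdxX)
{
    return blockIdxX * blockDimX + threadIdxX;
}
```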

Here is the kernel; a minimal example of the full code is available at https://drive.google.com/open?id=0B3-mzH5aNz_FcG1nNzN3RENsRmc

template <typename T>
__global__ void jacobian_kernel (
                T * J,
                const T t0, 
                const T tn,
                const T h,
                const T * u0, 
                const T * un, 
                const T * un_old)
{
    T cgamma = 2 - sqrtf(2);
    const unsigned int t = threadIdx.x;
    const unsigned int b = blockIdx.x;
    const unsigned int tid = t + b * blockDim.x;
    /*__shared__*/ T temp_sx[BLOCK_SIZE][BLOCK_SIZE];
    /*__shared__*/ T temp_dx[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ T sm_temp_du[BLOCK_SIZE];

    T* temp_du = &sm_temp_du[0];

    if (tid < N )
    {
        temp_sx[b][t] = un[t]; 
        temp_dx[b][t] = un[t];

        if ( t == b )
        {
            if ( tn == t0 )
            {   
                temp_du[t] = u0[t]*0.001; 

                temp_sx[b][t] += temp_du[t]; //(*)
                temp_dx[b][t] -= temp_du[t];

                temp_sx[b][t] += ( abs( temp_sx[b][t] ) < 10e-6 ? 0.1 : 0 );
                temp_dx[b][t] += ( abs( temp_dx[b][t] ) < 10e-6 ? 0.1 : 0 );

                temp_sx[b][t] = ( temp_sx[b][t] == 0 ? 0.1 : temp_sx[b][t] );
                temp_dx[b][t] = ( temp_dx[b][t] == 0 ? 0.1 : temp_dx[b][t] );

            }

            else
            {
                temp_du[t] = MAX( un[t] - un_old[t], 10e-6 );
                temp_sx[b][t] += temp_du[t];
                temp_dx[b][t] -= temp_du[t];
            }
        }
        __syncthreads();

        //J = f(tn, un + du)
        d_func(tn, (temp_sx[b]), (temp_sx[b]), 1.f);
        d_func(tn, (temp_dx[b]), (temp_dx[b]), 1.f);

        __syncthreads();
        J[tid] = (temp_sx[b][t] - temp_dx[b][t]) * powf((2 * temp_du[t]), -1);

        //J[tid]*= - h*cgamma/2;
        //J[tid]+= ( t == b ? 1 : 0);
        //J[tid] = temp_J[tid];
    }
}

The general procedure for computing the Jacobian is:

  1. Copy un into every row of temp_sx and temp_dx
  2. Compute du as a small perturbation of u0 (u0*0.001 in the code)
  3. Sum du to the diagonal of temp_sx, subtract du from the diagonal of temp_dx
  4. Compute the square function on each entry of temp_sx and temp_dx
  5. Subtract them and divide every entry by 2*du
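For reference, the steps above can be checked against a plain host-side sketch of the same central-difference procedure (function and variable names here are illustrative, not taken from the CUDA code; du is passed in as a fixed perturbation):

```cpp
#include <cstddef>
#include <vector>

// Central-difference Jacobian of f(u)_i = u_i^2, following steps 1-5.
// For this f the exact Jacobian is diagonal with J_ii = 2 * u_i, and
// the central difference reproduces it exactly (f is quadratic).
std::vector<double> jacobian_square(const std::vector<double>& un, double du)
{
    const std::size_t n = un.size();
    std::vector<double> J(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {       // row: component of f
        for (std::size_t j = 0; j < n; ++j) {   // column: perturbed component of u
            // Entry i of un + du*e_j and un - du*e_j (steps 1-3)
            double sx = un[i] + (i == j ? du : 0.0);
            double dx = un[i] - (i == j ? du : 0.0);
            // Apply f, subtract, divide by 2*du (steps 4-5)
            J[i * n + j] = (sx * sx - dx * dx) / (2.0 * du);
        }
    }
    return J;
}
```

Note that only the diagonal entries see the perturbation, which is exactly why the kernel needs a correct "am I on the diagonal?" test per thread.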

This procedure can be summarized as (f(un + du*e_i) - f(un - du*e_i)) / (2*du), where e_i is the i-th canonical basis vector.

My problem is adding du to the diagonals of temp_sx and temp_dx, as I tried at (*). How can I achieve that?

In my attempt to address the diagonal inside the kernel I used

if(threadIdx.x == blockIdx.x){...}

Why isn’t this correct? It only evaluates to true when both indices are 0, so temp_du[0] is the only entry that gets a numerical value and the rest of the matrix becomes NaN. Note that this approach worked with the first version of the code, where I called the kernel with <<<N, N>>>.

What statement should I use there so that it works even when the matrix is split into blocks and threads?

Let me know if I should add any other information. Thank you.

Cross-posted at:

http://stackoverflow.com/questions/41764331/dealing-with-matrices-in-cuda-understanding-basic-concepts