 # Dealing with 2D data: understanding basic concepts

I’m building a CUDA kernel to compute the numerical N*N Jacobian of a function using finite differences; in the example I provide, it is the square function (each entry of the vector is squared). The host code allocates the matrix in linear memory, while I use two-dimensional indexing in the kernel.

The question is: how do I add a value to the diagonal of the matrix? The question refers to kernel calls of this type:

```cpp
#define N 8
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N*N + BLOCK_SIZE - 1) / BLOCK_SIZE)
...
dim3 dimGrid(NUM_BLOCKS, NUM_BLOCKS);
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
jacobian_kernel <<< dimGrid, dimBlock >>> (...);
...
```

This splits the computation into blocks, instead of calling

```cpp
jacobian_kernel <<< N, N >>> ();
```

because N is going to be large.

Here is the kernel; a minimal example of the code is available at https://drive.google.com/open?id=0B3-mzH5aNz_FcG1nNzN3RENsRmc

```cpp
template <typename T>
__global__ void jacobian_kernel(
    T * J,
    const T t0,
    const T tn,
    const T h,
    const T * u0,
    const T * un,
    const T * un_old)
{
    T cgamma = 2 - sqrtf(2);
    const unsigned int t = threadIdx.x;
    const unsigned int b = blockIdx.x;
    const unsigned int tid = t + b * blockDim.x;
    /*__shared__*/ T temp_sx[BLOCK_SIZE][BLOCK_SIZE];
    /*__shared__*/ T temp_dx[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ T sm_temp_du[BLOCK_SIZE];

    T * temp_du = sm_temp_du; // the array decays to T*; &sm_temp_du would be T(*)[BLOCK_SIZE]

    if (tid < N)
    {
        // copy un into the current row of both working matrices
        temp_sx[b][t] = un[t];
        temp_dx[b][t] = un[t];

        if (t == b) // diagonal element of this block
        {
            if (tn == t0)
            {
                temp_du[t] = u0[t] * 0.001;

                temp_sx[b][t] += temp_du[t]; // (*)
                temp_dx[b][t] -= temp_du[t];

                // guard against perturbed values too close to zero
                temp_sx[b][t] += ( abs(temp_sx[b][t]) < 10e-6 ? 0.1 : 0 );
                temp_dx[b][t] += ( abs(temp_dx[b][t]) < 10e-6 ? 0.1 : 0 );

                temp_sx[b][t] = ( temp_sx[b][t] == 0 ? 0.1 : temp_sx[b][t] );
                temp_dx[b][t] = ( temp_dx[b][t] == 0 ? 0.1 : temp_dx[b][t] );
            }
            else
            {
                temp_du[t] = MAX( un[t] - un_old[t], 10e-6 );
                temp_sx[b][t] += temp_du[t];
                temp_dx[b][t] -= temp_du[t];
            }
        }

        // J = f(tn, un + du) and f(tn, un - du)
        d_func(tn, temp_sx[b], temp_sx[b], 1.f);
        d_func(tn, temp_dx[b], temp_dx[b], 1.f);

        // central difference
        J[tid] = (temp_sx[b][t] - temp_dx[b][t]) / (2 * temp_du[t]);

        //J[tid] *= -h * cgamma / 2;
        //J[tid] += ( t == b ? 1 : 0 );
        //J[tid] = temp_J[tid];
    }
}
```

The general procedure for computing the Jacobian is:

1. Copy `un` into every row of `temp_sx` and `temp_dx`
2. Compute `du` as a 0.001-magnitude perturbation of `u0`
3. Add `du` to the diagonal of `temp_sx` and subtract `du` from the diagonal of `temp_dx`
4. Apply the square function to each entry of `temp_sx` and `temp_dx`
5. Subtract them and divide every entry by `2*du`

This procedure can be summarized as (f(un + du*e_i) - f(un - du*e_i)) / (2*du), where e_i is the i-th canonical basis vector.

My problem is adding `du` to the diagonals of `temp_sx` and `temp_dx`, as I tried in the line marked (*). How can I achieve that?

In the kernel, my attempt to address the diagonal was

```cpp
if (threadIdx.x == blockIdx.x) { ... }
```

Why isn’t this correct? With this launch configuration it seems to evaluate to true only when both indices are 0, so `du` is the only value actually written and the rest of the matrix becomes `nan`. Note that this approach worked with the first version of my code, where I launched the kernel with `<<<N, N>>>`.

What statement should I put there that works even if the matrices are split into blocks and threads?

Let me know if I should add any other information. Thank you.