Dealing with 2D data: understanding basic concepts

I’m building a CUDA kernel to compute the numerical N×N jacobian of a function using finite differences; in the example provided, it is the square function (each entry of the vector is squared). The host code allocates linear memory, while I’m using 2-dimensional indexing in the kernel.

The question is: how do I sum an element onto the diagonal of the matrix? The question refers to kernel calls of this type:

```cpp
#define N 8
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N*N + BLOCK_SIZE - 1) / BLOCK_SIZE)
...
dim3 dimGrid(NUM_BLOCKS, NUM_BLOCKS);
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
jacobian_kernel <<< dimGrid, dimBlock >>> (...);
...
```

so that the work is split into blocks, instead of calling

```cpp
jacobian_kernel <<< N, N >>> (...);
```

because N is going to be large.

Here is the kernel; a minimal example of the code can be found at https://drive.google.com/open?id=0B3-mzH5aNz_FcG1nNzN3RENsRmc

```cpp
template <typename T>
__global__ void jacobian_kernel(
        T * J,
        const T t0,
        const T tn,
        const T h,
        const T * u0,
        const T * un,
        const T * un_old)
{
    T cgamma = 2 - sqrtf(2);
    const unsigned int t   = threadIdx.x;
    const unsigned int b   = blockIdx.x;
    const unsigned int tid = t + b * blockDim.x;
    /*__shared__*/ T temp_sx[BLOCK_SIZE][BLOCK_SIZE];
    /*__shared__*/ T temp_dx[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ T sm_temp_du[BLOCK_SIZE];

    T* temp_du = &sm_temp_du[0];

    if (tid < N)
    {
        temp_sx[b][t] = un[t];
        temp_dx[b][t] = un[t];

        if (t == b)
        {
            if (tn == t0)
            {
                temp_du[t] = u0[t] * 0.001;

                temp_sx[b][t] += temp_du[t]; // (*)
                temp_dx[b][t] -= temp_du[t];

                temp_sx[b][t] += ( abs( temp_sx[b][t] ) < 10e-6 ? 0.1 : 0 );
                temp_dx[b][t] += ( abs( temp_dx[b][t] ) < 10e-6 ? 0.1 : 0 );

                temp_sx[b][t] = ( temp_sx[b][t] == 0 ? 0.1 : temp_sx[b][t] );
                temp_dx[b][t] = ( temp_dx[b][t] == 0 ? 0.1 : temp_dx[b][t] );
            }
            else
            {
                temp_du[t] = MAX( un[t] - un_old[t], 10e-6 );
                temp_sx[b][t] += temp_du[t];
                temp_dx[b][t] -= temp_du[t];
            }
        }

        // J = f(tn, un + du)
        d_func(tn, (temp_sx[b]), (temp_sx[b]), 1.f);
        d_func(tn, (temp_dx[b]), (temp_dx[b]), 1.f);

        J[tid] = (temp_sx[b][t] - temp_dx[b][t]) * powf((2 * temp_du[t]), -1);

        // J[tid] *= -h * cgamma / 2;
        // J[tid] += ( t == b ? 1 : 0 );
        // J[tid] = temp_J[tid];
    }
}
```

The general procedure for computing the jacobian is

1. Copy un into every row of temp_sx and temp_dx
2. Compute du as a small perturbation (0.001 of u0 in the code above)
3. Add du to the diagonal of temp_sx, subtract du from the diagonal of temp_dx
4. Compute the square function on each entry of temp_sx and temp_dx
5. Subtract them and divide every entry by 2*du

This procedure can be summarized as (f(un + du·e_i) - f(un - du·e_i)) / (2·du), where e_i is the i-th canonical basis vector.

My problem is how to add du to the diagonal of temp_sx and temp_dx, as I tried at the line marked (*). How can I achieve that?

In my attempt in the kernel, to sum on the diagonal I used

```cpp
if (threadIdx.x == blockIdx.x) { ... }
```

Why isn’t this correct? With this launch configuration it only evaluates true when both indices are 0, so du[0] is the only entry that receives a numerical value and the rest of the matrix becomes nan. Note that this approach worked with the first version of the code, where I launched the kernel with <<< N, N >>>.

What statement should I put there that works even if the matrices are split into blocks and threads?

Let me know if I should add any other information. Thank you.