Hi everyone.

I have, as input, 3 jagged 2D matrices (2 of doubles and 1 of __int64 values) and 2 square matrices of doubles (for my results). I noticed that, with my algorithm, computing all the elements of one column of my resulting matrices requires reading an entire row of my 3 input matrices. So, my goal is to tile each row of my input matrices by reading it into shared memory, then using that instead of reading from global memory multiple times. I illustrate it like so:

[codebox]
input
| SMEM TILE |
    [>]

results
[ |]
[ |]
[ |]
[ |]
[ |]
[ |]
[ |]
[/]
[/codebox]

Given the nature of the algorithm, results[0][0] does not need every element in the SMEM tile from all 3 input arrays; depending on a bunch of branching conditions, it may or may not need certain values. I keep an index value that tracks, for the entire block, how far a thread can read before it needs a new tile of updated information. I call __syncthreads() in this loadSMEM() path, so a thread is forced to wait until all threads of the block also reach a point where they need a new tile of information.

Also, the threads never have to “go back” and read something from earlier in the row of the input matrix. A thread is always reading within the current tile and moving forward until the end of the row.
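To make that bookkeeping concrete, here is a minimal host-side C++ model of the tile-window idea (the names TILE, TileWindow, and get() are illustrative stand-ins, not my actual device code): reload when the requested index m runs past idx, then serve element m from tile[m % TILE]. For a single forward scan starting at m = 0, every reload lands on a multiple of TILE, so the modular indexing lines up with the tile contents.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side model of the tile window (illustrative only; TILE stands in
// for SMEM_LIMIT, and one TileWindow stands in for one smem buffer).
constexpr std::size_t TILE = 8;

struct TileWindow {
    const std::vector<long long>& row;  // one row of a jagged input array
    long long tile[TILE];               // the "shared memory" tile
    std::size_t idx = 0;                // first element NOT yet in the tile

    explicit TileWindow(const std::vector<long long>& r) : row(r) {}

    // Same shape as getTimes(): reload when m runs past the tile,
    // then read the element back out via m % TILE.
    long long get(std::size_t m) {
        if (m >= idx) {                 // element m hasn't been tiled yet
            idx = m;                    // new tile starts at m
            for (std::size_t i = 0; i < TILE && idx < row.size(); ++i)
                tile[i] = row[idx++];
        }
        return tile[m % TILE];
    }
};
```

Scanning forward from m = 0, w.get(m) should always equal row[m]; that is the invariant the device version has to preserve across the whole block.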

Does this situation make sense? I’m having a very difficult time actually getting this guy to give me the correct results. I will show you pieces of my code:

[codebox]

/// Kernel.cu ///

#define SMEM_LIMIT 512

__global__ void kernel(double* d_values, double** d_valuesPtr, double* ntimes, double** ntimesPtr, __int64* d_times, __int64** d_timesPtr, more_params)

{

    // this is the indexing scheme I need based on how I allocate blocks for the kernel
    // ty is the row of the jagged array we are going across
    int ty = blockIdx.x / numBlocksPerRow;
    int tx = (blockIdx.x - (ty * numBlocksPerRow)) * blockDim.x + threadIdx.x;
    __shared__ double smemVALUES[SMEM_LIMIT];
    __shared__ double smemNTIMES[SMEM_LIMIT];
    __shared__ __int64 smemTIMES[SMEM_LIMIT];
    __shared__ ulong idx[1];
    idx[0] = 0;
    // only need as many threads for a row as there are elements (1 thread per element of a row)
    if (tx < numRows)
    {
        // d_cPtr & d_pPtr are the 2D arrays in the GPU that are the square matrices for results
        double& val1 = d_cPtr[tx][ty];
        double& val2 = d_pPtr[tx][ty];
        // N is the length of the Nth jagged array row for the 3 input matrices (all 3 matrices have the exact same dimensions)
        // M is the length of the Mth jagged array row for the 3 input matrices
        for (int n = 0, m = 0; n < N; n++)
        {
            // 3 input arrays are values, ntimes, times (doubles, doubles, __int64 respectively)
            // Call this L1
            for ( ; some_conditions && getTimes(smemTIMES, idx[0], ty, M, m) == some_val; m++)
                update_stuff_1; // calls getValues(stuff)
            // Call this L2
            for ( ; m < M && getTimes(smemTIMES, idx[0], ty, M, m) == some_val; m++)
                update_stuff_2; // same
            // Call this L3
            for (int mm = m; mm < M && getTimes(smemTIMES, idx[0], ty, M, mm) == some_val; mm++)
                update_stuff_3; // same
        }
    }

}

// note: inside update_stuff_1, update_stuff_2, and update_stuff_3 there are calls to a method getValues(), which reads from the shared memory blocks smemVALUES and smemNTIMES (both of type double), whereas smemTIMES is of type __int64.

__device__ __int64& getTimes(__int64* smemTIMES, ulong& idx, const int& ty, const ulong& M, const ulong& m)

{

    // idx is the location of the element in the jagged array where the SMEM tile stops (the tile is 1D, of size SMEM_LIMIT)
    // m is the current location in the jagged array this thread is looking at
    // if m >= idx, then we are trying to get an element from SMEM that hasn't been tiled yet
    if (m >= idx)
        // need a wrapper so that both getTimes() & getValues() get to the same location in code
        loadSMEMWrapper(similar_params);
    // e.g. for m = 512 on array d_timesPtr[ty][m]: smemTIMES[m % SMEM_LIMIT] = smemTIMES[512 % 512] = smemTIMES[0]
    // this is how we map the global index m to its slot in the tile
    return smemTIMES[m % SMEM_LIMIT];

}

__device__ double& getValues(similar_params)

{

    // same setup as getTimes()
    ...
    if (type == "values")
        return smemVALUES[m % SMEM_LIMIT];
    else
        return smemNTIMES[m % SMEM_LIMIT];

}

__device__ void loadSMEMWrapper(similar_params)

{

    // loading a tile is a block-wide rendezvous--every thread has to "hang" at m >= idx in order to be ready to load a new tile
    __syncthreads();
    if (threadIdx.x == 0)
    {
        loadSMEM<double>(smemVALUES, valuesPtr, idx, ty, M, m);
        loadSMEM<double>(smemNTIMES, ntimesPtr, idx, ty, M, m);
        loadSMEM<__int64>(smemTIMES, timesPtr, idx, ty, M, m);
        // after the 3 loadSMEM() calls, idx is either:
        //   idx = m + SMEM_LIMIT
        //   idx = M
        // depending on which condition of the for loop inside loadSMEM() fails first
        // in either case there is no need to alter idx here; it is ready for the next time loadSMEMWrapper() is called
    }
    // don't let anyone return a value until all 3 new tiles have been loaded
    __syncthreads();

}

template <class T>

__device__ void loadSMEM(T* smem, T** arrPtr, ulong& idx, const ulong& ty, const ulong& M, const ulong& m)

{

    idx = m;
    // only load tile information as long as we have room in the tile or until we run out of elements on row ty
    // recall, M is the length of row ty of the jagged input arrays
    for (ulong i = 0; i < SMEM_LIMIT && idx < M; i++)
        smem[i] = arrPtr[ty][idx++];

}

[/codebox]

…whew. I’m omitting some detail so that this thing can at least partly be understood. I would greatly appreciate any insight into what might be going wrong here. If I use the getTimes() method only in loop L1 and leave everything else reading from global memory as normal, then when I run the program and check my results, the output of the 2 square matrices is a little off. Not all the values are wrong, and the wrong ones might only be off by 0.01 or so. So, something small is going wrong here.
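For what it's worth, this is roughly how I quantify "a little off" on the host side (a hypothetical helper; compareResults is not part of my kernel code, and it assumes both result matrices have been copied back to host and flattened into vectors):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical host-side check: count mismatches between a reference run
// (all global-memory reads) and the tiled run, and report the worst error.
static std::size_t compareResults(const std::vector<double>& ref,
                                  const std::vector<double>& tiled,
                                  double tol = 1e-9)
{
    std::size_t bad = 0;
    double worst = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        double err = std::fabs(ref[i] - tiled[i]);
        if (err > tol) ++bad;
        if (err > worst) worst = err;
    }
    std::printf("%zu mismatches, worst abs error = %g\n", bad, worst);
    return bad;
}
```

This is how I know the damage is partial: only some entries differ, and those only by around 0.01.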

This is a dense and specific problem I am asking about. Thanks to those of you that actually try to get through it!

Daniel