Matrices, Tiling & Shared Memory Woes

Hi everyone.

As input, I have 3 jagged 2D matrices (2 of doubles and 1 of __int64 values) and 2 square matrices of doubles (for my results). I noticed that, with my algorithm, every element of a column of my resulting matrices requires reading an entire row of all 3 input matrices. So, my goal is to tile each row of my input matrices by reading it into shared memory, then use that tile instead of reading from global memory multiple times. I illustrate it like so:






[ |]
[ |]
[ |]
[ |]
[ |]
[ |]
[ |]



Given the nature of the algorithm, it is not that results[0][0] needs every element in the SMEM tile from all 3 input arrays; depending on a bunch of branching conditions, it may or may not need certain values. I have an index value that tracks, for the entire block, how far a thread can read before it needs a new tile of updated information. I call __syncthreads() in this loadSMEM() method, and each thread is forced to wait until all threads of the block also reach a point where they need a new tile of information.

Also, the threads never have to “go back” and read something much earlier in the row of the input matrix. It is always in the current tile and moving forward until the end of the row.

Does this situation make sense? I'm having a very difficult time getting this guy to give me the correct results. Here are the relevant pieces of my code:


/// ///

#define SMEM_LIMIT 512

__global__ void kernel(double* d_values, double** d_valuesPtr, double* ntimes, double** ntimesPtr, __int64* d_times, __int64** d_timesPtr, more_params)


// this is the indexing scheme I need based on how I allocate blocks for the kernel

// ty is the row of the jagged array we are going across

int ty = blockIdx.x / numBlocksPerRow;

int tx = (blockIdx.x - ty * numBlocksPerRow) * blockDim.x + threadIdx.x;

__shared__ double  smemVALUES[SMEM_LIMIT];

__shared__ double  smemNTIMES[SMEM_LIMIT];

__shared__ __int64 smemTIMES[SMEM_LIMIT];

__shared__ ulong   idx[1];

idx[0] = 0;

// only need as many threads for a row as there are elements (1 thread per element of a row)

if (tx < numRows)


	// d_cPtr & d_pPtr are the 2D arrays in the GPU that are the square matrices for results

	double& val1 = d_cPtr[tx][ty];

	double& val2 = d_pPtr[tx][ty];

	// N is the length of the Nth jagged array row for the 3 input matrices (all 3 matrices have exact same dimensions)

	// M is the length of the Mth jagged array row for the 3 input matrices

	for (int n = 0, m = 0; n < N; n++)


		// 3 input arrays are values, ntimes, times (doubles, doubles, __int64 respectively)

		// Call this L1

		for ( ; some_conditions && getTimes(smemTIMES,idx[0],ty,M,m) == some_val; m++)

			update_stuff_1; // calls getValues(stuff)

		// Call this L2

		for ( ; m < M && getTimes(smemTIMES,idx[0],ty,M,m) == some_val; m++)

			update_stuff_2; // same


		// Call this L3

		for (int mm = m; mm < M && getTimes(smemTIMES,idx[0],ty,M,mm) == some_val; mm++)

			update_stuff_3; // same




// note, inside of update_stuff_1, update_stuff_2, update_stuff_3, there are calls to a method called getValues() which grabs from the shared memory blocks I allocated called smemVALUES and smemNTIMES, which are both of type double, whereas smemTIMES is of type __int64.

__device__ __int64& getTimes(__int64* smemTIMES, ulong& idx, const int& ty, const ulong& M, const ulong& m)


// idx is the location of the element in the jagged array where the SMEM tile stops at (has 1D tile of size SMEM_LIMIT)

// m is the current location of the jagged array this thread is looking at

// if m >= idx, then we are trying to get an element from SMEM that hasn't been tiled yet

if (m >= idx)

	// need the wrapper so that both getTimes() & getValues() get to the same location in code
	loadSMEMWrapper(similar_params);


// eg. for m = 512 on array d_timesPtr[ty][m], then smemTIMES[m % SMEM_LIMIT] = smemTIMES[512 % 512] = smemTIMES[0]

// this is how we map the global index m onto the 1D tile

return smemTIMES[m % SMEM_LIMIT];


__device__ double& getValues(similar_params)


// same setup as getTimes()


if (type == "values")

	return smemVALUES[m % SMEM_LIMIT];


	return smemNTIMES[m % SMEM_LIMIT];


__device__ void loadSMEMWrapper(similar_params)


// loading a tile is a kind of race condition within the block--every thread has to "hang" at m >= idx in order to be ready to load a new tile


if (threadIdx.x == 0)

	// the 3 loadSMEM() calls go here, one per input array (detail omitted):
	// loadSMEM(smemVALUES, ...); loadSMEM(smemNTIMES, ...); loadSMEM(smemTIMES, ...);

	// after the 3 loadSMEM() calls, idx is either:
	//   idx = m + SMEM_LIMIT, or
	//   idx = M
	// depending on which condition of the for loop inside loadSMEM() fails first.
	// In either case there is no need to alter idx here; it is ready for the
	// next time loadSMEMWrapper() is called.

// don't let anyone return a value until all 3 of the new tiles have been loaded
__syncthreads();



template <typename T>
__device__ void loadSMEM(T* smem, T** arrPtr, ulong& idx, const ulong& ty, const ulong& M, const ulong& m)


idx = m;

// only load tile information as long as we have room in the tile or until we run out of elements on ty

// recall, M is the length of row ty of jagged input arrays

for (ulong i = 0; i < SMEM_LIMIT && idx < M; i++)

	smem[i] = arrPtr[ty][idx++];



…whew. I'm omitting some detail so that this thing can at least partly be understood. I would greatly appreciate any insight into what might be going wrong here. If I use getTimes() only in L1 and leave everything else reading from global memory as normal, then when I run the program the output of the 2 square matrices is a little off. Not all the values are wrong, and the wrong ones may only be off by 0.01 or so. So, something small is going wrong here.

This is a dense and specific problem I am asking about. Thanks to those of you that actually try to get through it!


I haven't had time to understand the code, but I notice that loadSMEMWrapper() contains a __syncthreads() and is called conditionally, which is a no-no.


I see what you mean. My problem was that I assumed each thread would access the 2D jagged array in the same way. If that were the case, every thread would take the if (m >= idx) branch and block at the __syncthreads() call, as I expected. As it stands, I just learned that my underlying assumption about how the jagged array is accessed was false. Otherwise, I think my idea would have worked.