Some help needed with shared memory and program correctness in a matrix * vector operation

What I have set up this code to do (it works as intended with SHARED_MEM == 0, i.e. using just global memory) is to take a matrix and divide it into blocks of rows. Each block of the matrix is assigned to a CUDA Thread Block, and each row of that block is assigned to a thread of the Thread Block.

Shared memory is set up to store one block of rows of the matrix, and each Thread Block should have its own shared memory pool (Thread Blocks do not share shared memory).

The code produces correct results with shared memory “turned on” up until element #17 (TBLOCK = 16, ROWS = 2 * 16, COLS = 2 * 16): that is 2 Thread Blocks with 16 threads each, a 32x32 source matrix, a 32x1 source vector, and a 32x1 result vector.

Here is the code:

__global__ void MatTest(
	float *d_C,
	float *d_A,
	float *d_B)
{
	// Block index
	int bx = blockIdx.x;
	// Thread index
	int tx = threadIdx.x;

	int i = tx + (bx * blockDim.x);

	//printf ("tx: %d\n", tx);
	//printf ("bx: %d\n", bx);
	//printf ("i: %d\n", i);

#if SHARED_MEM == 1
	__shared__ float matA[TBLOCK * COLS];
	//__shared__ float vecB[ROWS];
	//__shared__ float vecC[ROWS];

	int c;
	for (c = 0; c < COLS; c++) {
		matA[indexC(tx,c, COLS)] = d_A[indexC(i,c,COLS)];
		//vecB[c] = d_B[c];
		//vecC[c] = 0;
	}

	float t = 0;
	for (c = 0; c < COLS; c++) {
		t += matA[indexC(tx,c,COLS)] * d_B[c];
		//d_C[i] = 2;
	}

	d_C[i] = t;
	//this kernel uses  bytes of Shared Memory and  bit registers...
#endif

#if SHARED_MEM == 0
	for (int c = 0; c < COLS; c++) {
		d_C[i] += d_A[indexC(i,c,COLS)] * d_B[c];
	}
	//this kernel uses  bytes of Shared Memory and  bit registers...
#endif

	//printf("d_C[%d]: %f\n", i, d_C[i]);
}


I have been thinking it over quite a bit, but I cannot find any obvious error, so I’d gladly take some advice…

thanks in advance,



I am an idiot :P.

Executing in emudebug mode helped me see an out-of-bounds access issue, which in turn revealed that I was passing the wrong argument to the macro indexC (NCOLS instead of NROWS):

(corrected macros)

#define indexR(i, j, n_cols) ((j) + ((i) * (n_cols))) //row-major matrix
#define indexC(i ,j, n_rows) (((j) * (n_rows)) + (i)) //column major order addressing + 1st element has id#0

so instead of this:

matA[indexC(tx,c, TBLOCK)]…

I was doing…

matA[indexC(tx,c, COLS)] …

in the two loops.

Also… using COLS instead of ROWS was not causing a problem only because the matrix happens to be square…
nVIDIA, thanks for the emudebug mode :D.

Sorry for wasting your time with a dumb error :(…