2D float matrix x vector: global vs. shared memory:


The following kernel processes a 2D float matrix stored in global memory. Rows are contiguous (row-major layout).

Each row is multiplied element-wise by a constant vector, and the result is written back to global memory.

In order to improve performance I tried declaring the constant vector in shared memory. I declared a 4096-element vector, which is the maximum allowed.

However, I see no improvement. I ran the code on a Jetson TX2.

Does this make sense? In NPP I did not find a function that performs this calculation.

Thank you,

__global__ void mat_multiply_row_vector_kernel (float *pSrcDest, float *pScalar, int nx, int ny)
{
	unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
	unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
	unsigned int idx = iy*nx + ix;
	if (ix>=nx || iy>=ny)
		return;

	float src = pSrcDest[idx];
#ifdef SHARED
	__shared__ float vec[4096];
	float val = vec[ix];
#else
	float val = pScalar[ix];
#endif
	float dest;

	dest = src * val;
	pSrcDest[idx] = dest;
}
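(For reference, a launch of this kernel would typically look like the following; the 32x8 block shape and the device pointer names d_pSrcDest / d_pScalar are illustrative, not from my actual code.)

```cuda
// Cover an ny-by-nx matrix with a 2D grid of 32x8 thread blocks.
dim3 block(32, 8);
dim3 grid((nx + block.x - 1) / block.x,
          (ny + block.y - 1) / block.y);
mat_multiply_row_vector_kernel<<<grid, block>>>(d_pSrcDest, d_pScalar, nx, ny);
```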


I see you declared the shared memory space but didn't actually load anything into it.
Look at your #ifdef SHARED branch: you have the __shared__ float vec[4096]; declaration, and right after it you load into val whatever happens to be in vec[ix]. But… what is in vec[ix]?!

But let's assume you did load something into it, respecting the architecture of 32 banks of 4- or 8-byte words (the word size is configurable), and avoiding having threads of the same warp access different words in the same bank (a bank conflict, which causes the accesses to be serialized). Even then, the use of shared memory will only bring a benefit if, for example, you need to load an element from global memory and access it multiple times. Read here for a more elaborate answer: https://stackoverflow.com/questions/8011376/when-is-cudas-shared-memory-useful
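To make that assumption concrete, here is a minimal sketch of how the vector could be loaded cooperatively by the block before use. It reuses the names from your kernel, assumes nx <= 4096, and is a sketch rather than tested code:

```cuda
__global__ void mat_multiply_row_vector_kernel (float *pSrcDest, float *pScalar, int nx, int ny)
{
	__shared__ float vec[4096];

	unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
	unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;

	// Cooperative load: each thread of the block copies a strided slice
	// of pScalar into shared memory. This must happen (and the block must
	// synchronize) BEFORE any thread exits via the bounds check, so that
	// every thread reaches __syncthreads().
	for (unsigned int i = threadIdx.y * blockDim.x + threadIdx.x;
	     i < (unsigned int)nx;
	     i += blockDim.x * blockDim.y)
		vec[i] = pScalar[i];
	__syncthreads();

	if (ix >= nx || iy >= ny)
		return;

	pSrcDest[iy*nx + ix] = pSrcDest[iy*nx + ix] * vec[ix];
}
```

Note that each block still reads the whole vector from global memory once, so this only pays off when the block has many rows reusing the same vec[ix] entries.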

So, as nice as it is to have a user-programmable cache, not all problems benefit from it.
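For completeness: since every thread here reads pScalar[ix] exactly once, a lighter-weight option (my suggestion, not something from NPP) is to let the hardware's read-only data cache handle the vector by qualifying the pointer:

```cuda
__global__ void mat_multiply_row_vector_kernel (float *pSrcDest,
						const float * __restrict__ pScalar,
						int nx, int ny)
{
	unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
	unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
	if (ix >= nx || iy >= ny)
		return;
	// const __restrict__ allows the compiler to route the pScalar reads
	// through the read-only data cache (the same path as __ldg()),
	// which captures the cross-row reuse without any explicit staging.
	pSrcDest[iy*nx + ix] *= pScalar[ix];
}
```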