Hello,
The kernel below operates on a 2D float matrix stored in global memory.
The matrix is row-major: the elements of a row are consecutive in global memory.
Each row is multiplied element by element (sample by sample) by a constant vector whose length equals the row length.
The result is written back to global memory in place.
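To make the operation precise, here is a minimal CPU reference sketch of the same calculation, which can also be used to check the GPU output. The name mat_multiply_row_vector_ref and the use of std::vector are assumptions for illustration only; they are not part of the GPU code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (name and container types are illustrative).
// Multiplies every row of a row-major matrix with ny rows of nx elements,
// element by element, by a vector of length nx. The matrix is updated in place.
void mat_multiply_row_vector_ref(std::vector<float> &mat,
                                 const std::vector<float> &vec,
                                 int nx, int ny)
{
    for (int iy = 0; iy < ny; ++iy)
        for (int ix = 0; ix < nx; ++ix)
            mat[(std::size_t)iy * nx + ix] *= vec[ix];
}
```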
To improve performance, I tried placing the constant vector in shared memory (compile with SHARED defined).
I declared a vector of 4096 floats, which is the maximum allowed.
However, I see no improvement in run time.
I ran the code on a Jetson TX2.
Does that make sense?
I also did not find a function in NPP that performs the same calculation.
Thank you,
Zvika
/**************************************************************************************/
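// Multiplies each row of a row-major matrix (ny rows of nx floats), element by element,
// by the vector pScalar (nx floats). The result overwrites pSrcDest.
__global__ void mat_multiply_row_vector_kernel (float *pSrcDest, float *pScalar, int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;   // column index
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;   // row index
    bool inRange = (ix < nx && iy < ny);
#ifdef SHARED
    // Shared memory starts uninitialized, so the block first copies its slice of the vector.
    // Only the first blockDim.x entries are used; they are indexed by threadIdx.x, not ix.
    __shared__ float vec[4096];
    if (threadIdx.y == 0 && ix < nx)
        vec[threadIdx.x] = pScalar[ix];
    __syncthreads();   // all threads of the block must reach the barrier before any returns
    if (!inRange)
        return;
    float val = vec[threadIdx.x];
#else
    if (!inRange)
        return;
    float val = pScalar[ix];
#endif
    unsigned int idx = iy*nx + ix;
    pSrcDest[idx] = pSrcDest[idx] * val;
}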
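
For reference, a minimal host-side sketch of how such a kernel could be launched and timed with CUDA events. The matrix size (4096 x 1024), the 32 x 8 block shape, and the stand-in kernel name scale_rows_kernel are assumptions for illustration, not the configuration used for the measurement described above. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical name) that follows the global-memory path of the kernel above.
__global__ void scale_rows_kernel(float *pSrcDest, const float *pScalar, int nx, int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix >= nx || iy >= ny)
        return;
    pSrcDest[iy * nx + ix] *= pScalar[ix];
}

int main()
{
    const int nx = 4096, ny = 1024;   // assumed matrix size: ny rows of nx floats
    std::vector<float> hMat((size_t)nx * ny, 1.0f), hVec(nx, 2.0f);

    float *dMat = nullptr, *dVec = nullptr;
    cudaMalloc(&dMat, hMat.size() * sizeof(float));
    cudaMalloc(&dVec, hVec.size() * sizeof(float));
    cudaMemcpy(dMat, hMat.data(), hMat.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dVec, hVec.data(), hVec.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32, 8);                // assumed block shape; x runs along a row
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    // Warm-up launch so that one-time startup cost is not included in the timing.
    scale_rows_kernel<<<grid, block>>>(dMat, dVec, nx, ny);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale_rows_kernel<<<grid, block>>>(dMat, dVec, nx, ny);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each matrix element is read once and written once: about 2 * nx * ny floats of traffic.
    double gbPerSec = 2.0 * nx * ny * sizeof(float) / (ms * 1.0e6);
    printf("kernel time: %.3f ms, effective bandwidth: %.2f GB/s\n", ms, gbPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dMat);
    cudaFree(dVec);
    return 0;
}
```

Comparing the reported effective bandwidth for builds with and without SHARED is one way to see whether the vector loads contribute noticeably to the run time.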