Hi!
I'm quite stuck with this kernel. It multiplies an array of n vectors (k = 8 elements per vector) by an 8 x 8 matrix, producing n vectors.
The problem is that I use one thread per vector and per element in the vector (→ blockDim.x = k = 8, blockDim.y = 64, and enough blocks to cover the whole array), and this way the threads working on one vector repeatedly read from the same bank. So my problem is the access pattern I have to use.
Currently I don't know any way out of this, so I get a very high warp serialization count ( :unsure: ) (1149, using 28 blocks with 512 threads each). Btw, coalescing is also a problem here…
Just a note: this version is still faster than using registers instead of shared memory.
Maybe someone has an idea of a pattern to solve problems like this, where threads have to access the same memory banks over and over again (I know of CUBLAS and see this as a necessary exercise to better understand these problems)…
This is how my kernel looks (not exactly, but it's sufficient to show the problem):
[codebox]__global__ void multiplicationFun(float *vectors, float *matrix, float *results)
{
	float value = 0.0f;
	// 64 vectors per block, 8 elements each (blockDim.x = 8, blockDim.y = 64)
	__shared__ float vectorsShared[64][8];
	__shared__ float matrixShared[64];
	// offset of this block's first element in the global arrays
	int base = blockIdx.x * blockDim.y * 8;
	// store vectors in shared mem
	vectorsShared[threadIdx.y][threadIdx.x] = vectors[base + threadIdx.y * 8 + threadIdx.x];
	// the matrix has fewer elements (64) than the block has threads (512),
	// so only the threads with threadIdx.x == 0 store it in shared mem
	if (threadIdx.x == 0)
	{
		matrixShared[threadIdx.y] = matrix[threadIdx.y];
	}
	__syncthreads();
	// each thread multiplies one vector with one column of the matrix
	for (int i = 0; i < 8; i++)
	{
		value += vectorsShared[threadIdx.y][i] * matrixShared[threadIdx.x + i * 8];
	}
	results[base + threadIdx.y * 8 + threadIdx.x] = value;
}[/codebox]