Any ideas on how to get rid of warp serialize / coalescing here (vector x matrix)?

Hi!

I'm quite stuck with this kernel. It multiplies an array of n vectors (k = 8 elements per vector) by an 8 x 8 matrix, producing n vectors.
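For clarity, here is the operation in plain C, using the same row-major layout the kernel below assumes (the array names are just placeholders):

[codebox]// reference implementation of the operation; layout matches the kernel below:
// vectors[v*8 + i] is element i of vector v, matrix[i*8 + x] is row i, column x
void multiplyReference(const float *vectors, const float *matrix,
                       float *results, int n)
{
	for(int v = 0; v < n; v++)
	{
		for(int x = 0; x < 8; x++)
		{
			float value = 0.0f;
			for(int i = 0; i < 8; i++)
			{
				value += vectors[v * 8 + i] * matrix[i * 8 + x];
			}
			results[v * 8 + x] = value;
		}
	}
}[/codebox]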

The problem is that I use one thread per (vector, element) pair (→ blockDim.x = k = 8, blockDim.y = 64, and enough blocks to cover the whole array), and this way the threads accessing the elements of one vector repeatedly read from the same bank. So my problem is the access pattern I am forced into.

Currently I don't see any way out of this, so I get a very high number of warp serializes ( :unsure: ) (1149, with 28 blocks of 512 threads each). Btw, coalescing is also a problem here…

Just a note: this version is still faster than using registers instead of shared memory.

Maybe someone knows a pattern for problems like this, where threads have to access the same memory banks over and over again (I know about CUBLAS and see this as a necessary exercise for a better understanding of these problems)…

This is how my kernel looks (not exactly, but it is sufficient to show the problem):

[codebox]__global__ void multiplicationFun(float *vectors, float *matrix, float *results)
{
	float value = 0.0f;

	// one row of shared memory per vector, 8 elements each
	__shared__ float vectorsShared[64][8];
	__shared__ float matrixShared[64];

	// store the vectors in shared memory
	// (14336 is a fixed offset here, standing in for the per-block offset of the real kernel)
	vectorsShared[threadIdx.y][threadIdx.x] = vectors[14336 + threadIdx.y * 8 + threadIdx.x];

	// the matrix has fewer elements than there are threads,
	// so only part of them store it in shared memory
	if(threadIdx.x == 0)
	{
		matrixShared[threadIdx.y] = matrix[threadIdx.y];
	}

	__syncthreads();

	// each thread multiplies one vector with one column of the matrix
	for(int i = 0; i < 8; i++)
	{
		value += vectorsShared[threadIdx.y][i] * matrixShared[threadIdx.x + i * 8];
	}

	results[14336 + threadIdx.y * blockDim.x + threadIdx.x] = value;
}[/codebox]
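For reference, the launch looks roughly like this (28 blocks of 8 × 64 threads, matching the numbers above; buffer names and sizes are placeholders):

[codebox]// sketch of the host-side launch; dVectors/dMatrix/dResults are placeholders
float *dVectors, *dMatrix, *dResults;
cudaMalloc((void**)&dVectors, (14336 + 512) * sizeof(float)); // large enough for the fixed offset above
cudaMalloc((void**)&dMatrix,  64 * sizeof(float));            // the 8 x 8 matrix
cudaMalloc((void**)&dResults, (14336 + 512) * sizeof(float));

dim3 block(8, 64);   // blockDim.x = k = 8, blockDim.y = 64 -> 512 threads per block
multiplicationFun<<<28, block>>>(dVectors, dMatrix, dResults);
cudaThreadSynchronize();[/codebox]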

I have several comments, assuming a thread block of {blockDim.y = 64, blockDim.x = 8}.

Known fact: shared memory is serviced per half-warp (one transaction covers 16 threads).

Let each thread have the coordinates (threadIdx.x, threadIdx.y). The first half-warp is then

(0,0) (1,0) (2,0) (3,0) (4,0) (5,0) (6,0) (7,0)

(0,1) (1,1) (2,1) (3,1) (4,1) (5,1) (6,1) (7,1)
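To see why these 16 threads form one half-warp, you can linearize the thread index the way the hardware does; a small sketch:

[codebox]// threads are linearized as tid = threadIdx.y * blockDim.x + threadIdx.x,
// so with blockDim.x = 8 the rows y = 0 and y = 1 give tid = 0..15,
// i.e. exactly the first half-warp
int tid      = threadIdx.y * blockDim.x + threadIdx.x;
int halfWarp = tid / 16;   // 16 threads per half-warp on compute 1.x[/codebox]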

In the code block

[codebox]for(int i = 0; i < 8; i++)
{
	value += vectorsShared[threadIdx.y][i] * matrixShared[threadIdx.x + i * 8];
}[/codebox]

with the index i fixed:

(1) vectorsShared[threadIdx.y][i] looks like an 8-way bank conflict at first, since

{(0,0) (1,0) (2,0) (3,0) (4,0) (5,0) (6,0) (7,0)} access vectorsShared[0][i]

{(0,1) (1,1) (2,1) (3,1) (4,1) (5,1) (6,1) (7,1)} access vectorsShared[1][i]

However, due to the broadcast mechanism only two transactions are issued, so it is equivalent to a 2-way conflict.

(2) matrixShared[threadIdx.x + i * 8] is a 2-way bank conflict, since

{(0,0), (0,1)} access matrixShared[0 + i * 8]

{(1,0), (1,1)} access matrixShared[1 + i * 8]

…

{(7,0), (7,1)} access matrixShared[7 + i * 8]

I think the broadcast mechanism would be disabled here, and only two transactions are needed (the broadcast mechanism would issue 8 transactions, one per distinct word, which is not good).

To sum up, your algorithm has a 2-way bank conflict.

Given this observation, I would suggest changing the code block

[codebox]if(threadIdx.x == 0)
{
	matrixShared[threadIdx.y] = matrix[threadIdx.y];
}[/codebox]

to

[codebox]int gid = threadIdx.y * 8 + threadIdx.x;
if( gid < 64 )
{
	matrixShared[ gid ] = matrix[ gid ];
}[/codebox]
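Beyond that copy (which should also make the global read of matrix coalesced, since 64 consecutive threads then read 64 consecutive floats), the 2-way conflict in the multiply loop could be attacked by staggering the column index with threadIdx.y, so that the two rows of a half-warp touch the two different halves of the 16 banks. A sketch of the idea, untested; note the vectorsShared read is still served by two broadcast transactions per half-warp:

[codebox]// rotate the column index per row of threads: for a fixed iteration i,
// rows y and y+1 of a half-warp get indices j of opposite parity, so the
// matrixShared bank (threadIdx.x + 8 * (j % 2)) differs between the rows
// and the 16 threads cover all 16 banks
for(int i = 0; i < 8; i++)
{
	int j = (i + threadIdx.y) & 7;   // staggered column, still visits all 8 columns
	value += vectorsShared[threadIdx.y][j] * matrixShared[threadIdx.x + j * 8];
}[/codebox]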