Ordered Multiplication of various matrices in shared memory greater minds please help

I have big kernel in which in the end after performing lot of computations each thread gets a 6 by 6 matrix with it. ( this each thread matrix is in global memory as its matrix of doubles hence too large to fit in shared memory as ==> number_of_threads_per-block66*8/1024 > 16kb plus I have other stuff in my kernel in shared memory).

Now my goal in the end is to multiply each matrix in increasing thread order like :
FINAL 66 MATRIX = threadID(1)matirxthreadID(2)matrix…*threadID(N)matrix ; where N = number of total threads launched.

At first I thought to copy all the matrices back to the CPU and then do this orderd multiplication, but I feel I should take advantage of the threads which lie in same block to get some level of parallelism.

I am trying to find a way to multiply all the matrices in each thread block in order there so that each thread block will eventually have one matrix in the end which will be
= threadID(tid)matrix*threadID(tid+1) … threadID(tid + blocksize-1)matrix…
hence this way I can reduce the number of matrix being multiplied on the cpu.

But I don’t know how to do the ordered multiplication? One way maybe is to put __syncthreads and checking tid (causes lot of divergence) but am not being able to get it to work

Please I would really appreciate if anyone can help me this… am relatively new to programming in CUDA.

Thanks all

Please anybody there who can help me ? :(

Also one more question is there a way to make the writes to each 6by6 per thrad matrix of doubles coalesced ?

Well, you can recursively perform multiplication, i.e. M = M1M2M3M4Mn = (M1M2)(M3M4)(Mn-1 * Mn).

You also don’t need a lot of shared mem to do it, since all you need is to transfer just 1 float when dot product of row by column to compute ij element of a matrix.

Thanks for your quick response. :)

Please can you give me a small example of what you just stated , like take for instance we have 3 threads in a block each with its own 6 by 6 matirx so how would you go about multiplying the matrices in order ( matrix1*matrix2 *matrix3) … (am sorry I am not accustomed with this much hence am not following you exactly).

For my computations its very imp that they matrices are multiplied in order.

Also from recursive you did not meant recursion did u ? cause in CUDA I think (maybe am wrng) does not support recursion.

Again thanks for the help :)

Matrix multiplication is associative, (AB)C = A(BC), so you can apply recursion to it (I don’t mean recursion function calls).

Let’s imagine you have 8 threads, each has 1 matrix, and you need to multiply all of them M=M1M2M3M4M5M6M7*M8.

Now, applying recursion each thread will compute the following:

1st iteration

thread1: M12 = M1M2; thread2: M34 = M3M4; thread3: M56=M5M6; thread4: M78=M7M8; other threads don’t need to do anything

2nd iteration:

thread1: M1234 = M12 * M34; thread2: M5678=M56 * M78;

3rd iteration

thread1: M = M1234 * M5678

in the code it’d look something like this:


const int NumThreads = 8;

shared float col[NumThread/2][6] // will hold column from another thread

float mat[6][6]; // original matrix

float tmp[6][6];

for (int i = NumThreads/2 ; i > 0; i /= 2)


for (int m  = 0; m < 6; ++m)


  // get m'th column from another thread

  if (threadIdx.x >= i && threadIdx.x < i*2)


    for (int k = 0; k < 6; ++k)

       col[threadIdx.x - i][k] = mat[k][m];



// perform multiplication

  if (threadIdx.x < i)


    for (int n = 0; n < 6; ++n)


      tmp[n][m]  = 0

      for (int k = 0; k < 6; ++k)

        tmp[n][m] += mat[n][k] * col[threadIdx.x][k];



if (i > 1)

     __syncthreads();  // sync threads  before next iteration


memcpy(mat, tmp, 4 * 6 * 6);


if (threadIdx.x == 0)


pOut[threadIdx.x] = mat;



This code is probably full of bugs, but you might get the idea. I also assumed row-major matrix storage.

Great… thanks for the super fast and informative reply :) … and congrats on ur 50th post… you have four boxes now ;)

btw, changing
shared float col[NumThread/2][6]
shared float col[6][NumThread/2]
will avoid bankconflicts (which are there in the original code I wrote)

I am unsure about that as I will be using doubles(double precision) not floats, as in your code . But then I should only have 2 way bank conflicts (due to 64 bit nature of each double requiring 2 banks) with your approach I guess.

also can we use memcpy command inside the kernel ? as you suggest in your code.

memcpy(mat, tmp, 4 * 6 * 6);

I thought you cannot do any memory operations inside the kernel.

Compiler should unroll memcpy into movs
and you could use __double2hiint __double2loint __hiloint2double to pack/unpack doubles, so that you could still do a column transfer via shared memory without bank conflicts

Thanks :) …
I know about that fix and I have done that before it doesn’t buy you much if you have lot of computations/instruction or per thread (as in my case). So I guess , I will try without unpacking doubles first and then see the performance.
Anyone has any past experience with unpacking doubles were he saw noticeable increase in performance ?