I have a big kernel in which, after a lot of computation, each thread ends up with a 6 by 6 matrix. (Each per-thread matrix lives in global memory, since they are matrices of doubles and too large to fit in shared memory: number_of_threads_per_block * 6 * 6 * 8 bytes / 1024 > 16 KB, and I have other stuff from my kernel in shared memory already.)

Now my goal is to multiply all the matrices together in increasing thread-ID order, like:

FINAL 6*6 MATRIX = threadID(1)matrix * threadID(2)matrix * … * threadID(N)matrix, where N = total number of threads launched.

At first I thought to copy all the matrices back to the CPU and do this ordered multiplication there, but I feel I should take advantage of the threads that lie in the same block to get some level of parallelism.
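For reference, the straightforward host-side version I'd be replacing would look something like the sketch below (names and layout are my own assumptions: N row-major 6x6 matrices stored back to back, one per thread, in thread-ID order):

```cpp
#include <cstddef>
#include <vector>

// C = A * B for row-major 6x6 matrices of doubles.
static void matmul6(const double *A, const double *B, double *C) {
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < 6; ++j) {
            double s = 0.0;
            for (int k = 0; k < 6; ++k)
                s += A[i * 6 + k] * B[k * 6 + j];
            C[i * 6 + j] = s;
        }
}

// Multiply strictly in thread-ID order:
// result = M_0 * M_1 * ... * M_{n-1}.
// mats[t*36 + i*6 + j] is entry (i, j) of thread t's matrix.
std::vector<double> orderedProduct(const std::vector<double> &mats,
                                   std::size_t n) {
    std::vector<double> result(mats.begin(), mats.begin() + 36);
    double tmp[36];
    for (std::size_t t = 1; t < n; ++t) {
        matmul6(result.data(), &mats[t * 36], tmp);  // tmp avoids aliasing
        for (int i = 0; i < 36; ++i)
            result[i] = tmp[i];
    }
    return result;
}
```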

I am trying to find a way to multiply the matrices within each thread block in order, so that each thread block eventually holds one matrix:

= threadID(tid)matrix * threadID(tid+1)matrix * … * threadID(tid+blocksize-1)matrix

That way I can reduce the number of matrices that have to be multiplied on the CPU.

But I don't know how to do the ordered multiplication. One idea is to use __syncthreads() and check tid (which causes a lot of divergence), but I haven't been able to get it to work.
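One pattern I've been considering is an adjacent-pair tree reduction: matrix multiplication is associative (even though it isn't commutative), so as long as each step only combines a matrix with its immediate right-hand neighbour's partial product, the thread-ID order is preserved. A sketch of what I mean is below; all names are hypothetical, and I'm assuming the per-thread matrices sit contiguously in global memory in global thread-ID order:

```cuda
// C = A * B for row-major 6x6 matrices of doubles.
__device__ void matmul6(const double *A, const double *B, double *C) {
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < 6; ++j) {
            double s = 0.0;
            for (int k = 0; k < 6; ++k)
                s += A[i * 6 + k] * B[k * 6 + j];
            C[i * 6 + j] = s;
        }
}

// M: one 6x6 matrix (36 doubles) per thread, in global thread-ID order.
// blockResults: one 6x6 matrix per block, the block's ordered product.
__global__ void blockOrderedProduct(double *M, double *blockResults) {
    int tid = threadIdx.x;
    double *mine = M + (blockIdx.x * blockDim.x + tid) * 36;

    // Tree reduction over the matrices in this block. At each step,
    // thread tid holds the product of an adjacent run of matrices and
    // multiplies it (on the right) by the run held by thread tid+stride.
    // Combining only adjacent runs keeps the non-commutative order.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();  // outside the if, so all threads reach it
        if (tid % (2 * stride) == 0 && tid + stride < blockDim.x) {
            double tmp[36];
            matmul6(mine, mine + stride * 36, tmp);
            for (int i = 0; i < 36; ++i)
                mine[i] = tmp[i];
        }
    }

    // Thread 0 now holds the ordered product of the whole block.
    if (tid == 0)
        for (int i = 0; i < 36; ++i)
            blockResults[blockIdx.x * 36 + i] = mine[i];
}
```

The divergence in the if is the usual price of a reduction; the important part is that __syncthreads() stays outside the divergent branch so every thread in the block reaches it. The per-block results would then be multiplied in block-ID order on the CPU (or by a second kernel launch). I haven't verified this is the best approach, so corrections are welcome.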

I would really appreciate any help with this; I'm relatively new to programming in CUDA.

Thanks all