I have a big kernel in which, after a lot of computation, each thread ends up with a 6 by 6 matrix. (Each per-thread matrix lives in global memory, since they are matrices of doubles and too large to fit in shared memory: number_of_threads_per_block * 6 * 6 * 8 bytes / 1024 > 16 KB, and I have other stuff from my kernel in shared memory already.)

Now my goal is to multiply all the matrices together in increasing thread-ID order, like:

FINAL 6*6 MATRIX = threadID(1)matrix * threadID(2)matrix * … * threadID(N)matrix, where N = total number of threads launched.

At first I thought to copy all the matrices back to the CPU and do this ordered multiplication there, but I feel I should take advantage of the threads that lie in the same block to get some level of parallelism.
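For reference, the straightforward host-side version I'd be replacing would look something like the sketch below (names and layout are my own assumptions: N row-major 6x6 matrices stored back to back, one per thread, in thread-ID order):

```cpp
#include <cstddef>
#include <vector>

// C = A * B for row-major 6x6 matrices of doubles.
static void matmul6(const double *A, const double *B, double *C) {
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < 6; ++j) {
            double s = 0.0;
            for (int k = 0; k < 6; ++k)
                s += A[i * 6 + k] * B[k * 6 + j];
            C[i * 6 + j] = s;
        }
}

// Multiply strictly in thread-ID order:
// result = M_0 * M_1 * ... * M_{n-1}.
// mats[t*36 + i*6 + j] is entry (i, j) of thread t's matrix.
std::vector<double> orderedProduct(const std::vector<double> &mats,
                                   std::size_t n) {
    std::vector<double> result(mats.begin(), mats.begin() + 36);
    double tmp[36];
    for (std::size_t t = 1; t < n; ++t) {
        matmul6(result.data(), &mats[t * 36], tmp);  // tmp avoids aliasing
        for (int i = 0; i < 36; ++i)
            result[i] = tmp[i];
    }
    return result;
}
```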

I am trying to find a way to multiply the matrices within each thread block in order, so that each thread block eventually holds one matrix:

= threadID(tid)matrix * threadID(tid+1)matrix * … * threadID(tid+blocksize-1)matrix

That way I can reduce the number of matrices that have to be multiplied on the CPU.

But I don't know how to do the ordered multiplication. One idea is to use __syncthreads() and check tid (which causes a lot of divergence), but I haven't been able to get it to work.
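One pattern I've been considering is an adjacent-pair tree reduction: matrix multiplication is associative (even though it isn't commutative), so as long as each step only combines a matrix with its immediate right-hand neighbour's partial product, the thread-ID order is preserved. A sketch of what I mean is below; all names are hypothetical, and I'm assuming the per-thread matrices sit contiguously in global memory in global thread-ID order:

```cuda
// C = A * B for row-major 6x6 matrices of doubles.
__device__ void matmul6(const double *A, const double *B, double *C) {
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < 6; ++j) {
            double s = 0.0;
            for (int k = 0; k < 6; ++k)
                s += A[i * 6 + k] * B[k * 6 + j];
            C[i * 6 + j] = s;
        }
}

// M: one 6x6 matrix (36 doubles) per thread, in global thread-ID order.
// blockResults: one 6x6 matrix per block, the block's ordered product.
__global__ void blockOrderedProduct(double *M, double *blockResults) {
    int tid = threadIdx.x;
    double *mine = M + (blockIdx.x * blockDim.x + tid) * 36;

    // Tree reduction over the matrices in this block. At each step,
    // thread tid holds the product of an adjacent run of matrices and
    // multiplies it (on the right) by the run held by thread tid+stride.
    // Combining only adjacent runs keeps the non-commutative order.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();  // outside the if, so all threads reach it
        if (tid % (2 * stride) == 0 && tid + stride < blockDim.x) {
            double tmp[36];
            matmul6(mine, mine + stride * 36, tmp);
            for (int i = 0; i < 36; ++i)
                mine[i] = tmp[i];
        }
    }

    // Thread 0 now holds the ordered product of the whole block.
    if (tid == 0)
        for (int i = 0; i < 36; ++i)
            blockResults[blockIdx.x * 36 + i] = mine[i];
}
```

The divergence in the if is the usual price of a reduction; the important part is that __syncthreads() stays outside the divergent branch so every thread in the block reaches it. The per-block results would then be multiplied in block-ID order on the CPU (or by a second kernel launch). I haven't verified this is the best approach, so corrections are welcome.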

I would really appreciate any help with this; I'm relatively new to programming in CUDA.

Thanks all