Yes, it is finally working, though not very well… I just achieved 12 Gflops in double precision with a 6x speedup.
No, there is no inter-thread data communication, so I don’t use __syncthreads(). Thanks for the input :)
Yup, that actually helped… the bug was in my code: an access to unallocated shared memory inside a loop. Lame of me… I was hoping for a more dramatic performance gain. I found that I am losing a lot of performance while accessing (reading and writing) device memory. I have to access 546 + 42 elements from device memory 12 times PER THREAD :( for my current algorithm… I guess that is what is killing my application speed, even though I have (PER THREAD):
12 × 2 × 42 × 13 (42 by 13 mat-vec product, done one column after another, 12 times)
11 × 6 × 6 × 6 (6 by 6 mat-mul, 11 times)
flops / thread…
THAT’S A LOT OF FLOPS, I KNOW, so I thought I should get a lot of speedup over the CPU, but it also requires a lot of data transfer per thread.
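A rough back-of-the-envelope check of those numbers, as a sketch only. It assumes the two flop counts read as 12 × 2 × 42 × 13 and 11 × 6 × 6 × 6, that every one of the 546 + 42 elements is one 8-byte double moved to or from device memory on each of the 12 passes, and that the 12 Gflops figure counts exactly these flops. All variable names are made up for illustration.

```python
# Per-thread arithmetic intensity estimate (all inputs are assumptions from the post).
BYTES_PER_DOUBLE = 8

# Device-memory traffic: 546 + 42 elements, touched 12 times per thread.
elements_per_pass = 546 + 42
passes = 12
bytes_per_thread = elements_per_pass * passes * BYTES_PER_DOUBLE

# Flop counts as given in the post (assumed reading of the two expressions).
matvec_flops = 12 * 2 * 42 * 13  # 42x13 mat-vec, multiply + add, 12 times
matmul_flops = 11 * 6 * 6 * 6    # 6x6 mat-mul, 11 times
flops_per_thread = matvec_flops + matmul_flops

# Flops performed per byte of device-memory traffic.
intensity = flops_per_thread / bytes_per_thread

# Bandwidth the measured rate would imply if every access really hit device memory.
measured_gflops = 12.0
implied_bandwidth_gb_s = measured_gflops / intensity

print(f"bytes/thread = {bytes_per_thread}")
print(f"flops/thread = {flops_per_thread}")
print(f"flops/byte   = {intensity:.3f}")
print(f"implied GB/s = {implied_bandwidth_gb_s:.1f}")
```

Under these assumptions each thread does only about a quarter of a flop per byte moved, which fits the suspicion that the kernel is limited by device-memory traffic rather than by arithmetic.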
I guess I have to make the parallelism more fine-grained so that less data is transferred per thread.
How much is the kernel launch overhead?
I will try multiple kernel launches to achieve this somehow.