The reduction probably isn’t working: stride will never be < 0, so the for loop never executes. You also need the __syncthreads() inside the loop, because every accumResult element has to be updated before the next iteration starts.
You can make sA and sB regular variables; they don’t have to be shared memory arrays for your implementation.
The last __syncthreads() (on line 33) can be removed.
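To make the two reduction points concrete, here is a minimal sketch of a shared-memory tree reduction. It is not the original kernel: the kernel name sumBlock, the in/out parameters, SIZE = 128, and the (1, SIZE) block shape indexed by ty are assumptions for illustration. The loop condition has to stay true while stride is positive; when stride starts at SIZE / 2, a condition such as stride < 0 or stride < 1 is already false on the first check, so the body never runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 128  // assumed threads per block; must be a power of two

// Hypothetical kernel: sums SIZE floats within a single block.
__global__ void sumBlock(const float *in, float *out)
{
    __shared__ float accumResult[SIZE];
    int ty = threadIdx.y;

    accumResult[ty] = in[ty];
    __syncthreads();  // all partial values must be written before reducing

    // Runs for stride = 64, 32, ..., 1.
    for (int stride = SIZE / 2; stride > 0; stride >>= 1) {
        if (ty < stride)
            accumResult[ty] += accumResult[ty + stride];
        // Inside the loop but outside the if: every thread reaches the
        // barrier, and each iteration sees the previous one's results.
        __syncthreads();
    }

    if (ty == 0)
        *out = accumResult[0];
}

int main()
{
    float h_in[SIZE], h_out = 0.0f;
    for (int i = 0; i < SIZE; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    sumBlock<<<1, dim3(1, SIZE)>>>(d_in, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Thread 0 reads accumResult[0] right after the barrier that ends the last iteration, so no further synchronization is needed after the loop.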
Thanks for the reply. Actually, I declared sA, sB, and accumResult as shared memory arrays to speed up the kernel (it runs 10 times faster). Shared memory allows the 128 threads of a block to access data very quickly.
You’re right, the __syncthreads() should be inside the for loop.
I changed the for loop’s stop condition to stride < 1 and removed the last, useless __syncthreads().
The program is still giving me the same bad distance results.
But I still think sA and sB don’t have to be in shared memory. Every element, indexed by ty, is only used by one thread. The thread which is writing the element at ty is also reading it, but no other threads are using this element. I would change lines 15 and 16 to:
float sA = A[bx * SIZE + ty];
float sB = B[by * SIZE + ty];
I guess this will even be a little faster, as you save some writes to and reads from shared memory.
When I remove sA and sB from shared memory, the results are no longer the same. Is that normal? Moreover, it’s a bit slower. I’ve just removed the shared declaration from lines 4, 5, and 6.
sA and sB should not be arrays, just regular variables as in my previous post. In the rest of the code you should replace “sA[ty]” with “sA” and “sB[ty]” with “sB”.
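As a rough sketch of that suggestion, the kernel below keeps sA and sB as per-thread scalars and leaves only accumResult in shared memory, since the reduction is the one place where a thread reads elements written by other threads. The original kernel is not shown in this thread, so several things here are assumptions: the kernel name distanceKernel, the output array C and its indexing, the squared Euclidean distance as the metric, SIZE = 128, and the (1, SIZE) block shape.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 128  // assumed row length and threads per block (power of two)

// Hypothetical kernel: block (bx, by) compares row bx of A with row by of B
// and writes the squared Euclidean distance (assumed metric) to C.
__global__ void distanceKernel(const float *A, const float *B, float *C)
{
    __shared__ float accumResult[SIZE];  // shared: read by other threads
    int bx = blockIdx.x, by = blockIdx.y, ty = threadIdx.y;

    // Each element is used only by the thread that loads it,
    // so plain scalar variables are enough here.
    float sA = A[bx * SIZE + ty];
    float sB = B[by * SIZE + ty];

    float diff = sA - sB;
    accumResult[ty] = diff * diff;
    __syncthreads();

    for (int stride = SIZE / 2; stride > 0; stride >>= 1) {
        if (ty < stride)
            accumResult[ty] += accumResult[ty + stride];
        __syncthreads();
    }

    if (ty == 0)
        C[bx * gridDim.y + by] = accumResult[0];
}

int main()
{
    const int ROWS = 2;  // small illustrative input, not from the thread
    float hA[ROWS * SIZE], hB[ROWS * SIZE], hC[ROWS * ROWS];
    for (int i = 0; i < ROWS * SIZE; ++i) {
        hA[i] = (float)(i / SIZE + 1);  // row 0 is all 1s, row 1 is all 2s
        hB[i] = (float)(i / SIZE);      // row 0 is all 0s, row 1 is all 1s
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    distanceKernel<<<dim3(ROWS, ROWS), dim3(1, SIZE)>>>(dA, dB, dC);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < ROWS; ++j)
            printf("C[%d][%d] = %f\n", i, j, hC[i * ROWS + j]);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

Note that dropping __shared__ from an array declaration without changing its type gives every thread its own private copy of the whole array, which is different from using a scalar; for accumResult that would break the reduction, because each thread would only ever see its own element.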