I am doing a computation which requires the multiplication of two rectangular matrices, in fact a matrix and its transpose… (3xn)*(nx3). If I have a block size of 40x3, my threadIdx.x (say tx) goes from 0 to 39 and threadIdx.y (say ty) from 0 to 2. To multiply these matrices I would write the code as
for (int k = 0; k < blockDim.x; ++k) Csub += A(k,ty) * At(tx,k);
__syncthreads();
The result is obviously a 3x3 matrix. The problem with the code above is that the index tx runs well beyond the 3 columns of the result (up to 39), so most threads do unnecessary computation. This is proving to be computationally expensive in my implementation.
Would it be possible to use __syncthreads() and still stop the index from going beyond 3? Thanks in advance for any help.
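To make the question concrete, here is a minimal sketch of the kind of guard I have in mind. It is only an illustration under assumptions not stated above: A is stored row-major as 3 x n in global memory, the result C is accumulated with atomicAdd, and the names gramKernel, gram3x3, TILE and DIM are hypothetical. All 40x3 threads load one element of a tile into shared memory and reach __syncthreads(), and only the threads with tx < 3 do the accumulation.

```cuda
#include <cuda_runtime.h>

#define TILE 40  // columns of A handled per block (blockDim.x), as in the post
#define DIM  3   // rows of A (blockDim.y)

// Hypothetical kernel: adds this block's contribution to C = A * A^T.
// Assumes A is 3 x n, row-major, and C (3x3) was zeroed before launch.
__global__ void gramKernel(const float* A, float* C, int n)
{
    __shared__ float tile[DIM][TILE];

    int tx  = threadIdx.x;              // 0..39
    int ty  = threadIdx.y;              // 0..2
    int col = blockIdx.x * TILE + tx;   // global column of A for this thread

    // Every thread loads one element; columns past the end are padded with 0.
    tile[ty][tx] = (col < n) ? A[ty * n + col] : 0.0f;

    // The barrier sits outside the conditional, so all threads reach it.
    __syncthreads();

    // Only a 3x3 subset of the block does the multiply-accumulate.
    if (tx < DIM) {
        float Csub = 0.0f;
        for (int k = 0; k < TILE; ++k)
            Csub += tile[ty][k] * tile[tx][k];   // A(ty,k) * At(k,tx)
        // Float atomicAdd needs compute capability 2.0 or newer.
        atomicAdd(&C[ty * DIM + tx], Csub);
    }
}

// Hypothetical host wrapper: hA is 3 x n (row-major), hC receives the 3x3 result.
// Error checking is omitted to keep the sketch short.
void gram3x3(const float* hA, int n, float* hC)
{
    float *dA = nullptr, *dC = nullptr;
    cudaMalloc(&dA, DIM * n * sizeof(float));
    cudaMalloc(&dC, DIM * DIM * sizeof(float));
    cudaMemcpy(dA, hA, DIM * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dC, 0, DIM * DIM * sizeof(float));

    dim3 block(TILE, DIM);               // 40 x 3 threads per block
    dim3 grid((n + TILE - 1) / TILE);    // one block per 40-column tile
    gramKernel<<<grid, block>>>(dA, dC, n);

    cudaMemcpy(hC, dC, DIM * DIM * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dC);
}
```

The barrier is kept outside the if because __syncthreads() inside a branch that only some threads of the block take can hang or behave unpredictably. With this layout the extra threads only help with the load, not the arithmetic.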