Multiplying Rectangular Matrices


I am doing a computation that requires multiplying two rectangular matrices, in fact a matrix and its transpose: (3xn)*(nx3). With a block size of 40x3, my threadIdx.x (say tx) goes from 0 to 39 and threadIdx.y (say ty) from 0 to 2. To multiply these matrices I would write the code as

for (int k = 0; k < blockDim.x; ++k)

  Csub += A(k, ty) * At(tx, k);


The result is obviously a 3x3 matrix. But the problem with the code above is that tx runs well past 2, so most threads perform unnecessary computation. This is proving computationally expensive in my implementation.

Would it be possible to use __syncthreads() and still stop the index from going beyond the 3x3 range? Thanks in advance for any help.


First, your tx and ty should be indices into the product matrix. In your case, both should vary between 0 and 2, inclusive (since your output is 3x3).
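A minimal sketch of that mapping, assuming A is a 3xn matrix stored row-major and the kernel is launched with a single 3x3 thread block (the kernel name and storage layout are my own for illustration):

    // Hypothetical sketch: one thread per element of the 3x3 product
    // C = A * A^T, with A stored row-major as 3 x n. Launch with a 3x3 block.
    __global__ void smallProduct(const float *A, float *C, int n)
    {
        int tx = threadIdx.x;   // output column, 0..2
        int ty = threadIdx.y;   // output row,    0..2

        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[ty * n + k] * A[tx * n + k];   // (A * A^T)(ty, tx)

        C[ty * 3 + tx] = sum;
    }

Note that indexing A this way makes At(tx, k) simply A(tx, k)'s transpose-free twin, so no separate transposed copy is needed.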

Second, such parallelization will be very inefficient since only 9 threads will be active. That’s not enough to fill up even a single threadblock. You should also parallelize the inner loop (the one indexed by k). Think of it this way - you have to compute a total of 3x3x40 scalar products and the same number of adds. All the products can be computed in parallel since there are no dependencies (you may have to consider shared memory banking, though). After that, you’re left with 3x3 parallel sums, which you can do efficiently along the lines of a prefix-sum/scan implementation.
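One way to sketch this, under the same assumed row-major 3xn layout as above: give the block a third dimension of K threads (a power of two), let each (ty, tx) slice split the dot product over k, and reduce the partial sums in shared memory with a tree reduction. The kernel name and the choice of K are assumptions for illustration:

    // Hypothetical sketch: block of 3 x 3 x K threads, K a power of two
    // (e.g. 64, so 576 threads total). Shared memory size: 3*3*K floats.
    __global__ void smallProductReduce(const float *A, float *C, int n)
    {
        extern __shared__ float partial[];      // 3 * 3 * K partial sums
        int tx = threadIdx.x, ty = threadIdx.y, tk = threadIdx.z;
        int K  = blockDim.z;
        int slot = (ty * 3 + tx) * K + tk;

        // Strided partial dot product: thread tk handles k = tk, tk+K, ...
        float sum = 0.0f;
        for (int k = tk; k < n; k += K)
            sum += A[ty * n + k] * A[tx * n + k];
        partial[slot] = sum;
        __syncthreads();

        // Tree reduction over the K partial sums of each output element
        for (int s = K / 2; s > 0; s >>= 1) {
            if (tk < s)
                partial[slot] += partial[slot + s];
            __syncthreads();
        }

        if (tk == 0)
            C[ty * 3 + tx] = partial[(ty * 3 + tx) * K];
    }

Launch with dim3 block(3, 3, K) and 3*3*K*sizeof(float) bytes of dynamic shared memory; for n = 40 most of the K lanes do one or two products each, which is fine since the point is occupancy, not per-thread work.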

Still, your input size is probably too small to get a significant speedup out of the hardware. However, if this is part of a much larger CUDA computation, the goal becomes avoiding any syncing with the CPU, which this will help with.