You have a coding error here:
sum = d_A[i*m+k] * d_B[k*n+j];
it should be:
sum += d_A[i*m+k] * d_B[k*n+j];
For the matrix-multiply kernel design you have shown, a couple of comments come to mind:
- That is not a high-performance matrix multiply. If you care about performance, in my opinion you are starting in the wrong place. I mention this because, absent specific reasons to do otherwise, the straightforward, relatively high-performance path here is not to write your own code but to use a library like CUBLAS for both matrix ops.
- Focusing on the question, for learning purposes: your matrix-multiply kernel “expects” (i.e. requires) a thread array that is at least the size of the B matrix. The matrix-vector product doesn’t necessarily require that, so you have implementation choices. Do you want to use more threads, for potentially higher performance but more code complexity, or fewer threads (e.g. one thread per output point, a typical CUDA thread strategy) for potentially lower performance but lower code complexity? Choosing fewer threads in a unified kernel would also immediately raise the question of which threads to use and what the remaining threads would do (probably: nothing). A minimal sketch of the simpler one-thread-per-output approach appears right after this list.
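For reference, here is a minimal sketch (my own, not taken from your code) of that lower-complexity alternative: a separate matrix-vector kernel with one thread per output element of r = (AxB)xw. It assumes the mxn product matrix d_C = AxB was already produced by your matrix-multiply kernel; the kernel name and variable names are just placeholders.

// hypothetical sketch: one thread per output element, r = C * w, where C = A x B (m x n, row-major)
__global__ void kmv(const float *d_C, const float *d_w, float *d_r, int m, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output element r[i]
  if (i < m) {
    float sum = 0;
    for (int j = 0; j < n; j++)                   // dot product of row i of C with w
      sum += d_C[i*n+j] * d_w[j];
    d_r[i] = sum;                                 // each thread owns one output, so no atomics needed
  }
}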
Let’s suppose you wanted the higher performance (and accepted the extra code complexity) and want to use all the provisioned threads (i.e. all the threads used by the preceding matrix-multiply step). Then you would have each “row” of threads handle the element-wise multiplication with the vector, very similar to the “multiplication” step of the matrix multiply, and for the addition step you would perform one row-wise reduction to produce a single output element, which is the analog of the “summation” step in the preceding matrix-multiply op.
Something like this (r=AxBxw):
__global__ void kmmv(const float *d_A, const float *d_B, int m, int n, const float *d_w, float *d_r)
{
  float sum = 0;
  int i = blockIdx.y * blockDim.y + threadIdx.y; // row index
  int j = blockIdx.x * blockDim.x + threadIdx.x; // column index
  for (int k = 0; k < m; k++)                    // matrix-multiply step
    sum += d_A[i*m+k] * d_B[k*n+j];
  sum *= d_w[j];                                 // element-wise multiplication for matrix-vector step
  atomicAdd(d_r+i, sum);                         // row-wise reduction for matrix-vector step
}
Note that we are doing an atomic operation here to “reduce” the per-thread matrix-vector product elements into the final result element. This requires that the d_r array (the same size as the d_w input array, given the m=n simplification below) be set to zero in host code prior to the kernel call.
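For example, a minimal host-side sketch might look like the following (assumptions: m = n, both multiples of the block dimensions since the kernel has no bounds checks, and d_A, d_B, d_w, d_r already allocated and populated on the device; the launch configuration is only illustrative):

// hypothetical host-side setup; m == n, both multiples of 16, device buffers already allocated
cudaMemset(d_r, 0, m * sizeof(float));     // zero the output vector before the atomic row-wise reduction
dim3 block(16, 16);
dim3 grid(n / block.x, m / block.y);       // exactly cover the m x n product (kernel has no bounds checks)
kmmv<<<grid, block>>>(d_A, d_B, m, n, d_w, d_r);
cudaDeviceSynchronize();                   // then e.g. cudaMemcpy d_r back to the host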
You can get additional orientation on reduction techniques here and also in unit 5 here. There are certainly lots of different possible approaches, as is often the case in computer science. Again, I offer this info for learning purposes. For performance, without further information, my advice would be to use a high-quality library like CUBLAS.
Also note that I don’t think your use of dimensions is correct:
So that is AxBxw
The result of AxB is an mxn matrix. You cannot multiply that by an mx1 vector.
For simplicity my treatment assumes m=n, which makes these considerations disappear.
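As a quick dimension check (my notation, not from your post): (m x k) times (k x n) gives (m x n), and (m x n) times (n x 1) gives (m x 1); so w would need to be nx1, and an mx1 w only fits when m = n.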