A CUDA kernel that combines two operations on a matrix

In serial code it is fairly straightforward to write a function that takes two (flattened) matrices A and B, calculates their product (which is stored back in B), and then multiplies B by a vector w. For example:

// A is an m x m matrix, B is an m x n matrix, w is an m x 1 vector
void generateAverage(const float *A, float *B, int m, int n, const float *w)
{
   float sum ;
   // matrix multiplication A (m x m) * B (m x n)
   for(int i=0; i<m; i++){
      for(int j=0; j<n; j++){
          sum = 0;
          for(int k=0; k<m; k++)
             sum += A[i*m+k] * B[k*n+j];
          B[i*n+j] = sum ;
       }        
   }

   //  B = B * w
   for(int j=0; j<n; j++){
      sum = 0;
      for(int i=0; i<m; i++){
          sum += B[i*n+j] * w[i];
       }        
        B[(m-1)*n+j] = sum;  // store weighted average in last row of B
   }
}

Using the example from the CUDA C Programming Guide, I can write a kernel that does the matrix-multiplication part:

__global__ void kmm(const float *d_A, float *d_B, int m, int n, const float *d_w)
{
  float sum = 0;
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
  for(int k=0; k<m; k++)
      sum = d_A[i*m+k] * d_B[k*n+j];
  d_B[i*n+j] = sum;

 // ... todo
}

However, I cannot see how to add the part that multiplies the matrix by the vector. Would anyone have some advice on how to go about it?

You would do the sum block-wise and then combine the results of the blocks. Alternatively, apply the multiplication with w to the respective input matrix first; the two products are associative.
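For instance, in the serial version the associativity can be exploited by forming the matrix-vector product first and only then multiplying by A, which avoids materializing A*B at all. A minimal sketch of that reordering (my own illustration, not from the thread; it assumes m == n, as the answer below also does, follows the same r = (A*B)*w convention as the kernel shown later, and writes to a separate output r instead of back into B):

#include <stdlib.h>

// r = A * (B * w), exploiting (A*B)*w == A*(B*w); matrices are m x m, vectors m x 1
void generateAverage2(const float *A, const float *B, int m, const float *w, float *r)
{
   float *v = (float*)malloc(m * sizeof(float));

   // v = B * w
   for (int i = 0; i < m; i++) {
      float sum = 0;
      for (int j = 0; j < m; j++)
         sum += B[i*m+j] * w[j];
      v[i] = sum;
   }

   // r = A * v
   for (int i = 0; i < m; i++) {
      float sum = 0;
      for (int k = 0; k < m; k++)
         sum += A[i*m+k] * v[k];
      r[i] = sum;
   }

   free(v);
}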

You have a coding error here:

  sum = d_A[i*m+k] * d_B[k*n+j];

it should be:

  sum += d_A[i*m+k] * d_B[k*n+j];

For the matrix-multiply kernel design you have shown, a couple of comments come to mind:

  1. That is not a high-performance matrix multiply. If you care about performance, in my opinion you are starting in the wrong place. I mention this because, without specific reasons to do otherwise, the straightforward “relatively high performance” path here is probably not to write your own code but to use a library (for both matrix ops) like CUBLAS.
  2. Focusing on the question, for learning purposes: your matrix-multiply kernel “expects”, i.e. requires, a thread array that is at least the size of the B matrix. However, the matrix-vector product doesn’t necessarily require that. So you have implementation choices. Do you want to use more threads, with potentially higher performance but also more code complexity, or do you want to use fewer threads (e.g. one thread per output point, a typical CUDA thread strategy) with potentially lower performance but lower code complexity? Choosing fewer threads in a unified kernel immediately raises the question of which threads to use and what the remaining threads would do (probably: nothing).

Let’s suppose you wanted higher performance (accepting the extra code complexity) and used all the provisioned threads (all the threads that were used by the preceding matrix-multiply step). Then you would have each “row” of threads handle a row-wise multiplication with the vector, very similar to the “multiplication” step of the matrix multiply, and for the addition step you would perform one “row-wise” reduction to produce a single output element, the analog of the “summation” step in the preceding matrix-multiply op.

Something like this (r=AxBxw):

__global__ void kmmv(const float *d_A, const float *d_B, int m, int n, const float *d_w, float *d_r)
{
  float sum = 0;
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
  for(int k=0; k<m; k++) // matrix-multiply step
      sum += d_A[i*m+k] * d_B[k*n+j];
  sum *= d_w[j];  // element-wise multiplication for matrix-vector step
  atomicAdd(d_r+i, sum); // row-wise reduction for matrix vector step
}

Note that we are doing an atomic operation here to “reduce” multiple thread matrix-vector product elements into the final result element. This will require that the d_r array (same size as d_w input array) be set to zero in host code prior to the kernel call.
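A minimal host-side sketch of that setup (the block shape and the exact-coverage assumption are my own choices for illustration; it assumes m == n and that both are multiples of the block dimensions):

// zero the output first, since the kernel accumulates into it atomically
cudaMemset(d_r, 0, m * sizeof(float));

dim3 block(16, 16);
dim3 grid(n / block.x, m / block.y);   // assumes m and n are multiples of 16
kmmv<<<grid, block>>>(d_A, d_B, m, n, d_w, d_r);
cudaDeviceSynchronize();               // or copy d_r back, which also synchronizes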

You can get additional orientation on reduction activities here and also in unit 5 here. There are certainly lots of different possible approaches here, as is often the case in computer science. Again, I advance this info for “learning”. For performance, without further information, my advice would be to use a high-quality library like CUBLAS.

Also note that I don’t think your use of dimensions is correct. The operation is A x B x w; the result of A x B is an m x n matrix, and you cannot multiply that by an m x 1 vector.

For simplicity my treatment assumes m=n, which makes these considerations disappear.

Thanks for the reply.

Regarding dimensions, the correct specification is “multiply with the transpose of vector w”; then 1 x m * m x n = 1 x n. Regarding CUBLAS, I agree that the code there is far more optimized; however, I find it very difficult to use because setting up its inputs is quite convoluted.
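For reference, a rough sketch of what the CUBLAS path for w^T * (A * B) might look like with the row-major arrays used above (d_C is a temporary m x n buffer and d_r holds the 1 x n result; the column-major convention and the leading-dimension arguments are exactly the part that feels convoluted):

#include <cublas_v2.h>

// d_A: m x m, d_B: m x n, d_w: m x 1, d_C: m x n scratch, d_r: n x 1 result.
// The arrays are row-major, so cuBLAS (column-major) sees each one as its transpose.
cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;

// C = A * B (row-major)  <=>  C^T = B^T * A^T (column-major, which is what cuBLAS computes)
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            n, m, m,        // dimensions of the column-major result C^T (n x m), inner dim m
            &alpha,
            d_B, n,         // B^T is n x m, leading dimension n
            d_A, m,         // A^T is m x m, leading dimension m
            &beta,
            d_C, n);        // C^T is n x m, leading dimension n

// r^T = C^T * w, i.e. r = w^T * C (a 1 x n row vector)
cublasSgemv(handle, CUBLAS_OP_N,
            n, m,           // the stored column-major matrix C^T is n x m
            &alpha, d_C, n,
            d_w, 1,
            &beta, d_r, 1);

cublasDestroy(handle);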

I can see that you bypassed the redundant step of producing a matrix and then multiplying it by the weights, by applying the weights directly to the sum. For learning purposes, my question is: can we synchronize threads across blocks and then add another for loop like in the serial code?

You mention that, as written, the matrix multiplication is not a high performance matrix multiply. Is this because it does not use shared memory or are there other problems as well?

You can. I mentioned that there are many ways to accomplish this task. You could look into cooperative groups, which provide a grid-wide sync mechanism. Such a mechanism is not necessarily “free” or “optimal”, however; if you try it, you may find it performs more slowly than the kernel I have shown.

A for-loop would presumably mean that one thread is handling multiple multiply ops (and perhaps summations) associated with the dot product it is assigned. This immediately raises the thread-selection question I mentioned, if you are doing this in a thread array (grid) that was sized for the matrix multiply. But that is straightforward to manage, if you wanted to do it that way.

Since CUDA is largely C++, you have considerable flexibility in how you do things. For example, with a grid sync, you could have a single thread perform all remaining work associated with the matrix-vector product. That probably wouldn’t be a good idea from a performance perspective. The point is, anything between that extreme and the “other extreme” I have shown, where each existing thread does a very small piece of the matrix-vector work, is possible, with appropriate thread selection, loops, and indexing.
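To make the “synchronize across blocks, then loop like the serial code” idea concrete, here is a sketch of a fused kernel using a cooperative-groups grid sync (my own illustration, not code from the thread). It assumes m == n and that the grid exactly covers the m x n matrix, and it must be launched with cudaLaunchCooperativeKernel, which also limits the grid to what can be resident on the device at once. Note the sync before overwriting B in place, since other threads may still be reading the old values:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kmmv_gridsync(const float *d_A, float *d_B, int m, int n,
                              const float *d_w, float *d_r)
{
  cg::grid_group grid = cg::this_grid();
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index

  // matrix-multiply step: one thread per element of A*B (reads the old B only)
  float sum = 0;
  for (int k = 0; k < m; k++)
      sum += d_A[i*m+k] * d_B[k*n+j];

  grid.sync();          // every thread has finished reading the old B
  d_B[i*n+j] = sum;     // now it is safe to overwrite B with A*B in place
  grid.sync();          // all of A*B has been written

  // matrix-vector step: thread selection -- one thread per row loops over the
  // columns like the serial code; all the other threads simply do nothing here
  if (j == 0) {
      float dot = 0;
      for (int c = 0; c < n; c++)
          dot += d_B[i*n+c] * d_w[c];
      d_r[i] = dot;
  }
}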

A high-performance matrix multiply would probably use shared memory, but it would also use other techniques to minimize redundant loads and to optimally (re)use operands, and probably other things as well. A high-performance matrix multiply is generally not a trivial thing to write. You can get some idea of what is involved by studying the CUTLASS documentation; CUTLASS would be another choice for a higher-performance matrix multiply, in addition to CUBLAS.
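Just to illustrate the shared-memory part of that, here is a minimal version of the classic tiled multiply from the CUDA programming guide (still nowhere near CUBLAS/CUTLASS performance; for brevity it assumes m and n are multiples of the tile size and writes to a separate output d_C):

#define TILE 16

__global__ void kmm_tiled(const float *d_A, const float *d_B, float *d_C, int m, int n)
{
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];
  int i = blockIdx.y * TILE + threadIdx.y;   // row of the output
  int j = blockIdx.x * TILE + threadIdx.x;   // column of the output
  float sum = 0;
  for (int t = 0; t < m / TILE; t++) {       // walk the inner dimension tile by tile
    As[threadIdx.y][threadIdx.x] = d_A[i*m + t*TILE + threadIdx.x];
    Bs[threadIdx.y][threadIdx.x] = d_B[(t*TILE + threadIdx.y)*n + j];
    __syncthreads();                         // tile fully loaded
    for (int k = 0; k < TILE; k++)           // each operand is loaded once per tile
      sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();                         // done with this tile before reloading
  }
  d_C[i*n + j] = sum;
}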

You could use 2 kernels. If run from the same stream, they are implicitly synchronized: the second kernel does not start until the first has finished. That would be the simplest approach, as you can give each kernel the most suitable launch configuration.
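A minimal sketch of that two-kernel idea (the kernel names, the separate m x n buffer d_C, and the launch configurations are my own choices, not from the thread). Using a separate output buffer also avoids overwriting B while other threads are still reading it:

__global__ void k_matmul(const float *d_A, const float *d_B, float *d_C, int m, int n)
{
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
  if (i < m && j < n) {
    float sum = 0;
    for (int k = 0; k < m; k++)
      sum += d_A[i*m+k] * d_B[k*n+j];
    d_C[i*n+j] = sum;
  }
}

__global__ void k_matvec(const float *d_C, const float *d_w, float *d_r, int m, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output row
  if (i < m) {
    float dot = 0;
    for (int j = 0; j < n; j++)
      dot += d_C[i*n+j] * d_w[j];
    d_r[i] = dot;
  }
}

// host side: both launches go into the default stream, so k_matvec
// does not begin until k_matmul has completed
// dim3 block(16, 16);
// dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
// k_matmul<<<grid, block>>>(d_A, d_B, d_C, m, n);
// k_matvec<<<(m + 255) / 256, 256>>>(d_C, d_w, d_r, m, n);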

If you stay with the single-kernel approach with a grid-wide synchronization and want to use multiple threads and blocks for the second step, you can reinterpret the threadIdx and blockIdx values:

E.g. you can make a 2D configuration into a 1D one:

int threadId  = threadIdx.y * blockDim.x + threadIdx.x;
int blockId   = blockIdx.y  * gridDim.x  + blockIdx.x;
int blockSize = blockDim.x  * blockDim.y;
int gridSize  = gridDim.x   * gridDim.y;
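Continuing that idea with a small sketch of my own (using the d_C / d_w / d_r names from the sketches above), those values can be combined into a single global index so that whatever threads exist can stride over the rows of the matrix-vector step after the grid-wide sync:

int globalId     = blockId * blockSize + threadId;   // 0 .. totalThreads-1
int totalThreads = gridSize * blockSize;

// e.g. a grid-stride loop over the m rows of the matrix-vector product
for (int row = globalId; row < m; row += totalThreads) {
    float dot = 0;
    for (int col = 0; col < n; col++)
        dot += d_C[row*n + col] * d_w[col];   // d_C holds A*B
    d_r[row] = dot;
}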