Elementwise block matrix-vector multiplication

I have a block matrix A and blocked vector v (both complex), and I want to compute the elementwise product B = Av. Let the block dimension of A be m x n, and let the size of each block be d x d. Each block A_ij of A is diagonal. Thus A is stored as an array A = [diag(A_00), diag(A_01), … diag(A_0n), diag(A_10), … , diag(A_1n), … , diag(A_mn)].

We have two types of vectors. The first type has block size n x 1 (each block has length d). In this case, we compute B by diag(B_ij) = diag(A_ij) * v_j for i = 1…m and j = 1…n (“*” here is the elementwise product for vectors).

We also deal with vectors of block size m x 1. In that case, we compute B by diag(B_ij) = diag(A_ij) * v_i for i = 1…m and j = 1…n.

B is stored the same way as A.

I am currently computing these elementwise products with a custom CUDA kernel. I launch a 2D grid where each threadblock operates on a block of A. Each thread in a block computes the elementwise product (e.g. diag(B_ij)_t = diag(A_ij)_t * (v_j)_t for the first case), looping over the block size if necessary. I profiled the kernel for both cases on an A100 GPU and am getting 85-90% memory bandwidth utilization.
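For reference, a minimal sketch of a kernel along these lines (the kernel name, the use of cuDoubleComplex, and the launch configuration are illustrative assumptions, not the exact code):

```
#include <cuComplex.h>

// Sketch only: one threadblock per block (i, j) of A, threads striding over
// the d diagonal entries. Storage assumed as described above: A and B hold
// m*n length-d blocks contiguously in block row-major order, v holds n
// length-d blocks. Case 1: diag(B_ij) = diag(A_ij) * v_j.
__global__ void blockElementwiseCase1(const cuDoubleComplex* __restrict__ A,
                                      const cuDoubleComplex* __restrict__ v,
                                      cuDoubleComplex* __restrict__ B,
                                      int n, int d)
{
    const int i = blockIdx.y;                      // block row index
    const int j = blockIdx.x;                      // block column index
    const size_t base = ((size_t)i * n + j) * d;   // start of diag(A_ij)

    for (int t = threadIdx.x; t < d; t += blockDim.x)
        B[base + t] = cuCmul(A[base + t], v[(size_t)j * d + t]);
}

// Example launch: dim3 grid(n, m); blockElementwiseCase1<<<grid, 256>>>(A, v, B, n, d);
```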

Is this the best way to do the computation, or are there any improvements?

I find the terminology here a bit confusing, in particular the use of “blocked vector”. Is this an accurate re-phrasing: A is a block-diagonal matrix, and the code performs a matrix-vector multiplication A*b with that block-diagonal matrix?

Assuming an affirmative answer: Is the block-diagonal matrix A sparse? If so, the following publication may be of interest:

Khaled Z. Ibrahim, Chao Yang, and Pieter Maris, “Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs,” 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC, 2022, pp. 58-67

One would expect this computation to be limited by memory bandwidth. If you are achieving 85%-90% of theoretical bandwidth, your code is already exhausting the available bandwidth, as that is about the maximum percentage of theoretical bandwidth that can be achieved in practice.

No, let me try to clarify what I mean here. I will try to avoid the “blocking” business altogether.

We have an m x n matrix of complex numbers (an array in row-major order: A = [a_row_1, a_row_2, …, a_row_m]). We have 2 cases:

Case 1: we have a vector v of size n x 1. The output in this case is the matrix B = [a_row_1 * v, a_row_2 * v, …, a_row_m * v]. The * means elementwise product between vectors.

Case 2: we have a vector w of size m x 1. The output is now the matrix C = [a_col_1 * w; a_col_2 * w; … ; a_col_n * w]. Again, * is the elementwise product. I showed C in column-major order since that is easier to visualize in this case. In practice, it will be stored in row-major order.
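To pin down the indexing in the two cases, here is a naive per-element sketch (the kernel names and the use of cuDoubleComplex are assumptions for illustration):

```
#include <cuComplex.h>

// Flat (non-blocked) view: A is m*n complex values in row-major order.
// Case 1: B[i][j] = A[i][j] * v[j]  -- each row scaled elementwise by v.
// Case 2: C[i][j] = A[i][j] * w[i]  -- each column scaled elementwise by w.
__global__ void rowScale(const cuDoubleComplex* __restrict__ A,
                         const cuDoubleComplex* __restrict__ v,
                         cuDoubleComplex* __restrict__ B,
                         int m, int n)
{
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < (size_t)m * n)
        B[idx] = cuCmul(A[idx], v[idx % n]);   // column index selects the v entry
}

__global__ void colScale(const cuDoubleComplex* __restrict__ A,
                         const cuDoubleComplex* __restrict__ w,
                         cuDoubleComplex* __restrict__ C,
                         int m, int n)
{
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < (size_t)m * n)
        C[idx] = cuCmul(A[idx], w[idx / n]);   // row index selects the w entry
}
```

Note that in the rowScale sketch each entry of v is re-read once per row of A.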

My kernel does achieve a high percentage of theoretical bandwidth, but it involves duplicate reads of the vector from global memory. For example, in Case 1, the vector is read once for each row of A. I can consider other kernels where the vector is only read once, and each threadblock computes a portion of the output for all rows. This kernel achieves a lower bandwidth percentage but is faster overall.
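As an illustration of that idea (not the actual kernel; the tile size and names are placeholders), a “read the vector once per threadblock” variant for Case 1 could look like this:

```
#include <cuComplex.h>

#define TILE 256   // columns per threadblock; a placeholder tuning parameter

// Case 1, "read the vector once" variant: each threadblock owns TILE
// consecutive columns, caches that slice of v in shared memory, and then
// walks down all m rows reusing the cached slice.
__global__ void rowScaleVectorOnce(const cuDoubleComplex* __restrict__ A,
                                   const cuDoubleComplex* __restrict__ v,
                                   cuDoubleComplex* __restrict__ B,
                                   int m, int n)
{
    __shared__ cuDoubleComplex vs[TILE];

    const int col = blockIdx.x * TILE + threadIdx.x;   // launch with blockDim.x == TILE
    if (col < n)
        vs[threadIdx.x] = v[col];                      // this slice of v is read exactly once
    __syncthreads();

    if (col >= n) return;

    for (int row = 0; row < m; ++row)                  // serial loop over rows per thread
        B[(size_t)row * n + col] = cuCmul(A[(size_t)row * n + col], vs[threadIdx.x]);
}

// Example launch: rowScaleVectorOnce<<<(n + TILE - 1) / TILE, TILE>>>(A, v, B, m, n);
```

Here each slice of v is loaded from global memory once per threadblock, at the cost of a serial loop over the m rows within each thread.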

Right, that seems like the correct approach. I think for case 1 it's unlikely that you will find anything faster than that. All reads and writes can be fully coalesced, and you are performing the minimum necessary reads and writes. The only remaining optimization concerns would relate to occupancy for your specific GPU. This usually (in my experience) requires some tuning and testing to get right, and is usually only a matter of a few percent, barring mistakes that involve dismally low occupancy.

Thank you. Indeed I am finding that this method (read the vector once) is best for case 1 in most cases. For case 2, you have to consider the blocked version, otherwise it doesn’t really make sense.

Here’s a small explicit example for case 2:

A = [ a_00, a_01, a_02, a_10, a_11, a_12 ] , where each a_ij is a vector of length d
w = [w_0, w_1], where each w_i is a vector of length d

Output: C = [ a_00 * w_0, a_01 * w_0, a_02 * w_0, a_10 * w_1, a_11 * w_1, a_12 * w_1 ] (elementwise products)

In that case, I’m finding that the algorithm that does each a_ij * w_i independently (with duplicate reads) works fastest.
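A sketch of that independent-blocks approach (again with assumed names and cuDoubleComplex data): it differs from the case-1 blocked kernel only in that the vector block is selected by the block-row index i.

```
#include <cuComplex.h>

// Case 2 on the blocked layout, independent-blocks version: one threadblock
// per a_ij; every threadblock in block row i re-reads w_i from global memory
// (duplicate reads of w, but all m*n blocks can run concurrently).
__global__ void blockElementwiseCase2(const cuDoubleComplex* __restrict__ A,
                                      const cuDoubleComplex* __restrict__ w,
                                      cuDoubleComplex* __restrict__ C,
                                      int n, int d)
{
    const int i = blockIdx.y;                      // block row index
    const int j = blockIdx.x;                      // block column index
    const size_t base = ((size_t)i * n + j) * d;   // start of a_ij

    for (int t = threadIdx.x; t < d; t += blockDim.x)
        C[base + t] = cuCmul(A[base + t], w[(size_t)i * d + t]);
}
```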

I’m also trying some algorithms that use 2D threadblocks and shared memory. These seem to work faster in cases where the matrix is very rectangular (m >> n or m << n).
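For illustration, one possible shape such a 2D-threadblock kernel could take for case 1 (the tile sizes TX and TY are placeholder tuning parameters, not measured values):

```
#include <cuComplex.h>

#define TX 32   // columns per tile (placeholder)
#define TY 8    // rows per tile (placeholder)

// Case 1 with a 2D threadblock: each block covers a TX x TY tile of A,
// loads TX entries of v into shared memory once, and reuses them across
// TY rows. v is then re-read about m/TY times instead of m times, while
// the 2D grid still exposes many independent blocks.
__global__ void rowScaleTiled(const cuDoubleComplex* __restrict__ A,
                              const cuDoubleComplex* __restrict__ v,
                              cuDoubleComplex* __restrict__ B,
                              int m, int n)
{
    __shared__ cuDoubleComplex vs[TX];

    const int col = blockIdx.x * TX + threadIdx.x;
    const int row = blockIdx.y * TY + threadIdx.y;

    if (threadIdx.y == 0 && col < n)
        vs[threadIdx.x] = v[col];      // one read of this v slice per threadblock
    __syncthreads();

    if (col < n && row < m)
        B[(size_t)row * n + col] = cuCmul(A[(size_t)row * n + col], vs[threadIdx.x]);
}

// Example launch:
// dim3 block(TX, TY);
// dim3 grid((n + TX - 1) / TX, (m + TY - 1) / TY);
// rowScaleTiled<<<grid, block>>>(A, v, B, m, n);
```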

Is there a general rule to follow when considering global memory read/writes vs the amount of parallelism?

For example, reading the vector once here uses minimum global memory access, but decreases parallelism since the vector is sequentially multiplied with each row. Reading the vector once per row increases the global memory access, but increases parallelism since the product with each row can be done concurrently.

Is there some kind of middle-ground here?

Divide the throughput you care about (e.g. FP32 flops, or FP64 flops, or …) by the memory bandwidth, for the GPU you are running on. The result is a flops/byte ratio. Most GPUs have a number that is noticeably larger than 1 here. If your actual flops/byte ratio (what your code does) is higher than the GPU ratio, then you can “afford” to do some extra memory traffic.
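For a rough feel, using approximate published A100 figures (about 19.5 TFLOP/s FP32, 9.7 TFLOP/s FP64, and roughly 1.5-2 TB/s of HBM bandwidth depending on the variant), the machine balance is on the order of 5-13 flops per byte. An elementwise complex multiply is about 6 flops against 48 bytes of double-complex traffic (two inputs and one output), i.e. roughly 0.125 flops per byte, so this kernel sits far below the machine balance.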

Your code is not likely to be in this category. Extra memory traffic will likely translate into longer kernel duration. My understanding is you have already empirically measured this and found it to be true. There is no “middle ground” to exploit unless you have more compute work to be done, on data you have already loaded.

