Elementwise block matrix-vector multiplication

I have a block matrix A and blocked vector v (both complex), and I want to compute the elementwise product B = Av. Let the block dimension of A be m x n, and let the size of each block be d x d. Each block A_ij of A is diagonal. Thus A is stored as an array A = [diag(A_00), diag(A_01), … diag(A_0n), diag(A_10), … , diag(A_1n), … , diag(A_mn)].

We have two types of vectors. The first type has block size n x 1 (each block has length d). In this case, we compute B by diag(B_ij) = diag(A_ij) * v_j for i = 1…m and j = 1…n (“*” here is the elementwise product for vectors).

We also deal with vectors of block size m x 1. In that case, we compute B by diag(B_ij) = diag(A_ij) * v_i for i = 1…m and j = 1…n.

B is stored the same way as A.

I am currently computing these elementwise products with a custom CUDA kernel. I launch a 2D grid where each threadblock operates on a block of A. Each thread in a block computes the elementwise product (e.g. diag(B_ij)_t = diag(A_ij)_t * (v_j)_t for the first case), looping over the block size if necessary. I profiled the kernel for both cases on an A100 GPU and am getting 85-90% memory bandwidth utilization.
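For reference, a minimal sketch of a kernel along these lines (the kernel name, the use of cuDoubleComplex, and the launch configuration are illustrative assumptions, not the exact code):

```
#include <cuComplex.h>

// Sketch only: one threadblock per block (i, j) of A, threads striding over
// the d diagonal entries. Storage assumed as described above: A and B hold
// m*n length-d blocks contiguously in block row-major order, v holds n
// length-d blocks. Case 1: diag(B_ij) = diag(A_ij) * v_j.
__global__ void blockElementwiseCase1(const cuDoubleComplex* __restrict__ A,
                                      const cuDoubleComplex* __restrict__ v,
                                      cuDoubleComplex* __restrict__ B,
                                      int n, int d)
{
    const int i = blockIdx.y;                      // block row index
    const int j = blockIdx.x;                      // block column index
    const size_t base = ((size_t)i * n + j) * d;   // start of diag(A_ij)

    for (int t = threadIdx.x; t < d; t += blockDim.x)
        B[base + t] = cuCmul(A[base + t], v[(size_t)j * d + t]);
}

// Example launch: dim3 grid(n, m); blockElementwiseCase1<<<grid, 256>>>(A, v, B, n, d);
```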

Is this the best way to do the computation, or are there any improvements?

I find the terminology here a bit confusing, in particular the use of “blocked vector”. Is this an accurate re-phrasing: A is a block-diagonal matrix, and the code performs a matrix-vector multiplication A*b with that block-diagonal matrix?

Assuming an affirmative answer: Is the block-diagonal matrix A sparse? If so, the following publication may be of interest:

Khaled Z. Ibrahim, Chao Yang, and Pieter Maris, “Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs,” 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC, 2022, pp. 58-67

One would expect this computation to be limited by memory bandwidth. If you are achieving 85%-90% of theoretical bandwidth, your code is already exhausting the available bandwidth, as that is about the maximum percentage of theoretical bandwidth that can be achieved in practice.

No, let me try to clarify what I mean here. I will try to avoid the “blocking” business altogether.

We have an m x n matrix of complex numbers (an array in row-major order: A = [a_row_1, a_row_2, …, a_row_m]). We have 2 cases:

Case 1: we have a vector v of size n x 1. The output in this case is the matrix B = [a_row_1 * v, a_row_2 * v, …, a_row_m * v]. The * means elementwise product between vectors.

Case 2: we have a vector w of size m x 1. The output is now the matrix C = [a_col_1 * w; a_col_2 * w; … ; a_col_n * w]. Again, * is the elementwise product. I showed C in column-major order since that is easier to visualize in this case. In practice, it will be stored in row-major order.
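To pin down the indexing in the two cases, here is a naive per-element sketch (the kernel names and the use of cuDoubleComplex are assumptions for illustration):

```
#include <cuComplex.h>

// Flat (non-blocked) view: A is m*n complex values in row-major order.
// Case 1: B[i][j] = A[i][j] * v[j]  -- each row scaled elementwise by v.
// Case 2: C[i][j] = A[i][j] * w[i]  -- each column scaled elementwise by w.
__global__ void rowScale(const cuDoubleComplex* __restrict__ A,
                         const cuDoubleComplex* __restrict__ v,
                         cuDoubleComplex* __restrict__ B,
                         int m, int n)
{
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < (size_t)m * n)
        B[idx] = cuCmul(A[idx], v[idx % n]);   // column index selects the v entry
}

__global__ void colScale(const cuDoubleComplex* __restrict__ A,
                         const cuDoubleComplex* __restrict__ w,
                         cuDoubleComplex* __restrict__ C,
                         int m, int n)
{
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < (size_t)m * n)
        C[idx] = cuCmul(A[idx], w[idx / n]);   // row index selects the w entry
}
```

Note that in the rowScale sketch each entry of v is re-read once per row of A.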

My kernel does achieve a high percentage of theoretical bandwidth, but it involves duplicate reads of the vector from global memory. For example, in Case 1, the vector is read once for each row of A. I can consider other kernels where the vector is only read once, and each threadblock computes a portion of the output for all rows. This kernel achieves a lower bandwidth percentage but is faster overall.
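As an illustration of that idea (not the actual kernel; the tile size and names are placeholders), a “read the vector once per threadblock” variant for Case 1 could look like this:

```
#include <cuComplex.h>

#define TILE 256   // columns per threadblock; a placeholder tuning parameter

// Case 1, "read the vector once" variant: each threadblock owns TILE
// consecutive columns, caches that slice of v in shared memory, and then
// walks down all m rows reusing the cached slice.
__global__ void rowScaleVectorOnce(const cuDoubleComplex* __restrict__ A,
                                   const cuDoubleComplex* __restrict__ v,
                                   cuDoubleComplex* __restrict__ B,
                                   int m, int n)
{
    __shared__ cuDoubleComplex vs[TILE];

    const int col = blockIdx.x * TILE + threadIdx.x;   // launch with blockDim.x == TILE
    if (col < n)
        vs[threadIdx.x] = v[col];                      // this slice of v is read exactly once
    __syncthreads();

    if (col >= n) return;

    for (int row = 0; row < m; ++row)                  // serial loop over rows per thread
        B[(size_t)row * n + col] = cuCmul(A[(size_t)row * n + col], vs[threadIdx.x]);
}

// Example launch: rowScaleVectorOnce<<<(n + TILE - 1) / TILE, TILE>>>(A, v, B, m, n);
```

Here each slice of v is loaded from global memory once per threadblock, at the cost of a serial loop over the m rows within each thread.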

Right, that seems like the correct approach. I think for case 1 it's unlikely that you will find anything faster than that. All reads and writes can be fully coalesced, and you are performing the minimum necessary reads and writes. The only remaining optimization concerns would relate to occupancy for your specific GPU. This usually (in my experience) requires some tuning and testing to get right, and is usually only a matter of a few percent, barring mistakes that involve dismally low occupancy.

Thank you. Indeed I am finding that this method (read the vector once) is best for case 1 in most cases. For case 2, you have to consider the blocked version, otherwise it doesn’t really make sense.

Here’s a small explicit example for case 2:

A = [ a_00, a_01, a_02, a_10, a_11, a_12 ] , where each a_ij is a vector of length d
w = [w_0, w_1], where each w_i is a vector of length d

Output: C = [ a_00 * w_0, a_01 * w_0, a_02 * w_0, a_10 * w_1, a_11 * w_1, a_12 * w_1 ] (elementwise products)

In that case, I’m finding that the algorithm that does each a_ij * w_i independently (with duplicate reads) works fastest.
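A sketch of that independent-blocks approach (again with assumed names and cuDoubleComplex data): it differs from the case-1 blocked kernel only in that the vector block is selected by the block-row index i.

```
#include <cuComplex.h>

// Case 2 on the blocked layout, independent-blocks version: one threadblock
// per a_ij; every threadblock in block row i re-reads w_i from global memory
// (duplicate reads of w, but all m*n blocks can run concurrently).
__global__ void blockElementwiseCase2(const cuDoubleComplex* __restrict__ A,
                                      const cuDoubleComplex* __restrict__ w,
                                      cuDoubleComplex* __restrict__ C,
                                      int n, int d)
{
    const int i = blockIdx.y;                      // block row index
    const int j = blockIdx.x;                      // block column index
    const size_t base = ((size_t)i * n + j) * d;   // start of a_ij

    for (int t = threadIdx.x; t < d; t += blockDim.x)
        C[base + t] = cuCmul(A[base + t], w[(size_t)i * d + t]);
}
```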

I’m also trying some algorithms that use 2D threadblocks and shared memory. These seem to work faster in cases where the matrix is very rectangular (m >> n or m << n).
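For illustration, one possible shape such a 2D-threadblock kernel could take for case 1 (the tile sizes TX and TY are placeholder tuning parameters, not measured values):

```
#include <cuComplex.h>

#define TX 32   // columns per tile (placeholder)
#define TY 8    // rows per tile (placeholder)

// Case 1 with a 2D threadblock: each block covers a TX x TY tile of A,
// loads TX entries of v into shared memory once, and reuses them across
// TY rows. v is then re-read about m/TY times instead of m times, while
// the 2D grid still exposes many independent blocks.
__global__ void rowScaleTiled(const cuDoubleComplex* __restrict__ A,
                              const cuDoubleComplex* __restrict__ v,
                              cuDoubleComplex* __restrict__ B,
                              int m, int n)
{
    __shared__ cuDoubleComplex vs[TX];

    const int col = blockIdx.x * TX + threadIdx.x;
    const int row = blockIdx.y * TY + threadIdx.y;

    if (threadIdx.y == 0 && col < n)
        vs[threadIdx.x] = v[col];      // one read of this v slice per threadblock
    __syncthreads();

    if (col < n && row < m)
        B[(size_t)row * n + col] = cuCmul(A[(size_t)row * n + col], vs[threadIdx.x]);
}

// Example launch:
// dim3 block(TX, TY);
// dim3 grid((n + TX - 1) / TX, (m + TY - 1) / TY);
// rowScaleTiled<<<grid, block>>>(A, v, B, m, n);
```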

Is there a general rule to follow when considering global memory read/writes vs the amount of parallelism?

For example, reading the vector once here uses minimum global memory access, but decreases parallelism since the vector is sequentially multiplied with each row. Reading the vector once per row increases the global memory access, but increases parallelism since the product with each row can be done concurrently.

Is there some kind of middle-ground here?

Divide the throughput you care about (e.g. FP32 flops, or FP64 flops, or …) by the memory bandwidth, for the GPU you are running on. The result is a flops/byte ratio. Most GPUs have a number that is noticeably larger than 1 here. If your actual flops/byte ratio (what your code does) is higher than the GPU ratio, then you can “afford” to do some extra memory traffic.
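For a rough feel, using approximate published A100 figures (about 19.5 TFLOP/s FP32, 9.7 TFLOP/s FP64, and roughly 1.5-2 TB/s of HBM bandwidth depending on the variant), the machine balance is on the order of 5-13 flops per byte. An elementwise complex multiply is about 6 flops against 48 bytes of double-complex traffic (two inputs and one output), i.e. roughly 0.125 flops per byte, so this kernel sits far below the machine balance.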

Your code is not likely to be in this category. Extra memory traffic will likely translate into longer kernel duration. My understanding is you have already empirically measured this and found it to be true. There is no “middle ground” to exploit unless you have more compute work to be done, on data you have already loaded.

