Advice on simple multiple matrix multiplications ....

grekop · April 4, 2010, 1:04pm

I am developing a program in which, at each step, a 3x3 matrix must be multiplied with 100 vectors. Right now it takes 0.1 ms on a CPU using the Eigen template library in C++ on a MacBook Pro. But I want to do this operation 1000 times for hypothesis generation… Can I make the 100 multiplications parallel? Should I follow the matrix multiplication code in the CUDA Programming pdf? Where should I look to learn how a thread (or a block of threads) can be fed with the results of another kernel thread… I mean, if I want to implement transpose(v1)A(v2), should I put the whole operation in one thread for every element?
I am a novice in CUDA. Any advice appreciated.
Thank you.

seibert · April 4, 2010, 3:01pm

The matrix multiplication example would not make sense here, as that is intended to distribute the multiplication of a very large matrix over all the stream processors. In your case, you are doing N very small multiplications, so it makes more sense to assign an entire 3x3 matrix multiplication to each thread.

Can you give a little more detail about the 100 vectors vs. the 1000 repetitions? Doing 100 3x3 matrix multiplications is still not very much work for a CUDA device, but 1000 * 100 multiplications would definitely fully utilize the device. Mostly I’m curious how data flows in the calculation, because that will tell you how best to split the calculation between blocks. (Since threads in the same block can communicate through fast shared memory on the multiprocessor, you usually want to group threads operating on the same data in the same block.)
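
As a rough illustration of the "one small multiplication per thread" idea above, here is a minimal sketch in plain C++ so it can be checked on the CPU. The names Vec3, Mat3, mul3x3 and mulAll, and the row-major layout, are assumptions made for this example and are not from the thread. In a CUDA kernel the loop over i would go away and each thread would run the loop body for its own index.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;
using Mat3 = std::array<float, 9>;  // row-major 3x3 (layout is an assumption)

// The work of one "thread": a complete 3x3 matrix times one 3-vector.
inline Vec3 mul3x3(const Mat3& m, const Vec3& v) {
    Vec3 r{};
    for (int row = 0; row < 3; ++row) {
        r[row] = m[3 * row + 0] * v[0]
               + m[3 * row + 1] * v[1]
               + m[3 * row + 2] * v[2];
    }
    return r;
}

// CPU stand-in for a kernel launch: every index i is independent of the
// others, so each iteration could be handed to its own thread.
std::vector<Vec3> mulAll(const Mat3& m, const std::vector<Vec3>& in) {
    std::vector<Vec3> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = mul3x3(m, in[i]);
    }
    return out;
}
```

Since no iteration reads another iteration's result, nothing needs to be exchanged between threads for this part of the computation.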

grekop · April 4, 2010, 3:15pm

First, I have to generate 500 hypotheses by randomly sampling some data. Each hypothesis generation may produce 4-10 solutions, so we have 500*(4 to 10) different models.
The simplest step is to choose among the 4-10 solutions of each hypothesis generation, which yields 500 model hypotheses.
To choose among the solutions, each of the 4-10 models is tested by summing an error over two sets of 100 vectors, let's say v1i and v2i.
So the operations are independent of each other.
The error for each model [3x3 matrix] is Σ[transpose(v1i)*E*v2i], i=1…100, where E is a 3x3 matrix. From each set of 4 to 10 models, the one with the smallest error is selected.
So we end up with 500 model estimations, i.e. 500 3x3 matrices.
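
To make the data flow concrete, here is a minimal CPU sketch of the scoring and selection step for one hypothesis, in plain C++. The names modelError and selectBest, the Vec3/Mat3 types and the row-major layout are assumptions made for this example. It compares the signed sum exactly as written above; if each term should instead be an absolute or squared residual, only the line that accumulates into sum would change.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;
using Mat3 = std::array<float, 9>;  // row-major 3x3 (layout is an assumption)

// Error of one candidate E: sum over i of transpose(v1_i) * E * v2_i.
float modelError(const Mat3& E,
                 const std::vector<Vec3>& v1,
                 const std::vector<Vec3>& v2) {
    assert(v1.size() == v2.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < v1.size(); ++i) {
        for (int r = 0; r < 3; ++r) {
            // Row r of E times v2_i, then weighted by component r of v1_i.
            const float Ev2 = E[3 * r + 0] * v2[i][0]
                            + E[3 * r + 1] * v2[i][1]
                            + E[3 * r + 2] * v2[i][2];
            sum += v1[i][r] * Ev2;
        }
    }
    return sum;
}

// Index of the candidate (one of the 4-10 solutions) with the smallest error.
std::size_t selectBest(const std::vector<Mat3>& candidates,
                       const std::vector<Vec3>& v1,
                       const std::vector<Vec3>& v2) {
    assert(!candidates.empty());
    std::size_t best = 0;
    float bestErr = modelError(candidates[0], v1, v2);
    for (std::size_t k = 1; k < candidates.size(); ++k) {
        const float err = modelError(candidates[k], v1, v2);
        if (err < bestErr) {
            bestErr = err;
            best = k;
        }
    }
    return best;
}
```

Each call to modelError touches only its own candidate matrix plus the shared v1/v2 data, so the 500*(4 to 10) evaluations are independent and only the final comparison within each group of candidates needs their results together.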
