Advice on many simple matrix multiplications

I am developing a program in which a 3x3 matrix must be multiplied with 100 vectors at each step. Right now this takes 0.1 ms on the CPU, using the Eigen template library in C++ on a MacBook Pro. But I want to do this operation 1000 times for hypothesis generation. Can I run the 100 multiplications in parallel? Should I follow the matrix multiplication code in the CUDA Programming PDF? And where should I look to learn how a thread (or a block of threads) can be fed the results of another kernel thread? For example, if I want to implement transpose(v1)*A*v2, should I put the whole operation in one thread for every element?
I am a novice in CUDA. Any advice is appreciated.
Thank you.

The matrix multiplication example would not make sense here, as that is intended to distribute the multiplication of a very large matrix over all the stream processors. In your case, you are doing N very small multiplications, so it makes more sense to assign an entire 3x3 matrix multiplication to each thread.
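
To make "one whole multiplication per thread" concrete, here is a rough sketch in plain C++ rather than CUDA. The names Vec3, Mat3, mul3x3 and mulAll are made up for illustration, and the row-major storage is an assumption. The body of mul3x3 is the work a single thread would do; the loop in mulAll only stands in for the kernel launch, with i playing the role of the thread index.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for this sketch (not from Eigen or CUDA).
using Vec3 = std::array<double, 3>;
using Mat3 = std::array<double, 9>;  // assumed row-major 3x3 storage

// Work for ONE item: a complete 3x3 matrix-vector product.
// In a CUDA port, this is what each thread would compute on its own.
inline Vec3 mul3x3(const Mat3& A, const Vec3& v) {
    Vec3 r{};
    for (int row = 0; row < 3; ++row) {
        r[row] = A[3 * row + 0] * v[0]
               + A[3 * row + 1] * v[1]
               + A[3 * row + 2] * v[2];
    }
    return r;
}

// Sequential stand-in for the kernel launch: the index i corresponds to
// the global thread index, and the iterations do not depend on each other.
void mulAll(const Mat3& A, const std::vector<Vec3>& in, std::vector<Vec3>& out) {
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = mul3x3(A, in[i]);
    }
}
```

Since no iteration reads another iteration's result, the loop can be handed out one index per thread without any synchronization.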

Can you give a little more detail about the 100 vectors vs. the 1000 repetitions? Doing 100 3x3 matrix multiplications is still not very much work for a CUDA device, but 1000 * 100 multiplications would definitely fully utilize the device. Mostly I’m curious how data flows in the calculation, because that will tell you how best to split the calculation between blocks. (Since threads in the same block can communicate through fast shared memory on the multiprocessor, you usually want to group threads operating on the same data in the same block.)

First I have to generate 500 hypotheses by randomly sampling some data. Each hypothesis generation may produce 4-10 solutions, so we have 500*(4 to 10) different models.
The simplest step is to choose among the 4-10 solutions from each generation, which yields 500 model hypotheses.
To choose among the solutions and derive the 500 models, each of the 4-10 candidate models is tested by summing an error over two sets of 100 vectors, say v1i and v2i.
So the operations are independent of each other.
The error for each model (a 3x3 matrix E) is Σ[transpose(v1i)*E*v2i], i=1…100. From each set of 4 to 10 models, the one with the smallest error is selected.
So we end up with 500 model estimates: 500 3x3 matrices.
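
In code, the selection step for one hypothesis could look roughly like the plain C++ sketch below. The names bilinear, modelError and selectBest are made up for illustration, and the row-major storage is an assumption. One further assumption: the sketch accumulates the absolute value of each term so that terms of opposite sign cannot cancel; dropping the std::abs gives the literal sum written above.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types for this sketch.
using Vec3 = std::array<double, 3>;
using Mat3 = std::array<double, 9>;  // assumed row-major 3x3 storage

// transpose(v1) * E * v2 for a single pair of vectors.
inline double bilinear(const Vec3& v1, const Mat3& E, const Vec3& v2) {
    double s = 0.0;
    for (int r = 0; r < 3; ++r) {
        s += v1[r] * (E[3 * r + 0] * v2[0]
                    + E[3 * r + 1] * v2[1]
                    + E[3 * r + 2] * v2[2]);
    }
    return s;
}

// Error of one candidate model over all vector pairs.
// ASSUMPTION: each term is taken in absolute value before summing.
double modelError(const Mat3& E, const std::vector<Vec3>& v1,
                  const std::vector<Vec3>& v2) {
    assert(v1.size() == v2.size());
    double err = 0.0;
    for (std::size_t i = 0; i < v1.size(); ++i) {
        err += std::abs(bilinear(v1[i], E, v2[i]));
    }
    return err;
}

// Index of the candidate (out of the 4-10) with the smallest error.
std::size_t selectBest(const std::vector<Mat3>& candidates,
                       const std::vector<Vec3>& v1,
                       const std::vector<Vec3>& v2) {
    assert(!candidates.empty());
    std::size_t best = 0;
    double bestErr = modelError(candidates[0], v1, v2);
    for (std::size_t k = 1; k < candidates.size(); ++k) {
        const double e = modelError(candidates[k], v1, v2);
        if (e < bestErr) {
            bestErr = e;
            best = k;
        }
    }
    return best;
}
```

Calling selectBest once per hypothesis gives the 500 selected matrices. The 500 calls are independent, and so are the modelError evaluations inside each call.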