Multiple small matrix multiplication program structure

Someone must have had the same question as I do, so I apologize if this topic is a repeat.

I’m writing a CUDA program that will perform many multiplications of small matrices and vectors.

Each operation will have the following structure.

(10 x 100) x (100 x 100) x (100 x 10) x (10 x 100) x (100 x 100) x (100)

where (10 x 100) is a matrix [10][100] and (100) is a 100-element vector.

I will have to perform around 50,000 such operations, so I’m trying to choose the right structure for the program.

I was thinking of either

having 1 block doing 1 operation (launch 1 kernel with 50,000 blocks)

Updated:

The biggest problem is that at each stage of this operation I will need a different number of threads:

1: (10 x 100) x (100 x 10) = (10 x 10)

so I’ll need only 10x10 threads here

2: (10 x 10) x (10 x 100) = (10 x 100)

100x10 threads are needed here

3: (10 x 100) x (100)

only 10 threads are needed here.

I can’t figure out how many threads I should allocate. If I allocate 100x10, then most of them will sit idle while a few do the actual work.

Alternatively, if I allocate, say, 10x10 threads, there isn’t much parallelism and I’ll have huge problems with coalescing.

or

having 1 kernel perform 1 operation (launch 50,000 concurrent kernels and possibly save some time with asynchronous data copies)

This topic suggests that I can benefit from concurrent kernels if my block size is small (which is the case for me).
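For what it’s worth, option 2 might look something like the following host-side sketch. Everything named here is a placeholder, not something established in this thread: opKernel, the per-operation buffer sizes, and the stream count are all assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-operation kernel (defined elsewhere): one tiny launch
// computes one complete chain of multiplications.
__global__ void opKernel(const float *in, float *out);

void launchPerOperation(const float *dIn, float *dOut, int nOps)
{
    const int kStreams = 4;          // roughly the Fermi-era concurrency limit
    const int kInSize  = 100 * 100;  // placeholder per-operation input size
    const int kOutSize = 10;         // 10-element result per operation

    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Round-robin the 50,000 tiny launches over the streams so they
    // have a chance to overlap.
    for (int op = 0; op < nOps; ++op)
        opKernel<<<1, dim3(10, 10), 0, streams[op % kStreams]>>>(
            dIn + op * kInSize, dOut + op * kOutSize);

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Even in the best case, the launch overhead of 50,000 one-block kernels is paid serially, which is the main argument against this layout.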

What do you think is best for me?

I think the former case is best. You can only launch concurrent kernels on Fermi hardware, and in the current CUDA release, you can only have 4 concurrent kernels at a time. Launching 50000 kernels is likely to be inefficient.

Thanks. Are there any good implementations for matrix multiplication within 1 block? What is the best way to do so?

I took a similar approach when I wanted to do hundreds or thousands of small QR decompositions in parallel.

If I have time, I plan on writing something similar to what you are doing. Let me know how it goes :)

Sure will. How did it work for the QR-decomposition? Did you get much speedup? If so, can you share some code?

You can use the 10x10 block and a 10-iteration for loop for stage 2. If stage 3 is short, then idling most of the threads probably won’t be a big problem.
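A minimal sketch of that suggestion, assuming one block per operation, a fixed 10x10 thread block, row-major float storage, and made-up array names (all assumptions on my part):

```cuda
#include <cuda_runtime.h>

// One block per operation; 10x10 = 100 threads; intermediates kept in
// shared memory. The three stages follow the breakdown in the question:
//   T1  = A(10x100) * B(100x10)  -> one element per thread
//   T2  = T1(10x10) * C(10x100)  -> ten elements per thread (for loop)
//   out = T2(10x100) * v(100)    -> only 10 threads active
__global__ void chain3(const float *A, const float *B, const float *C,
                       const float *v, float *out)
{
    __shared__ float T1[10][10];
    __shared__ float T2[10][100];

    const int op = blockIdx.x;            // which of the 50,000 operations
    const float *a = A + op * 10 * 100;
    const float *b = B + op * 100 * 10;
    const float *c = C + op * 10 * 100;
    const float *x = v + op * 100;

    const int r = threadIdx.y;            // 0..9
    const int s = threadIdx.x;            // 0..9

    // Stage 1: each thread computes one element of the 10x10 product.
    float acc = 0.0f;
    for (int k = 0; k < 100; ++k)
        acc += a[r * 100 + k] * b[k * 10 + s];
    T1[r][s] = acc;
    __syncthreads();

    // Stage 2: the result is 10x100, so each thread loops over 10 columns.
    for (int j = s; j < 100; j += 10) {
        float t = 0.0f;
        for (int k = 0; k < 10; ++k)
            t += T1[r][k] * c[k * 100 + j];
        T2[r][j] = t;
    }
    __syncthreads();

    // Stage 3: only 10 threads do useful work; the rest idle briefly.
    if (s == 0) {
        float t = 0.0f;
        for (int k = 0; k < 100; ++k)
            t += T2[r][k] * x[k];
        out[op * 10 + r] = t;
    }
}

// Launch: one block per operation.
// chain3<<<50000, dim3(10, 10)>>>(dA, dB, dC, dV, dOut);
```

The shared-memory footprint (100 + 1000 floats, about 4.4 KB per block) leaves room for several blocks per multiprocessor.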

Yeah, sure, the speedup is massive, but I’m using all kinds of black magic; I will release a paper on it. :)

As for your threading issue, I would use a small number of threads (a multiple of 32) and try to coalesce along the rows as much as possible. Sometimes it can be worthwhile to transpose some of the matrices to achieve good coalescing in the required manner. Branching within your warps might be hard to avoid.

Good luck!
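One way to picture that advice is to decouple the loading from the arithmetic: use a 1D block that is a multiple of 32 purely to stage the matrix into shared memory with coalesced reads, then let however many threads the math needs do the computing. This is a hedged sketch; the row-sum stage is just a stand-in for a real computation.

```cuda
#include <cuda_runtime.h>

// Stage a 10x100 row-major matrix into shared memory with a 128-thread
// block, then compute with only the 10 threads the math actually needs.
__global__ void stageThenCompute(const float *A, float *out)
{
    __shared__ float sA[10 * 100];
    const float *a = A + blockIdx.x * 10 * 100;

    // 128 consecutive threads read consecutive floats: fully coalesced.
    for (int i = threadIdx.x; i < 10 * 100; i += blockDim.x)
        sA[i] = a[i];
    __syncthreads();

    // Only 10 threads compute; idling the rest here is cheap compared
    // with uncoalesced global loads.
    if (threadIdx.x < 10) {
        float s = 0.0f;
        for (int k = 0; k < 100; ++k)
            s += sA[threadIdx.x * 100 + k];
        out[blockIdx.x * 10 + threadIdx.x] = s;
    }
}

// Launch: stageThenCompute<<<nOps, 128>>>(dA, dOut);
```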

The problem is that in most cases I don’t get past 10 threads, so I can’t use any multiple of 32.

Another possibility I’m looking at is calling a separate kernel for each computation:

The first kernel will perform (10 x 100) x (100 x 10) and use a 10x10 block.

Then I’ll synchronize and launch the next kernel, which will do (10 x 10) x (10 x 100) with a 10x100 block.

The last kernel will do (10 x 100) x (100) with only 10 threads.

Pros: No threads are wasted.

Cons: I want to use shared memory for all multiplications, so I’ll have to copy to and from shared memory each time I run the new kernel.

The question is: will the copy time (global->shared->global) cancel out the time I save by not wasting threads?
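The three-kernel variant might be driven from the host roughly like this. The stage kernels and temporary buffers are hypothetical; one detail worth noting is that kernel launches on the same stream are already ordered, so no explicit synchronization is needed between the stages.

```cuda
#include <cuda_runtime.h>

// Hypothetical batched stage kernels (defined elsewhere); each one
// processes all nOps operations with its own natural block shape.
__global__ void stage1(const float *A, const float *B, float *T1);
__global__ void stage2(const float *T1, const float *C, float *T2);
__global__ void stage3(const float *T2, const float *v, float *out);

void runChain(const float *dA, const float *dB, const float *dC,
              const float *dV, float *dT1, float *dT2, float *dOut,
              int nOps)
{
    // Stage 1: (10x100) x (100x10) -> (10x10), one 10x10 block per op.
    stage1<<<nOps, dim3(10, 10)>>>(dA, dB, dT1);

    // Launches on the default stream are implicitly ordered, so stage 2
    // will not start until stage 1 has finished.
    stage2<<<nOps, dim3(100, 10)>>>(dT1, dC, dT2);  // (10x10) x (10x100)
    stage3<<<nOps, 10>>>(dT2, dV, dOut);            // (10x100) x (100)
}
```

The cost of this layout is exactly the one raised in the question: T1 and T2 must live in global memory between launches, so each stage pays the global->shared->global round trip.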

You had one case where you felt you needed 10 threads and 2 cases with 100-1000 threads?

Yes, I think it most definitely will.
