Multiple small matrix multiplication program structure

Someone must have had the same question as I do, so I apologize if this topic is a repeat.

I’m writing a CUDA program that will perform many multiplications of small matrices and vectors.

Each operation will have the following structure.

(10 x 100) x (100 x 100) x (100 x 10) x (10 x 100) x (100 x 100) x (100)

where (10 x 100) is a matrix [10][100] and (100) is a 100-element vector.

I will have to perform around 50,000 such operations, so I’m trying to choose the right structure for the program.

I was thinking of either

having 1 block doing 1 operation (launch 1 kernel with 50,000 blocks)

Updated:

The biggest problem is that at each stage of this operation I will need a different number of threads:

1: (10 x 100) x (100 x 10) = (10 x 10)

so I’ll need only 10x10 threads here

2: (10 x 10) x (10 x 100) = (10 x 100)

100x10 threads are needed here

3: (10 x 100) x (100)

only 10 threads are needed here.

I can’t figure out how many threads I should allocate. If I allocate 100x10, then most of them will sit idle while a few do the actual work.

Alternatively, if I allocate, say, 10x10 threads, there isn’t much parallelism and I’ll have huge problems with coalescing.

or

having 1 kernel perform 1 operation (launch 50,000 concurrent kernels and possibly save some time with asynchronous data copies)

This topic suggests that I can benefit from concurrent kernels if my block size is small (which is the case for me).
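For what it’s worth, option 2 might look something like the following host-side sketch. Everything named here is a placeholder, not something established in this thread: opKernel, the per-operation buffer sizes, and the stream count are all assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-operation kernel (defined elsewhere): one tiny launch
// computes one complete chain of multiplications.
__global__ void opKernel(const float *in, float *out);

void launchPerOperation(const float *dIn, float *dOut, int nOps)
{
    const int kStreams = 4;          // roughly the Fermi-era concurrency limit
    const int kInSize  = 100 * 100;  // placeholder per-operation input size
    const int kOutSize = 10;         // 10-element result per operation

    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Round-robin the 50,000 tiny launches over the streams so they
    // have a chance to overlap.
    for (int op = 0; op < nOps; ++op)
        opKernel<<<1, dim3(10, 10), 0, streams[op % kStreams]>>>(
            dIn + op * kInSize, dOut + op * kOutSize);

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Even in the best case, the launch overhead of 50,000 one-block kernels is paid serially, which is the main argument against this layout.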

What do you think is best for me?

I think the former case is best. You can only launch concurrent kernels on Fermi hardware, and in the current CUDA release, you can only have 4 concurrent kernels at a time. Launching 50000 kernels is likely to be inefficient.

Thanks. Are there any good implementations for matrix multiplication within 1 block? What is the best way to do so?

I took a similar approach when I wanted to do hundreds or thousands of small QR decompositions in parallel.

If I have time, I plan on writing something similar to what you are doing. Let me know how it goes :)

Sure will. How did it work for the QR-decomposition? Did you get much speedup? If so, can you share some code?

You can use the 10x10 block and a 10-iteration for loop for stage 2. If stage 3 is short, then idling most of the threads probably won’t be a big problem.
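A minimal sketch of that suggestion, assuming one block per operation, a fixed 10x10 thread block, row-major float storage, and made-up array names (all assumptions on my part):

```cuda
#include <cuda_runtime.h>

// One block per operation; 10x10 = 100 threads; intermediates kept in
// shared memory. The three stages follow the breakdown in the question:
//   T1  = A(10x100) * B(100x10)  -> one element per thread
//   T2  = T1(10x10) * C(10x100)  -> ten elements per thread (for loop)
//   out = T2(10x100) * v(100)    -> only 10 threads active
__global__ void chain3(const float *A, const float *B, const float *C,
                       const float *v, float *out)
{
    __shared__ float T1[10][10];
    __shared__ float T2[10][100];

    const int op = blockIdx.x;            // which of the 50,000 operations
    const float *a = A + op * 10 * 100;
    const float *b = B + op * 100 * 10;
    const float *c = C + op * 10 * 100;
    const float *x = v + op * 100;

    const int r = threadIdx.y;            // 0..9
    const int s = threadIdx.x;            // 0..9

    // Stage 1: each thread computes one element of the 10x10 product.
    float acc = 0.0f;
    for (int k = 0; k < 100; ++k)
        acc += a[r * 100 + k] * b[k * 10 + s];
    T1[r][s] = acc;
    __syncthreads();

    // Stage 2: the result is 10x100, so each thread loops over 10 columns.
    for (int j = s; j < 100; j += 10) {
        float t = 0.0f;
        for (int k = 0; k < 10; ++k)
            t += T1[r][k] * c[k * 100 + j];
        T2[r][j] = t;
    }
    __syncthreads();

    // Stage 3: only 10 threads do useful work; the rest idle briefly.
    if (s == 0) {
        float t = 0.0f;
        for (int k = 0; k < 100; ++k)
            t += T2[r][k] * x[k];
        out[op * 10 + r] = t;
    }
}

// Launch: one block per operation.
// chain3<<<50000, dim3(10, 10)>>>(dA, dB, dC, dV, dOut);
```

The shared-memory footprint (100 + 1000 floats, about 4.4 KB per block) leaves room for several blocks per multiprocessor.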

Yeah, sure, the speedup is massive, but I’m using all kinds of black magic; I will release a paper on it. :)

As for your threading issue, I would use a small number of threads (a multiple of 32) and try to coalesce along the rows as much as possible. Sometimes it can be worthwhile to transpose some of the matrices to achieve good coalescing in the required manner. Branching within your warps might be hard to avoid.

Good luck!
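One way to picture that advice is to decouple the loading from the arithmetic: use a 1D block that is a multiple of 32 purely to stage the matrix into shared memory with coalesced reads, then let however many threads the math needs do the computing. This is a hedged sketch; the row-sum stage is just a stand-in for a real computation.

```cuda
#include <cuda_runtime.h>

// Stage a 10x100 row-major matrix into shared memory with a 128-thread
// block, then compute with only the 10 threads the math actually needs.
__global__ void stageThenCompute(const float *A, float *out)
{
    __shared__ float sA[10 * 100];
    const float *a = A + blockIdx.x * 10 * 100;

    // 128 consecutive threads read consecutive floats: fully coalesced.
    for (int i = threadIdx.x; i < 10 * 100; i += blockDim.x)
        sA[i] = a[i];
    __syncthreads();

    // Only 10 threads compute; idling the rest here is cheap compared
    // with uncoalesced global loads.
    if (threadIdx.x < 10) {
        float s = 0.0f;
        for (int k = 0; k < 100; ++k)
            s += sA[threadIdx.x * 100 + k];
        out[blockIdx.x * 10 + threadIdx.x] = s;
    }
}

// Launch: stageThenCompute<<<nOps, 128>>>(dA, dOut);
```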

The problem is that in most cases I don’t get past 10 threads, so I can’t use any multiple of 32.

Another possibility I’m looking at is calling a separate kernel for each computation:

The first kernel will perform (10 x 100) x (100 x 10) and use a 10x10 block.

Then I’ll synchronize and launch the next kernel, which will do (10 x 10) x (10 x 100) with a 10x100 block.

The last kernel will do (10 x 100) x (100) with only 10 threads.

Pros: No threads are wasted.

Cons: I want to use shared memory for all multiplications, so I’ll have to copy to and from shared memory each time I run the new kernel.

The question is: will the copy time (global->shared->global) cancel out the time I save by not wasting threads?
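The three-kernel variant might be driven from the host roughly like this. The stage kernels and temporary buffers are hypothetical; one detail worth noting is that kernel launches on the same stream are already ordered, so no explicit synchronization is needed between the stages.

```cuda
#include <cuda_runtime.h>

// Hypothetical batched stage kernels (defined elsewhere); each one
// processes all nOps operations with its own natural block shape.
__global__ void stage1(const float *A, const float *B, float *T1);
__global__ void stage2(const float *T1, const float *C, float *T2);
__global__ void stage3(const float *T2, const float *v, float *out);

void runChain(const float *dA, const float *dB, const float *dC,
              const float *dV, float *dT1, float *dT2, float *dOut,
              int nOps)
{
    // Stage 1: (10x100) x (100x10) -> (10x10), one 10x10 block per op.
    stage1<<<nOps, dim3(10, 10)>>>(dA, dB, dT1);

    // Launches on the default stream are implicitly ordered, so stage 2
    // will not start until stage 1 has finished.
    stage2<<<nOps, dim3(100, 10)>>>(dT1, dC, dT2);  // (10x10) x (10x100)
    stage3<<<nOps, 10>>>(dT2, dV, dOut);            // (10x100) x (100)
}
```

The cost of this layout is exactly the one raised in the question: T1 and T2 must live in global memory between launches, so each stage pays the global->shared->global round trip.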

You had one case where you felt you needed 10 threads and 2 cases with 100-1000 threads?

Yes, I think it most definitely will.
