Concurrent Kernel Execution


Is it possible to run multiple kernels in parallel in CUDA? I am running a matrix multiplication kernel for n matrices, where n can be any number, e.g. 100000.

M_n x M_(n-1) x … x M_3 x M_2 x M_1

If I run this kernel sequentially, the performance of the program is much worse than a CPU-based solution.

I want to run the matrix multiplication kernel in parallel for different matrices, e.g.
M_2 x M_1, M_4 x M_3, M_6 x M_5, …

Or I want to run this multiplication of matrices in blocks, so that each multiprocessor executes n/mp matrix multiplications in parallel.

Does CUDA provide such functionality? I am using CUDA 2.3 and GeForce 8800 GT.

Best Regards,

Concurrent kernel execution is not supported on your hardware. What are the dimensions of each matrix?

Thanks for the reply.

Matrix dimensions are 8 x 8. Can you please tell me which hardware and which CUDA toolkit supports this?

I have another question. Can I control the flow of concurrent kernel execution or concurrent thread execution so that a particular thread or kernel executes on a particular multiprocessor?

Only on Fermi hardware. But it doesn’t provide any user control over how parallel execution occurs. All that happens is that if there are sufficient resources, kernels launched into different streams can execute at the same time.
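To illustrate the stream mechanism, here is a minimal sketch of launching independent 8x8 products into several streams. The kernel name, grid sizes, and layout (64 floats per matrix, matrices packed contiguously) are assumptions for illustration; only kernels launched into different non-default streams are candidates for overlap, and only on hardware that supports it.

```cuda
#include <cuda_runtime.h>

// Assumed kernel: one block of 64 threads computes one 8x8 product.
__global__ void matmul8x8(const float *a, const float *b, float *c);

void launch_concurrent(const float *a, const float *b, float *c, int nPairs)
{
    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    // Round-robin the independent products over the streams.
    // Each 8x8 matrix occupies 64 consecutive floats.
    for (int i = 0; i < nPairs; ++i)
        matmul8x8<<<1, 64, 0, streams[i % NSTREAMS]>>>(
            a + i * 64, b + i * 64, c + i * 64);

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Note that with a single block of 64 threads per launch, each kernel uses almost no resources, so the real cost here is launch overhead; in practice you would batch many products into one kernel rather than rely on concurrent kernels.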

Is there any limit on number of kernels launched in parallel?

Yes, 16 in the current CUDA 4.0 implementation. But in practice, getting the conditions where 16 would run simultaneously would be very, very hard to achieve.

You need to think about a different approach. The basic product is associative, so there should be plenty of scope to perform the operation in several “passes”. Even so, if I am not misunderstanding what you are trying to do, there aren’t a lot of FLOPs in 100000 8x8 matrix products, so you shouldn’t expect miracles in the GPU speed-up achievable.
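The multi-pass idea can be sketched as a host-side reduction loop: each pass multiplies adjacent pairs, halving the number of matrices until one remains. `batchMultiply` is an assumed batched launcher (not a CUDA API), and the 64-floats-per-matrix layout is an assumption; pair order is kept because matrix multiplication is not commutative.

```cuda
#include <cuda_runtime.h>

// Assumed batched launcher: out[i] = in[2*i + 1] * in[2*i] for i in [0, pairs).
void batchMultiply(const float *d_in, float *d_out, int pairs);

// Reduce a chain of n 8x8 matrices (64 floats each) to a single product.
void chain_product(float *d_in, float *d_tmp, int n)
{
    while (n > 1) {
        int pairs = n / 2;
        batchMultiply(d_in, d_tmp, pairs);
        // An odd leftover matrix is carried to the next pass unchanged.
        if (n % 2)
            cudaMemcpy(d_tmp + pairs * 64, d_in + (n - 1) * 64,
                       64 * sizeof(float), cudaMemcpyDeviceToDevice);
        float *t = d_in; d_in = d_tmp; d_tmp = t;  // swap buffers
        n = pairs + (n % 2);
    }
    // The final product is now in the first 64 floats of d_in.
}
```

With n = 100000 this takes about log2(n) ≈ 17 passes, and every pass is a single batched launch with plenty of blocks to fill the machine.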

As avidday already mentioned, this task is most likely memory bandwidth bound. And as the 8800 GT is averse to non-coalesced memory access, optimizing memory access patterns is key.

Use shared memory to load the two matrices you want to multiply in fully coalesced memory reads. Then, instead of going through global memory, write the result to shared memory and reuse it for another multiplication straight away.
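A minimal sketch of that approach, assuming one 64-thread block multiplies a contiguous run of 8x8 matrices (64 floats each, row-major): the running product stays in shared memory, so only the input matrices are read from global memory, in coalesced 64-float loads. Kernel name and layout are assumptions.

```cuda
// One block of 64 threads reduces `count` 8x8 matrices to their product,
// applying each new matrix on the left: acc = mats[m] * acc.
__global__ void chainMultiply8x8(const float *mats, float *out, int count)
{
    __shared__ float acc[64];   // running product, kept in shared memory
    __shared__ float next[64];  // next input matrix
    __shared__ float tmp[64];   // scratch for the new product

    int t = threadIdx.x;        // 0..63, one matrix element per thread
    int row = t / 8, col = t % 8;

    acc[t] = mats[t];           // coalesced load of the first matrix
    __syncthreads();

    for (int m = 1; m < count; ++m) {
        next[t] = mats[m * 64 + t];   // coalesced load
        __syncthreads();
        float s = 0.0f;
        for (int k = 0; k < 8; ++k)
            s += next[row * 8 + k] * acc[k * 8 + col];
        tmp[t] = s;
        __syncthreads();
        acc[t] = tmp[t];              // result never touches global memory
        __syncthreads();
    }
    out[t] = acc[t];                  // single coalesced store at the end
}
```

Splitting the 100000-matrix chain into, say, a few thousand such runs (one per block) and then combining the per-run results in a second pass keeps all intermediate products out of global memory entirely.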