Concurrent Kernel Execution

mkashifhanif · April 18, 2011, 10:08am

Hi,

Is it possible to run multiple kernels in parallel in CUDA? I am running a matrix multiplication kernel over a chain of n matrices, where n can be large, e.g. 100000:

M_n x M_(n-1) x … x M_3 x M_2 x M_1.

If I run this kernel sequentially, one product at a time, the program performs much worse than a CPU-based solution.

I want to run the matrix multiplication kernel in parallel on different pairs of matrices, e.g.
M_2 x M_1, M_3 x M_4, M_5 x M_6, …

Alternatively, I want to split the work into blocks, so that each multiprocessor executes n/mp of the matrix multiplications in parallel with the others (mp being the number of multiprocessors).

Does CUDA provide such functionality? I am using CUDA 2.3 and GeForce 8800 GT.

Best Regards,
Kashif

avidday · April 18, 2011, 10:24am

Concurrent kernel execution is not supported on your hardware. What are the dimensions of each matrix?

mkashifhanif · April 18, 2011, 10:35am

Thanks for the reply.

The matrix dimensions are 8 x 8. Can you please tell me which hardware and which CUDA toolkit support this?

I have another question. Can I control the flow of concurrent kernel execution or concurrent thread execution, so that a particular thread or kernel executes on a particular multiprocessor?

avidday · April 18, 2011, 10:46am

Only on Fermi hardware. But it doesn’t provide any user control over how parallel execution occurs. All that happens is that if there are sufficient resources, kernels launched into different streams can execute at the same time.

mkashifhanif · April 18, 2011, 10:51am

Is there any limit on the number of kernels that can run in parallel?

avidday · April 18, 2011, 11:02am

Yes, 16 in the current CUDA 4.0 implementation. But in practice, getting the conditions where 16 would run simultaneously would be very, very hard to achieve.

You need to think about a different approach. The matrix product is associative, so there should be plenty of scope to perform the operation in several “passes”. Even so, if I am not misunderstanding what you are trying to do, there aren’t a lot of FLOPs in 100000 8x8 matrix products, so you shouldn’t expect miracles from the achievable GPU speedup.
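To make the “passes” idea concrete, here is a minimal host-side C++ sketch, not a CUDA kernel and not code from this thread. It assumes the chain is stored in left-to-right multiplication order; the names `Mat`, `multiply`, and `chain_product_in_passes` are made up for illustration. Each pass multiplies adjacent pairs, and because the pairs within a pass are independent, a pass is the unit of work that could plausibly be handed to a single kernel launch. Pair order is preserved because the matrix product is associative but not commutative. For scale: one 8x8 product is 8 x 8 x 8 = 512 multiply-adds, so 100000 products is only about 5 x 10^7 multiply-adds.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t N = 8;                       // matrix dimension from the thread
using Mat = std::array<std::array<double, N>, N>;  // hypothetical type name

// Plain 8x8 product: returns a * b.
Mat multiply(const Mat& a, const Mat& b) {
    Mat c{};                                       // value-initialised to all zeros
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Computes chain[0] * chain[1] * ... * chain[n-1] in pairwise passes.
// Every product inside one pass is independent of the others in that pass.
// If `passes` is non-null, it receives the number of passes performed.
Mat chain_product_in_passes(std::vector<Mat> chain, std::size_t* passes = nullptr) {
    assert(!chain.empty());
    std::size_t count = 0;
    while (chain.size() > 1) {
        std::vector<Mat> next;
        next.reserve((chain.size() + 1) / 2);
        for (std::size_t i = 0; i + 1 < chain.size(); i += 2)
            next.push_back(multiply(chain[i], chain[i + 1]));  // left operand stays on the left
        if (chain.size() % 2 == 1)
            next.push_back(chain.back());          // odd matrix out is carried to the next pass
        chain = std::move(next);
        ++count;
    }
    if (passes != nullptr) *passes = count;
    return chain.front();
}
```

With this halving scheme a chain of 100000 matrices collapses in 17 passes, the first of which contains 50000 independent products. Whether that beats a CPU still depends on how the matrices are moved through memory.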

tera · April 18, 2011, 12:07pm

As avidday already mentioned, this task is most likely memory bandwidth bound. And as the 8800 GT is averse to non-coalesced memory access, optimizing memory access patterns is key.

Use shared memory to load the two matrices you want to multiply in fully coalesced memory reads. Then, instead of going through global memory, write the result to shared memory and reuse it for another multiplication straight away.
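The reuse pattern described above can be sketched on the host in C++. This is a CPU stand-in under stated assumptions, not a CUDA kernel: the local accumulator `acc` plays the role that shared memory would play on the GPU, each contiguous segment of the chain stands in for the work of one thread block, and the names `segment_product` and `chain_product_by_segments` are hypothetical. The point it illustrates is that each input matrix is read once and only one result per segment is written back, instead of one result per multiplication.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 8;                       // matrix dimension from the thread
using Mat = std::array<std::array<double, N>, N>;  // hypothetical type name

// Plain 8x8 product: returns a * b.
Mat multiply(const Mat& a, const Mat& b) {
    Mat c{};                                       // value-initialised to all zeros
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Product of chain[begin] * ... * chain[end - 1], kept in one local accumulator.
// `acc` is the stand-in for a result held in shared memory between multiplications.
Mat segment_product(const std::vector<Mat>& chain, std::size_t begin, std::size_t end) {
    assert(begin < end && end <= chain.size());
    Mat acc = chain[begin];
    for (std::size_t i = begin + 1; i < end; ++i)
        acc = multiply(acc, chain[i]);             // intermediate result never leaves acc
    return acc;                                    // one write-back per segment
}

// Splits the chain into contiguous segments (one per imagined thread block),
// reduces each segment locally, then multiplies the partial results in order.
Mat chain_product_by_segments(const std::vector<Mat>& chain, std::size_t num_segments) {
    assert(!chain.empty() && num_segments > 0);
    num_segments = std::min(num_segments, chain.size());
    const std::size_t base = chain.size() / num_segments;
    const std::size_t extra = chain.size() % num_segments;
    std::vector<Mat> partial;
    partial.reserve(num_segments);
    std::size_t begin = 0;
    for (std::size_t s = 0; s < num_segments; ++s) {
        const std::size_t len = base + (s < extra ? 1 : 0);  // spread the remainder evenly
        partial.push_back(segment_product(chain, begin, begin + len));
        begin += len;
    }
    return segment_product(partial, 0, partial.size());
}
```

Choosing the segment count close to a small multiple of the multiprocessor count would match the n/mp split asked about in the first post, though the best value on real hardware is something to measure rather than assume.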
