Number of concurrent kernel executions on GTX480

Nick_S · March 27, 2010, 8:14am

One of the new CUDA features of the GTX4xx family is concurrent kernel execution. I have gone through several of the reviews and the official documentation, but I cannot find any numbers. Can anybody tell me how many concurrent kernel executions the GTX480 and GTX470 support? Thanks.

TrekCZ · March 27, 2010, 10:42am

If I understand correctly, it can run 4 kernels simultaneously (hopefully good for PhysX games).
Those reviews are usually game reviews.

Nick_S · March 27, 2010, 2:22pm

Thanks for the reply. Do you have a source for that?

SPWorley · March 27, 2010, 4:53pm

The 3.0 programming guide says “at most four”. It’s likely 2 or 1 for the follow-on lower-end parts with fewer SMs.

There’s a severe restriction on concurrent kernels: they must all be from the same context! So if you have two CUDA apps, for example, you couldn’t run them both at once on the same GPU. The OS display also cannot run concurrently with kernels, so there’s still a kernel watchdog timer.

tmurray · March 27, 2010, 10:12pm

In 3.0, it’s limited to four. In a future software release, it will be higher.

Manuel_M · June 14, 2010, 3:12pm

But is that four concurrent kernels per multiprocessor? Or is it four concurrent kernels in the whole GPU?

seibert · June 14, 2010, 3:52pm

For the whole GPU. CUDA 3.1 beta ups this limit to 16.
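As far as I know, the runtime does not report the numeric limit itself; what it does expose is a flag saying whether the device supports concurrent kernels at all. Here is a minimal sketch of that check, assuming device 0 and using only `cudaGetDeviceProperties`. The helper name `supports_concurrent_kernels` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper (not a CUDA API): returns 1 if the device reports
// concurrent kernel support, 0 if it does not, -1 if the query fails.
static int supports_concurrent_kernels(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    printf("%s (compute capability %d.%d): concurrentKernels = %d\n",
           prop.name, prop.major, prop.minor, prop.concurrentKernels);
    return prop.concurrentKernels ? 1 : 0;
}

int main()
{
    // Assumption: device 0 is the card of interest.
    return supports_concurrent_kernels(0) < 0 ? 1 : 0;
}
```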

BlahCuda · June 14, 2010, 4:42pm

Has anyone seen any significant performance boost utilizing concurrent kernel executions?

kappa · June 14, 2010, 5:08pm

(Edit: sorry, the link should have pointed to the topic, not a post. See towards the end of that thread for why I think so, although the whole thread is a good use case.)

I would guess that most people have not. See this discussion for why I think so: [topic=“171067”]Cuda with openMP thread[/topic]

BlahCuda · June 14, 2010, 5:15pm

I don’t think your link is the correct one. The discussion there has nothing to do w/ CUDA with OpenMP.

kappa · June 14, 2010, 5:36pm

This post is just to say that I fixed the link above.

kappa · June 27, 2010, 5:53am

Since nobody else has answered, I guess I will. Here are two Kappa scheduling scripts that differ only in the assignment of a stream to a kernel: one assigns the same stream to all kernel executions, so the kernels execute sequentially, and the other lets Kappa assign different streams, so the kernels execute concurrently. The results I will show are for a GTX470 but should not be significantly different for a GTX480. These results use the CUDA Toolkit 3.1 with Kappa 1.2.0 (and do not forget to use the 256.35 driver; you only get 3.0 Toolkit performance with the 195.xx drivers).

Here is the kernel to be executed (extracted from the NVIDIA SDK, with an ‘extern "C"’ added so the kernel is externally visible):

[codebox]/*
Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
NVIDIA Corporation and its licensors retain all intellectual property and
proprietary rights in and to this software and related documentation.
Any use, reproduction, disclosure, or distribution of this software
and related documentation without an express license agreement from
NVIDIA Corporation is strictly prohibited.
Please refer to the applicable NVIDIA end user license agreement (EULA)
associated with this source code for terms and conditions that govern
your use of this NVIDIA software.
*/

//
// This sample demonstrates the use of streams for concurrent execution
//
// Devices of compute capability 1.x will run the kernels one after another
// Devices of compute capability 2.0 or higher can overlap the kernels
//

extern "C"
__global__ void mykernel( int *a, int n )
{
    int idx = threadIdx.x;
    int value = 1;

    for(int i=0; i<n; i++)
        value *= sin( (float)i ) + tan( (float)i );

    a[idx] = value;
}
[/codebox]

(Save this into a file named: cuda/concurrentKernels.cu).
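For readers without Kappa, the same sequential-versus-concurrent comparison can be sketched with the plain CUDA runtime API. This is a minimal sketch under stated assumptions, not the SDK sample's host code and not what Kappa does internally: the `CHECK` macro and the `run` helper are made up for illustration, the launch sizes (4 blocks of 64 threads, 8 kernels, n = 50000) mirror the values in the scripts below, and timing is plain host wall-clock time around a device synchronize. Kernels can only overlap on compute capability 2.0 or higher, and only when they are issued to different non-default streams.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Same kernel as above.
extern "C" __global__ void mykernel(int *a, int n)
{
    int idx = threadIdx.x;
    int value = 1;
    for (int i = 0; i < n; i++)
        value *= sin((float)i) + tan((float)i);
    a[idx] = value;
}

// Launch nkernels copies of mykernel and return the wall-clock time in ms.
// concurrent == false: every launch goes to streams[0], so they serialize.
// concurrent == true:  each launch gets its own stream and may overlap.
static double run(bool concurrent, cudaStream_t *streams, int nkernels,
                  int *d_a, int nblocks, int nthreads, int n)
{
    CHECK(cudaDeviceSynchronize());   // start from an idle device
    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < nkernels; k++) {
        cudaStream_t s = concurrent ? streams[k] : streams[0];
        mykernel<<<nblocks, nthreads, 0, s>>>(d_a + k * nthreads, n);
    }
    CHECK(cudaGetLastError());        // catch launch errors
    CHECK(cudaDeviceSynchronize());   // wait for every stream to finish
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int nkernels = 8, nblocks = 4, nthreads = 64, n = 50000;

    int *d_a = NULL;
    CHECK(cudaMalloc((void **)&d_a, nkernels * nthreads * sizeof(int)));

    cudaStream_t streams[nkernels];
    for (int k = 0; k < nkernels; k++)
        CHECK(cudaStreamCreate(&streams[k]));

    // Warm-up launch so one-time setup cost is not counted in the first run.
    mykernel<<<1, nthreads>>>(d_a, 1);
    CHECK(cudaGetLastError());

    printf("same stream:      %.3f ms\n",
           run(false, streams, nkernels, d_a, nblocks, nthreads, n));
    printf("separate streams: %.3f ms\n",
           run(true, streams, nkernels, d_a, nblocks, nthreads, n));

    for (int k = 0; k < nkernels; k++)
        CHECK(cudaStreamDestroy(streams[k]));
    CHECK(cudaFree(d_a));
    return 0;
}
```

Each kernel here uses only 4 small blocks, so a single launch leaves most of the GPU idle; that is the situation in which separate streams have room to help.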

Here is the version that is not concurrent (since it is sequential on the same stream):

[codebox]
!CUDA/Kernel
STREAM='astream'
BLOCKSHAPE=[ 4, 64 ]
→ mykernel@concurrent(A,#n) [ A = #n ];

//!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_SPIN} → context;
!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_AUTO} → context;

!Value → nblocks = 4; // number of blocks
!Value → nthreads = 64; // number of threads
!Value → n = 50000;
!Value → nkernels = 8; // number of kernels
!Value → nelements = (8 * 64); // variable A size

!CUDA/Module MODULE_TYPE=%KAPPA{CU_MODULE} → concurrent = 'concurrentKernels.cu';
!Variable VARIABLE_TYPE=%KAPPA{DeviceOnly} → A(#nelements,%sizeof{uint32_t});

!Timer → mult;
!Subroutine → kernelsub;
!Timer → mult;
!Subroutine EXPAND=true LOOP=100 → kernelsub;
!Timer → mult;

!CUDA/Kernel/Attributes MODULE=concurrent → mykernel;
!Print ( 'MaxThreadsPerBlock',
   /kappa/CUDA/concurrent/mykernel#MaxThreadsPerBlock,
   'RegistersPerThread',
   /kappa/CUDA/concurrent/mykernel#RegistersPerThread );
!Print ( 'StaticSharedMemory',
   /kappa/CUDA/concurrent/mykernel#StaticSharedMemory,
   'ConstantMemory',
   /kappa/CUDA/concurrent/mykernel#ConstantMemory,
   'ThreadLocalMemory',
   /kappa/CUDA/concurrent/mykernel#ThreadLocalMemory );
!Print ( 'PTXVersion', /kappa/CUDA/concurrent/mykernel#PTXVersion,
     'BinaryVersion', /kappa/CUDA/concurrent/mykernel#BinaryVersion );

!Free → A;
!CUDA/ModuleUnload → concurrent;
!ContextReset → Context_reset;
//!Context → context;
!Stop;
!Finish;
[/codebox]

And here is the version that executes concurrently (it omits the STREAM='astream' setting that puts all the kernels on the same stream):

[codebox]
!CUDA/Kernel
BLOCKSHAPE=[ 4, 64 ]
→ mykernel@concurrent(A,#n) [ A = #n ];

//!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_SPIN} → context;
!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_AUTO} → context;

!Value → nblocks = 4; // number of blocks
!Value → nthreads = 64; // number of threads
!Value → n = 50000;
!Value → nkernels = 8; // number of kernels
!Value → nelements = (8 * 64); // variable A size

!CUDA/Module MODULE_TYPE=%KAPPA{CU_MODULE} → concurrent = 'concurrentKernels.cu';
!Variable VARIABLE_TYPE=%KAPPA{DeviceOnly} → A(#nelements,%sizeof{uint32_t});

!Timer → mult;
!Subroutine → kernelsub;
!Timer → mult;
!Subroutine EXPAND=true LOOP=100 → kernelsub;
!Timer → mult;

!CUDA/Kernel/Attributes MODULE=concurrent → mykernel;
!Print ( 'MaxThreadsPerBlock',
   /kappa/CUDA/concurrent/mykernel#MaxThreadsPerBlock,
   'RegistersPerThread',
   /kappa/CUDA/concurrent/mykernel#RegistersPerThread );
!Print ( 'StaticSharedMemory',
   /kappa/CUDA/concurrent/mykernel#StaticSharedMemory,
   'ConstantMemory',
   /kappa/CUDA/concurrent/mykernel#ConstantMemory,
   'ThreadLocalMemory',
   /kappa/CUDA/concurrent/mykernel#ThreadLocalMemory );
!Print ( 'PTXVersion', /kappa/CUDA/concurrent/mykernel#PTXVersion,
     'BinaryVersion', /kappa/CUDA/concurrent/mykernel#BinaryVersion );

!Free → A;
!CUDA/ModuleUnload → concurrent;
!ContextReset → Context_reset;
//!Context → context;
!Stop;
!Finish;
[/codebox]

and here are the results for

1> nonconcurrent:

[codebox]/usr/bin/time ikappa k/nvnonconcurrent.k
Processing time: 0.141728 (ms)
MaxThreadsPerBlock 1024 RegistersPerThread 18
StaticSharedMemory 0 ConstantMemory 24 ThreadLocalMemory 4
PTXVersion 10 BinaryVersion 20
Processing time: 4486.72 (ms)
1.40user 0.17system 0:04.72elapsed 33%CPU (0avgtext+0avgdata 81376maxresident)k
0inputs+0outputs (0major+1867minor)pagefaults 0swaps
[/codebox]

and for:

2> concurrent:

[codebox]/usr/bin/time ikappa k/nvconcurrent.k
Processing time: 0.013824 (ms)
MaxThreadsPerBlock 1024 RegistersPerThread 18
StaticSharedMemory 0 ConstantMemory 24 ThreadLocalMemory 4
PTXVersion 10 BinaryVersion 20
Processing time: 391.836 (ms)
0.24user 0.08system 0:00.61elapsed 52%CPU (0avgtext+0avgdata 81392maxresident)k
0inputs+0outputs (0major+1867minor)pagefaults 0swaps
[/codebox]

The first ‘Processing time’ in each result is for executing the kernel once; the second ‘Processing time’ is for executing it 100 times (the LOOP parameter to the subroutine expansion).

(With CUDA Toolkit 3.1 and driver 195.xx, the concurrent times were around 1200 ms.)

So this shows a speedup of:

4486.72 / 391.836 = 11.45

You can also do the calculation that NVIDIA does in their example:

391.836 / (100 * 0.013824) = 283

which is the formula:

N iteration result / ( number iterations * single iteration result)

but I do not believe that result.

Just for fun, you can try adding:

CACHE=%CUDA{CU_FUNC_CACHE_PREFER_L1}

or

CACHE=%CUDA{CU_FUNC_CACHE_PREFER_SHARED}

similar to:

[codebox]!CUDA/Kernel
BLOCKSHAPE=[ 4, 64 ]
CACHE=%CUDA{CU_FUNC_CACHE_PREFER_SHARED}
→ mykernel@concurrent(A,#n) [ A = #n ];
[/codebox]

to try the two different cache/shared memory options.
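Outside Kappa, the runtime API counterpart of these settings is `cudaFuncSetCacheConfig`, with `cudaFuncCachePreferL1` and `cudaFuncCachePreferShared`. A small sketch follows, assuming the kernel lives in the same source file; the `dummy` kernel is a placeholder made up for this example and is not from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, only here so the example is self-contained.
__global__ void dummy(int *a)
{
    a[threadIdx.x] = threadIdx.x;
}

int main()
{
    // Ask for the larger-L1 split for this kernel. This is only a
    // preference; the runtime may pick another configuration if the
    // kernel needs it to run.
    // Use cudaFuncCachePreferShared to favor shared memory instead.
    cudaError_t err = cudaFuncSetCacheConfig(dummy, cudaFuncCachePreferL1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncSetCacheConfig: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    int *d_a = NULL;
    if (cudaMalloc((void **)&d_a, 64 * sizeof(int)) != cudaSuccess)
        return 1;

    dummy<<<1, 64>>>(d_a);            // runs with the requested preference
    err = cudaDeviceSynchronize();

    cudaFree(d_a);
    return err == cudaSuccess ? 0 : 1;
}
```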

Also try changing (adding) the STREAM_POOL_SIZE configuration setting for Kappa in the kappa.conf file:

[codebox][/Kappa]
PROCESSES_INCREMENT=8
STREAM_POOL_SIZE=128
[/codebox]

It is safe to set STREAM_POOL_SIZE to a large value (2048 is fine, for example), since real CUDA streams are only created if your program can make use of them.

You can try this using the free demo license of Kappa from psilambda.com. You can try your own kernels (assuming you have a GF100 class card). These scheduling scripts show you the JIT-compiled attributes for the kernel as actually used by the GPU, which can be useful if you are trying to tweak your kernel to execute more concurrently.
