Number of concurrent kernel executions on GTX480

One of the new CUDA features of the new GTX4xx family is concurrent kernel executions. I have gone through several of the reviews and official documentation but I can not find any numbers. Can anybody tell me how many concurrent kernel executions the GTX480 and GTX470 support? Thanks.

If I understand it correctly it can run 4 kernels simultaneously (hopefully good for PhysX games).
These reviews - are usually game reviews.

Thanks for the reply. Do you have a source for that?

The 3.0 programming guide says “at most four”. It’s likely 2 or 1 for the followon lower end parts with fewer SMs.

There’s a severe restriction about concurrent kernels… they must all be from the same context! So if you have two CUDA apps, you couldn’t run them both at once on the same GPU for example. The OS display can also not run concurrently with kernels, so there’s still a kernel watchdog timer.

In 3.0, it’s limited to four. In a future software release, it will be higher.

But is that four concurrent kernels per multiprocessor? Or is it four concurrent kernels in the whole GPU?

For the whole GPU. CUDA 3.1 beta ups this limit to 16.

Has anyone seen any significant performance boost utilizing concurrent kernel executions?

(Edit: sorry–should have been topic–not post–see down towards then end of the thread for why I think so–although the whole thread is a good use case)

I would guess that most people have not–see this discussion for why I think so: [topic=“171067”]Cuda with openMP thread[/topic]

I don’t think your link is the correct one. The discussion there has nothing to do w/ CUDA with OpenMP.

This post is just to say that I fixed the link above.

Since nobody else has answered–I guess I will. Here are two Kappa scheduling scripts that only differ by the assignment of a stream to a kernel–one has the same stream assigned to all kernel executions so that the kernels execute sequentially and the other allows Kappa to assign different streams so that the kernels execute concurrently. The results I will show are for a gtx470 but should not be significantly different for a gtx480. Also these results are using the CUDA ToolKit 3.1 with Kappa 1.2.0 (and do not forget to use the 256.35 driver–you only get 3.0 ToolKit performance with the 195.xx drivers).

Here is the kernel to be executed (extracted from the NVIDIA SDK with a ‘extern “C”’ added so the kernel is externally visible):

[codebox]/*

  • Copyright 1993-2010 NVIDIA Corporation. All rights reserved.

  • NVIDIA Corporation and its licensors retain all intellectual property and

  • proprietary rights in and to this software and related documentation.

  • Any use, reproduction, disclosure, or distribution of this software

  • and related documentation without an express license agreement from

  • NVIDIA Corporation is strictly prohibited.

  • Please refer to the applicable NVIDIA end user license agreement (EULA)

  • associated with this source code for terms and conditions that govern

  • your use of this NVIDIA software.

*/

//

// This sample demonstrates the use of streams for concurrent execution

//

// Devices of compute capability 1.x will run the kernels one after another

// Devices of compute capability 2.0 or higher can overlap the kernels

//

extern “C”

global void mykernel( int *a, int n )

{

int idx = threadIdx.x;

int value = 1;

for(int i=0; i<n; i++)

    value *= sin( (float)i ) + tan( (float)i ); 

a[idx] = value;

}

[/codebox]

(Save this into a file named: cuda/concurrentKernels.cu).

Here is the version that is not concurrent (since it is sequential on the same stream):

[codebox]

!CUDA/Kernel

STREAM='astream'

BLOCKSHAPE=[ 4, 64 ]

-> mykernel@concurrent(A,#n) [ A = #n ];

//!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_SPIN} -> context;

!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_AUTO} -> context;

!Value -> nblocks = 4; // number of blocks

!Value -> nthreads = 64; // number of threads

!Value -> n = 50000;

!Value -> nkernels = 8; // number of kernels

!Value -> nelements = (8 * 64); // variable A size

!CUDA/Module MODULE_TYPE=%KAPPA{CU_MODULE} -> concurrent = ‘concurrentKernels.cu’;

!Variable VARIABLE_TYPE=%KAPPA{DeviceOnly} -> A(#nelements,%sizeof{uint32_t});

!Timer -> mult;

!Subroutine -> kernelsub;

!Timer -> mult;

!Subroutine EXPAND=true LOOP=100 -> kernelsub;

!Timer -> mult;

!CUDA/Kernel/Attributes MODULE=concurrent -> mykernel;

!Print ( ‘MaxThreadsPerBlock’,

   /kappa/CUDA/concurrent/mykernel#MaxThreadsPerBlock,

   'RegistersPerThread', 

   /kappa/CUDA/concurrent/mykernel#RegistersPerThread );

!Print ( ‘StaticSharedMemory’,

   /kappa/CUDA/concurrent/mykernel#StaticSharedMemory,

   'ConstantMemory', 

   /kappa/CUDA/concurrent/mykernel#ConstantMemory,

   'ThreadLocalMemory', 

   /kappa/CUDA/concurrent/mykernel#ThreadLocalMemory );

!Print ( ‘PTXVersion’, /kappa/CUDA/concurrent/mykernel#PTXVersion,

     'BinaryVersion', /kappa/CUDA/concurrent/mykernel#BinaryVersion );

!Free -> A;

!CUDA/ModuleUnload -> concurrent;

!ContextReset -> Context_reset;

//!Context -> context;

!Stop;

!Finish;

[/codebox]

and here is the version (missing the: STREAM=‘astream’ to put the kernels all on the same stream) that executes concurrently:

[codebox]

!CUDA/Kernel

BLOCKSHAPE=[ 4, 64 ]

-> mykernel@concurrent(A,#n) [ A = #n ];

//!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_SPIN} -> context;

!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_AUTO} -> context;

!Value -> nblocks = 4; // number of blocks

!Value -> nthreads = 64; // number of threads

!Value -> n = 50000;

!Value -> nkernels = 8; // number of kernels

!Value -> nelements = (8 * 64); // variable A size

!CUDA/Module MODULE_TYPE=%KAPPA{CU_MODULE} -> concurrent = ‘concurrentKernels.cu’;

!Variable VARIABLE_TYPE=%KAPPA{DeviceOnly} -> A(#nelements,%sizeof{uint32_t});

!Timer -> mult;

!Subroutine -> kernelsub;

!Timer -> mult;

!Subroutine EXPAND=true LOOP=100 -> kernelsub;

!Timer -> mult;

!CUDA/Kernel/Attributes MODULE=concurrent -> mykernel;

!Print ( ‘MaxThreadsPerBlock’,

   /kappa/CUDA/concurrent/mykernel#MaxThreadsPerBlock,

   'RegistersPerThread', 

   /kappa/CUDA/concurrent/mykernel#RegistersPerThread );

!Print ( ‘StaticSharedMemory’,

   /kappa/CUDA/concurrent/mykernel#StaticSharedMemory,

   'ConstantMemory', 

   /kappa/CUDA/concurrent/mykernel#ConstantMemory,

   'ThreadLocalMemory', 

   /kappa/CUDA/concurrent/mykernel#ThreadLocalMemory );

!Print ( ‘PTXVersion’, /kappa/CUDA/concurrent/mykernel#PTXVersion,

     'BinaryVersion', /kappa/CUDA/concurrent/mykernel#BinaryVersion );

!Free -> A;

!CUDA/ModuleUnload -> concurrent;

!ContextReset -> Context_reset;

//!Context -> context;

!Stop;

!Finish;

[/codebox]

and here are the results for

1> nonconcurrent:

[codebox]/usr/bin/time ikappa k/nvnonconcurrent.k

Processing time: 0.141728 (ms)

MaxThreadsPerBlock 1024 RegistersPerThread 18

StaticSharedMemory 0 ConstantMemory 24 ThreadLocalMemory 4

PTXVersion 10 BinaryVersion 20

Processing time: 4486.72 (ms)

1.40user 0.17system 0:04.72elapsed 33%CPU (0avgtext+0avgdata 81376maxresident)k

0inputs+0outputs (0major+1867minor)pagefaults 0swaps

[/codebox]

and for:

2> concurrent:

[codebox]/usr/bin/time ikappa k/nvconcurrent.k

Processing time: 0.013824 (ms)

MaxThreadsPerBlock 1024 RegistersPerThread 18

StaticSharedMemory 0 ConstantMemory 24 ThreadLocalMemory 4

PTXVersion 10 BinaryVersion 20

Processing time: 391.836 (ms)

0.24user 0.08system 0:00.61elapsed 52%CPU (0avgtext+0avgdata 81392maxresident)k

0inputs+0outputs (0major+1867minor)pagefaults 0swaps

[/codebox]

The first ‘Processing time’ for each result is for executing the kernel once–the second ‘Processing time’ is for executing it 100 times (the LOOP parameter to the subroutine expansion).

(With CUDA ToolKit 3.1/Driver 195.xx, the concurrent times were around 1200 (ms)).

So this shows a speed up of:

4486.72 / 391.836 = 11.45

You can also do the calculation that NVIDIA does in their example:

391.836 / (100 * 0.013824) = 283

which is the formula:

N iteration result / ( number iterations * single iteration result)

but I do not believe that result.

Just for fun, you can try adding:

CACHE=%CUDA{CU_FUNC_CACHE_PREFER_L1}

or

CACHE=%CUDA{CU_FUNC_CACHE_PREFER_SHARED}

similar to:

[codebox]!CUDA/Kernel

BLOCKSHAPE=[ 4, 64 ]

CACHE=%CUDA{CU_FUNC_CACHE_PREFER_SHARED}

-> mykernel@concurrent(A,#n) [ A = #n ];

[/codebox]

to try the two different cache/shared memory options.

Also try changing (adding) the STREAM_POOL_SIZE configuration setting for Kappa in the kappa.conf file:

[codebox][/Kappa]

PROCESSES_INCREMENT=8

STREAM_POOL_SIZE=128

[/codebox]

it is safe to set the STREAM_POOL_SIZE large (2048 is fine for example) since real CUDA streams are only created if your program can make use of them.

You can try this using the free demo license of Kappa from psilambda.com. You can try your own kernels (assuming you have a GF100 class card). These scheduling scripts show you the JIT compiled attributes for the kernel as actually used by the GPU–this can be useful if you are trying to tweak your kernel to execute more concurrently.