Concurrent kernels running on one device

When will CUDA support multiple kernels running concurrently on one device? Has this been taken into consideration by Nvidia? Also, if more kernels could run concurrently, will some other synchronization mechanism like mutexes and semaphores be provided? Thanks.

it’s supported on Fermi.

Right now, concurrent kernel execution isn’t supported at all. The next generation Fermi architecture is supposed to support multiple kernel execution, but there are no inter-kernel communication mechanisms that I am aware of.

Thanks a lot for the reply :">

Inter-kernel communication is not supported under pain of me glaring at you really hard.

Which reminds me: The programming guide only says that up to 4 kernels can run simultaneously on the compute 2.0 devices. There does not seem to be any control as to how chip resources are divided between the kernels. Is there some kind of scheduler in the driver that decides how to distribute queued kernels over the multiprocessors? Is it round-robin, load-balancing, something else? I assume we are guaranteed that a stream can’t starve, but probably nothing else.

Another question is about the cache and how it's handled.

Is it only valid within a kernel, or also across several calls of the same kernel? This is very important if you need global synchronization: then you have to call your kernel multiple times, and it would be very bad if the cache were invalidated for every new kernel launch.

Is there any information about this?

The programming guide has a section about using streams and passing the stream to the kernel launch as the fourth execution configuration parameter:

<<<gridDim, blockDim, 0, streamID>>>

Apparently on the basis that kernels from different streams could execute concurrently. Is this just a placeholder for the Fermi architecture, then? I implemented it and it gave a really insignificant performance gain, so I'm guessing it is serializing in the background.
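
Roughly, what I tried looks like the sketch below. The kernel name, the data pointer and N are just placeholders standing in for my real code, not something from the guide:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

// Split the work in half and launch each half into its own stream,
// hoping the two launches overlap (on current hardware they serialize).
dim3 block(256);
dim3 grid((N / 2 + block.x - 1) / block.x);
for (int i = 0; i < 2; ++i)
    myKernel<<<grid, block, 0, stream[i]>>>(d_data + i * (N / 2), N / 2);

cudaThreadSynchronize();               // wait for both streams to finish
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);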

This relates in my mind to the matrix multiplication example in the guide: is the example merely a demonstration of how to use shared memory, with no reason to use it in the real world? In the real world I would want to multiply thousands or millions of matrices at the same time, in which case shared memory is useless, as there's not enough of it. And if kernels cannot execute concurrently, serializing the CUDA multiplications is no better than doing it all on the CPU? This is continually the problem I'm having: I cannot find any scenario in which to use shared memory, because the block sizes of my kernels are too large. Let's say I've got a Bayer conversion on an image; the image is 1024x768 pixels = 786432 kernel "iterations". That leaves less than 1 byte per pixel of shared memory… ?

So I start to think maybe I should have lots of smaller serialized kernels like the matrix multiplication, but then I would have to construct a framework to subdivide the image into serialized sections small enough to utilize shared memory, which just doesn't seem right on a processor that can support so many parallel operations. Plus, the CUDA examples (denoise) seem to do things the way I do.

James

Streams have been around for a long time, and their main purpose is to enable asynchronous copy operations and kernel/copy overlap (there is a simpleStream example in the SDK which illustrates how it works). NVIDIA have adapted the existing streams API to permit concurrent kernel execution with Fermi, but that wasn't the original intention (or maybe it was, but the hardware just lagged 5 years behind the API design…)
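
The basic pattern that example demonstrates looks roughly like the sketch below; the kernel name, buffer names and sizes are illustrative only, and the host buffers must be allocated with cudaMallocHost (pinned) for cudaMemcpyAsync to actually be asynchronous:

const int nStreams = 2;
const int chunk = N / nStreams;        // assumes N divides evenly

cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&stream[i]);

// Each stream copies its chunk in, processes it, and copies it back out;
// the copy in one stream can overlap the kernel in another.
for (int i = 0; i < nStreams; ++i) {
    int offset = i * chunk;
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);
    process<<<chunk / 256, 256, 0, stream[i]>>>(d_in + offset, d_out + offset);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaThreadSynchronize();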

You can’t launch a block with 1024x768 threads anyway.

Parallelism in CUDA (and OpenCL for that matter) should be considered at two levels: blocks, and threads within a block. You can't write a really well-performing app that runs only a single block. One block maps to a single multiprocessor in hardware, and a modern GPU has anywhere between 16 and 30 of them. You want hundreds if not thousands of blocks (more than the number of MPs, for latency hiding and scaling over future architectures).

So in your case you could partition the image into, say, 16x16 patches (256 threads within a block, lots of per-block shared memory).
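
A rough sketch of that kind of launch, with a made-up kernel name standing in for whatever your per-pixel conversion actually does:

// One 16x16 block per 16x16 image tile; each thread handles one pixel.
__global__ void bayerToRGB(const unsigned char *in, uchar4 *out,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // ... demosaic the pixel at (x, y), possibly staging a tile of the
    // input in shared memory so neighbouring reads are cheap ...
}

// Host side: 64 x 48 = 3072 blocks of 256 threads cover 1024x768 pixels.
dim3 block(16, 16);
dim3 grid(1024 / 16, 768 / 16);
bayerToRGB<<<grid, block>>>(d_raw, d_rgb, 1024, 768);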

With Fermi, there will be a third optional level of parallelism to consider: kernel concurrency. This is orthogonal to what I've written above; you still won't get to write an effective single-block kernel for processing 1024x768 images.

Yep, I subdivide into 16x16 blocks, width/16 and height/16 = number of threads. That doesn't really affect my point about shared memory, because it's still shared over the entire block/thread subdivision, right? Or are you saying that the threads are serialized within the block, and I can guarantee a thread runs from start → end without swapping to another thread in the same block mid-execution?

I realize this is a rather naive/basic question, gotta start somewhere though ;)

Thanks,

James

Shared memory is counted per block, i.e. you have 16 KB per block.

When you declare

__shared__ float array[32];

array is allocated once per block (not per thread as the usual code semantics would suggest). Each thread within a given block will see the same array. It is your responsibility to handle any race conditions you might encounter due to shared memory.
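
For illustration only, a minimal sketch (not your code) of what that implies; the barrier is what prevents the read-after-write race between threads of the same block:

// Each block reverses its own 256-element chunk through shared memory.
__global__ void reverseInBlock(const float *in, float *out)
{
    __shared__ float tile[256];        // one copy per block, assumes 256 threads

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];       // every thread fills one slot

    __syncthreads();                   // wait until the whole tile is written

    out[gid] = tile[blockDim.x - 1 - threadIdx.x];   // read another thread's slot
}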

Threads from a given block run in parallel on a multiprocessor in groups of 32 called warps. Warps can get timesliced (ex. when they stall waiting for a memory fetch). You should not assume any particular order in which blocks (and warps within blocks) are scheduled for execution.

Excellent explanation, thanks enormously! I’ll have a re-think of the problem & try again.

James

seibert, could you point me to the page in the programming guide about concurrent kernel support for compute 2.0 devices?

I cannot find it in the document myself. And I remember that the Fermi white paper says it supports up to 16 concurrent kernels.

It is on page 35 of the February 2010 version.

Can you do something to change this state? I really busted my ass off trying to code a producer-consumer program in CUDA, but it seems to be mission impossible.

It would require some kind of efficient signaling mechanism between kernels, which definitely is outside the current CUDA programming style. (If you start thinking about how you would define such a mechanism in a SIMT model so that it would both make sense and also not always have terrible performance, you discover that even defining what you want is non-trivial.)

That said, producer-consumer is a convenient way to achieve coarse-grained parallelism. It is just hard to map to a data-parallel architecture, where you usually collapse the producer-consumer chain into one task, but then apply it to multiple inputs at once.
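
As a sketch of what I mean by collapsing the chain (produce and consume here are made-up stand-ins for the two stages, not a real API):

__device__ float produce(float x) { return x * x; }      // stand-in stage 1
__device__ float consume(float y) { return y + 1.0f; }   // stand-in stage 2

// One thread runs the whole producer-consumer chain for its own input,
// so no inter-kernel signalling is needed.
__global__ void producerConsumer(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = consume(produce(in[i]));
}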

Well… thanks a lot for the reply :rolleyes: