2 kernel launch or one kernel with multiple blocks reading global memory?

I have code which currently launches 2 kernels and would like to consolidate them to one, if it will run faster.

The current config is the first kernel is 1 block x 30 x 34 threads and the second kernel is 144 blocks of 30 x 34 threads. The 2nd kernel depends on the results from the first.

What I am trying to do is merge the 1st kernel block into each of the 144 blocks of the 2nd kernel, but that means all those blocks would be accessing the same global memory in parallel.

In general would it be faster to have 2 kernel launches, or 1 kernel launch which accesses global memory in parallel?

First off, you shouldn’t use a thread number that’s not a multiple of the warp size if you can avoid it. In your case, you’re going to have Two warps of 64 total threads, but the last 30 threads are not doing anything. To answer your question, it sounds like you’re already hitting the same global memory spot, right? If you’re simply launching a second kernel, I’m assuming all of the blocks are hitting the same memory location when it runs, so you’re going to have a lot of instruction replays. How much data are you dealing with? If you’re not modifying the data you may be able to get some performance improvement by copying it into the constant cache, but that’s limited to 64KB.

I would have thought that, given that you have 2 kernels, with the latter dependent on the results of the former, that you would permit this via global memory, regardless - I fail to see a significant difference in global memory access when changing from 2 kernels to only 1…?

My view would be that much depends on the complete program cycle - if the 2nd kernel does not loop, you are likely to exchange savings in overhead at the kernel level for overhead at the sub-kernel level, and it would likely be cleaner to leave the kernels separate
You are not very explicit as to the reason for the 1 : 144 block ratio - I believe this may be key determining factor

On the other hand, merging the 2 kernels may imply data reuse via shared memory, and this may have speed benefits

First, thank you for responding.

The first block has 30x34 threads for a total of 1020. This is sized for the data input. In fact the data had to be reduced from 30x35 in order to fit in the block and use shared memory for reduction. I know it could handle the larger data size with greater complexity of code.

Each of those 1020 threads accesses global memory for immediate transfer to shared. If that block is redundantly implemented in each of the 144 blocks of the 2nd kernel, all 144 blocks would be accessing the same global memory on launch.

Are you saying that only one block should transfer the 1020 global memory floats to constant memory and the rest should read from there? And that will run faster? That would involve synchronization between blocks, which is what I already have with the 2 block launch.

I know that launching 2 kernels takes longer than one. What I am uncertain about is how much delay is incurred when multiple blocks are reading the same global memory, instead of one block.

BTW, the 144 blocks of the 2nd kernel still have to access other global memory space, just not ALL the exact same addresses, except for the global memory that is prepped by the 1st kernel; each of 144 blocks, still has to read that data, but it is reduced in size by the first kernel.

So I guess there is a lot of parallel global mem reading going on, either way.

The difference is how many blocks will be reading the exact same memory addresses.

I want to consolidate the kernels in order to make concurrent execution cleaner using streams.

The prep for each of the 144 blocks is identical and fits in one block.

When merged, both (previously separate) blocks can use the same shared memory. I assume that saves some shared memory allocation time?

As posted above, I believe the key is how much delay is incurred when multiple blocks read the same global memory addresses.

I would think that there are methods to duplicate the same array in memory a number of times
For a large thread block sharing the same shared memory, this may result in speed gains
But for global memory this would be pointless - reading the same array in global memory a number of times should be the same as reading the same array duplicated a number of times

At the same time, why would you be slowed down that much? Given the stipulated size of the thread blocks, only a number of blocks would run concurrently, and you would need multiple GPUs to run all 144 blocks simultaneously, not so?

Again, based on the size of a thread block, I interpret merging the blocks to mean duplicating the prep within each of the 144 blocks - exchanging post-prep global memory for pre-prep global memory; more or less the same in magnitude I would think

So I guess the crux of the issue is, does CUDA serialize reading of the same global memory addresses by different blocks executing concurrently, or can they read (and coalesce) in parallel?

The large part of my prep kernel is reducing 1020 floats down to 4 ints plus a different 1020 floats down to 1 float (mean), which is a substantial consolidation of data.

In addition to using 2 kernels, you could very well consider dynamic parallelism to have the device rather than the host carry the overhead of launching the 2nd kernel
And as shaklee3 pointed out, you have the option of constant cache

Still, I fail to see a significant speed difference in passing the values to the blocks via shared memory versus via global memory
In my view, global memory bandwidth - not only bus width, but also frequency - makes up for global memory latency; blocks would wait for global memory, but all blocks can have their global memory requests serviced in close array, even if the requests are serialized
Having the overhead the one way or the other would likely not affect overall execution time much, regardless of the execution time of the 2nd kernel
The overhead should represent a relatively short time; so, if the 2nd kernel takes a relatively short time to execute, it would still imply a relatively short overall execution time
If the 2nd kernel takes a relatively long time to execute, the overhead would represent a small portion of overall execution time
Not so?

Well, I finally implemented both approaches, and it appears to take 10-15% longer to do redundant data reduction in one kernel vs 2 kernels; 1 for data reduction and another for processing.

So whatever is gained by launching only one kernel, is completely lost, plus more, by multiple blocks trying the read the same global memory in parallel.

I will have to further investigate the constant cache issue.

What is the average calculation time of (a block of) the 2nd kernel, do you know?

If you are truly serious about consolidating your kernels, perhaps also consider the following approach; it may or may not align with what your kernel is supposed to do, you would know best

One - the conventional - design philosophy is to have one thread block do the work pertaining to one block
Its exact counter design philosophy would be to NOT have one thread block do the work pertaining to one block - to have a thread block do the work pertaining to multiple blocks

This is how you implement:
In your particular case, you have 144 blocks (as part of the 2nd kernel), or the equivalent of 144 blocks of work
Predetermine how many blocks you would be able to run simultaneously - my GPU has 15 SMs and if I read your shared memory requirements correctly, my GPU would be able to run 15 of your blocks simultaneously
Divide the number of blocks of work you have up between the number of blocks you can run simultaneously - in your case, 144 / 15; do not throw away the residual, you have to distribute it among the blocks
Launch as many (identical) kernels as blocks you can run simultaneously in separate streams - in your case, launch 15 kernels in separate streams
Each kernel only has a grid size (number of blocks) of 1
To each kernel, pass an index and block count - you can easily derive the index from the for loop you are to use to launch the multiple kernels; the block count is the number of blocks you wish the kernel to process
Each kernel’s block would use the index to determine the block offset of the blocks it must process, and each kernel’s block would loop until it has processed all the blocks to process required of it

The approach may pose benefits when you have a) a large number of blocks to process, b) the possibility of data reuse between blocks, c) very conditional algorithms to implement

In your case, for example, you would only need the prep data 15 times, not 144 times, as you can now rely on shared memory more extensively - multiple blocks now have access to the same shared memory

And it can be extended to 2d and 3d grids

The whole 2 kernel implementation takes ~1.8ms on a Titan; I imagine the great majority of that is for the 2nd kernal. I guess each block takes 1/144th of that.

I was not aware that multiple blocks can access the same shared memory. How does that work?

ETA: OK I read your comment again and understand you’re suggesting to process the blocks serially using just the number that can execute simultaneously to process in parallel.

Thanks for the suggestion.

Multiple blocks can not access the same shared memory; but if a block process the work of multiple blocks, they can - but this would be a form of sequential access rather than concurrent access
Still, in many cases this is sufficient

I would think that a block takes 15/144 * 1.8ms, seeing that 15 blocks can likely run simultaneously (or 14, how many SMs per titan - 14 or 15?)

Perhaps you might also see speed improvements, if you can manage to bring you block size down

Based on your suggestion, and the fact that I need to launch thousands of this sequence of kernels, I’ve decided to try to consolidate these 2 kernels with the post processing kernel, so that the middle (144 block) kernel processes all 144 blocks in serial (actually only 84 as the lower diagonal is redundant). Processing in serial simplifies the indexing so that I’m not launching redundant blocks (for simpler coding) which just exit.

It now seems that the best approach for launching 44,000 of these (consolidated) kernels is to create 14 streams and sequence through them 14 at a time so they can execute concurrently (presuming that all 3 will fit in one kernel).

Can I just dump them all on the card and CUDA will queue them up?

Or do I need to manage my own queues with stream events?

I must admit, you have moved a bit fast in your post; so I am going to work with what I understand from your post

I am not sure whether it is even necessary to sequence; I also think that I would focus on the relationship between the post processing blocks (44k blocks?) to the processing blocks (144/ 84), to decide whether to complete the 144 (84) blocks first, before commencing with the 44k blocks

I shall give you a more practical example pertaining to the initial 144 block problem, and you can see how it impacts the now 44, 000 blocks/ kernels

To process the 144 blocks, you have the option of launching a single kernel, containing 144 blocks
Or, you launch more than 1 kernel, each in its own stream, and divide your blocks between the kernels

Assuming you can simultaneously run 14 blocks, you may launch 14 kernels, each with block count of 1; or, 7 kernels, each with block count of 2… any configuration that will leave you with 14 blocks running concurrently

Now, you want each block running on kernel launch to process (144 / 14) blocks
10 of the blocks running must process 14 blocks, and 4 must process 11 blocks

Do this to launch the kernels:

for cnt = 0; cnt < kernels_to_launch; cnt++
kernel<<<dG,dB,0,kernel_stream[cnt]>>>(cnt,blocks_to_process(cnt), block_offset(cnt)…);

And for the kernel itself, do this:

kernel(unsigned int cnt, unsigned int blocks_to_process, unsigned int block_offset,…)
if (i == 0)
shared blocks_processed = 0;
shared block_processing = 0;

while (blocks_processed < blocks_to_process)

if (i == 0)

Global memory unique to a block, should reference like this: global_memory[(block_offset * size) + (block_processing * size) + i);

To consolidate the prep kernel and pre-processing kernel (144 blocks):

kernel(unsigned int cnt, unsigned int blocks_to_process, unsigned int block_offset,…)
if (i == 0)
shared blocks_processed = 0;
shared block_processing = 0;
shared prep = false;

while (blocks_processed < blocks_to_process)

if (prep == false)

if (i == 0)
prep = true;



if (i == 0)

I really appreciate your suggestions. Thank you.

However, since I have so many of these to process (44K), I now believe that the best way to keep the card fully loaded is to consolidate my 3 kernels (the preprocessing data reduction kernel, the 144(84) block kernel, and the post processing kernel not mentioned previously) into 1 kernel with 1 block, processing the previous 144(84) blocks in serial.

I then have to launch 44K of these.

My thinking right now is to create 14 streams and sequentially launch these 44K kernels, cycling through the streams. This is my first experience with streams.

I just don’t know whether I can just dump them on the card and CUDA will manage the stream queues, of if I need to manage the stream queues with events. Some of these kernels will finish rapidly doing nothing and others will take longer, depending on the data reduction checks in what was previously the first kernel.

I apologize if my description is too cursory or cryptic.

Streams are much like pipes - you fill from above, and cuda (the gpu) draws from the bottom
Streams form an excellent way to better order or streamline your work - work in the same stream execute sequentially - in series - but work across streams can execute concurrently (there are conditions however)

You would then have to monitor the individual streams, and re-launch kernels as the streams complete their work (using callbacks or stream events I believe), or you would have to sub-divide the kernels between from the start and launch multiple kernels per stream from the word go
You could also merely launch 1 kernel with 1 block in 14 streams, but have each process 44K / 14 in sequence, again by passing block count and offset data into each kernel

The 144 block example forms an excellent demonstration of what you could expect; you could extent it to the 44k block case:

(Kernel 1 of) stream 1 processes block 1, then block 2, then block 3,… then block 10, and terminates
At the very same time, (kernel 2 of) stream 2 processes block 11, then block 12, then block 13,… then block 20, and terminates
At the very same time, (kernel 3 of) stream 3 processes block 21, then block 22, then block 23,… then block 30, and terminates

At the very same time, (kernel 14 of) stream 14 processes block 134, then block 135, then block 136,… then block 144, and terminates