I must admit, your post moves a bit fast, so I am going to work with what I understand from it.
I am not sure whether it is even necessary to sequence. I would instead focus on the relationship between the post-processing blocks (44k blocks?) and the processing blocks (144/84), to decide whether to complete the 144 (84) blocks first, before commencing with the 44k blocks.
I shall give you a more practical example pertaining to the initial 144 block problem, and you can see how it impacts the now 44,000 blocks/kernels.
To process the 144 blocks, you have the option of launching a single kernel containing 144 blocks.
Or, you launch more than 1 kernel, each in its own stream, and divide your blocks between the kernels.
Assuming you can simultaneously run 14 blocks, you may launch 14 kernels, each with a block count of 1; or 7 kernels, each with a block count of 2… any configuration that will leave you with 14 blocks running concurrently.
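Now, you want each of the 14 blocks running at kernel launch to process its share of the 144 blocks: 144 / 14 is 10, with a remainder of 4.
So 10 of the running blocks must process 10 blocks each, and 4 must process 11 blocks each (10 * 10 + 4 * 11 = 144).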
Do this to launch the kernels:
for (unsigned int cnt = 0; cnt < kernels_to_launch; cnt++)
{
    kernel<<<dG, dB, 0, kernel_stream[cnt]>>>(cnt, blocks_to_process(cnt), block_offset(cnt)…);
}
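The launch loop above relies on blocks_to_process(cnt) and block_offset(cnt) without saying how they are computed. A minimal host-side sketch follows, assuming the 144 / 14 split from the example and that the first 4 kernels are the ones that take the extra block. The constant total_blocks and both function bodies are my assumptions for illustration, not something taken from your code:

```cpp
#include <cassert>

// Assumed values from the example: 144 logical blocks, 14 kernels of 1 block each.
constexpr unsigned int total_blocks = 144;
constexpr unsigned int kernels_to_launch = 14;

// Number of logical blocks that kernel number cnt works through.
// The first (total_blocks % kernels_to_launch) kernels take one extra block.
unsigned int blocks_to_process(unsigned int cnt)
{
    assert(cnt < kernels_to_launch);
    const unsigned int base = total_blocks / kernels_to_launch;
    const unsigned int extra = total_blocks % kernels_to_launch;
    return base + (cnt < extra ? 1u : 0u);
}

// Index of the first logical block owned by kernel number cnt.
unsigned int block_offset(unsigned int cnt)
{
    assert(cnt < kernels_to_launch);
    const unsigned int base = total_blocks / kernels_to_launch;
    const unsigned int extra = total_blocks % kernels_to_launch;
    // Every earlier kernel contributed base blocks, plus one more for each
    // earlier kernel that took an extra block.
    return cnt * base + (cnt < extra ? cnt : extra);
}
```

With these definitions, kernels 0 to 3 each take 11 blocks and kernels 4 to 13 each take 10, and the ranges are contiguous and cover all 144 blocks.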
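And for the kernel itself, do this (i is the thread's index within the block, e.g. threadIdx.x):
__global__ void kernel(unsigned int cnt, unsigned int blocks_to_process, unsigned int block_offset,…)
{
    __shared__ unsigned int blocks_processed;
    __shared__ unsigned int block_processing;
    if (i == 0)
    {
        blocks_processed = 0;
        block_processing = 0;
    }
    __syncthreads(); // counters are initialised before any thread reads them
    while (blocks_processed < blocks_to_process)
    {
        [block_code]
        __syncthreads(); // all threads are done with this block
        if (i == 0)
        {
            blocks_processed++;
            block_processing++;
        }
        __syncthreads(); // all threads see the updated counters
    }
}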
Global memory that is unique to a block should be referenced like this: global_memory[(block_offset * size) + (block_processing * size) + i];
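To make the indexing concrete, here is a small sketch of the same expression as a helper function. The name global_index is hypothetical, and I am assuming that size is the number of elements each logical block owns and that i is the thread's index within the block:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: flat index of element i of the logical block that a
// running block is currently working on.
// block_offset     : first logical block owned by this running block
// block_processing : how many of its logical blocks it has already finished
// size             : assumed number of elements per logical block
std::size_t global_index(std::size_t block_offset, std::size_t block_processing,
                         std::size_t size, std::size_t i)
{
    assert(i < size); // a thread must stay inside its logical block's slice
    return (block_offset * size) + (block_processing * size) + i;
}
```

If a running block starts at logical block 0 and owns 11 blocks of 256 elements, its last element sits at index 2815, and the next running block (block_offset 11) starts at index 2816, so the slices do not overlap. Inside a .cu file the same function could be marked __host__ __device__ and called from the kernel.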
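To consolidate the prep kernel and the pre-processing kernel (144 blocks):
__global__ void kernel(unsigned int cnt, unsigned int blocks_to_process, unsigned int block_offset,…)
{
    __shared__ unsigned int blocks_processed;
    __shared__ unsigned int block_processing;
    __shared__ bool prep;
    if (i == 0)
    {
        blocks_processed = 0;
        block_processing = 0;
        prep = false;
    }
    __syncthreads(); // shared state is initialised before any thread reads it
    while (blocks_processed < blocks_to_process)
    {
        if (prep == false)
        {
            do_prep();
            write_prep_data_to_shared_memory();
            __syncthreads(); // prep data is complete and every thread has tested prep
            if (i == 0)
                prep = true;
        }
        else
        {
            read_prep_data_from_shared_memory();
        }
        [block_code]
        __syncthreads(); // all threads are done with this block
        if (i == 0)
        {
            blocks_processed++;
            block_processing++;
        }
        __syncthreads(); // all threads see the updated counters and prep flag
    }
}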