Code optimization with CDP and dynamic shared memory allocation

Hello, I am now optimizing my code and trying to increase parallelism. My task: there are 10,000 units that can be handled independently. For now, I use cooperative groups (cg); each block's rank==0 thread fetches data from global memory into shared memory, then all threads in a block handle the same unit. So the launch config is like

<<< numOfUnits, 32 >>>

which means each unit has a block with 32 threads working on it. Now the question is: each unit needs a different amount of shared memory, e.g. unitA needs 1 KB while unitB needs 50 bytes. Currently I use a maximum size that meets every unit's demand, for example:

__shared__ float data[MAX_NUM];

Apparently, I waste a lot of shared memory, which can affect the parallelism.

To optimize, I am thinking about using CDP (CUDA dynamic parallelism). To be more specific, in the parent kernel I calculate each unit's shared memory demand and launch one child kernel per unit with precisely the needed shared memory in the launch parameters, so there is no waste of shared memory. The downside is that there are too many child kernel launches, since each unit gets its own child kernel. I think cudaStreamFireAndForget can mitigate this launch overhead.
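Roughly, the CDP version I have in mind looks like this sketch (childKernel, parentKernel and unitSmemBytes are just placeholder names; it needs relocatable device code, -rdc=true, and a CUDA version that provides cudaStreamFireAndForget):

__global__ void childKernel(int unit)
{
    extern __shared__ float sh[]; // exactly as many bytes as this unit asked for
    // ... 32 threads handle one unit using sh ...
}

__global__ void parentKernel(const int* unitSmemBytes, int numOfUnits)
{
    int unit = blockIdx.x * blockDim.x + threadIdx.x;
    if (unit < numOfUnits)
        // one tiny child grid per unit, sized to its exact shared memory demand
        childKernel<<<1, 32, unitSmemBytes[unit], cudaStreamFireAndForget>>>(unit);
}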

So what’s your opinion? Is this optimization strategy valid? If not, what else I can do?

Why would CDP be necessary?

Calculate the needed shared memory for each unit, store those results in the beginning of shared memory, calculate offsets into shared memory for each unit. Synchronize all threads and then start normal execution.

Alternatively calculate locally for all units, then you do not need any (blocking) synchronization.

You can repurpose the shared memory layout by using different types for access.

1 KB per warp is not a very excessive use of shared memory. Without doing anything special, it would limit you to 48 warps per SM, i.e. 1536 threads, which isn't a huge reduction from the max capacity of the SM and might very well match (Ampere) or exceed (Turing) the warp capacity of the SM. Therefore, before spending a lot of time predicated on the idea that wasted shared memory is what limits your parallelism, I would confirm that first.

Whether you are using CDP or not, launching many small grids is not a good way to deliver work to the GPU.

Thanks for your reply.

I wrote a test code that launches many small kernels via CDP, and it is unacceptably slow, as you said.

I’ll confirm what you’ve mentioned and try another way, thanks!

See also here:

how to create pointers into any position of shared memory with various types.

Thank you, I'll check these links.

Sorry for missing your comment, I only noticed the latest comment.

I think we may have a misunderstanding here: I want to assign a different total amount of shared memory to each block, since I start a kernel like

<<< unitNum, 32 >>>

and in the kernel code I define a shared memory array with a macro that sets a space big enough for all units. I'm aware that a unit which only needs a small portion of shared memory simply doesn't access the rest of it.

Previously, I was thinking that CDP could launch kernels with a different amount of shared memory for each unit, to satisfy each unit's needs. It did, but it caused unacceptable latency. I brought up CDP not for synchronization, but for dynamically allocating shared memory.

You can do the following.

For the case of:

Block 0 needs 2 KiB
Block 1 needs 1 KiB
Block 2 needs 4 KiB
Block 3 needs 10 KiB

Start with

kernel<<<(unitNum + 3) / 4, dim3(32, 4)>>>

Each CUDA block comprises 4 (can/should be increased to more, e.g. 32) of your blocks; your blocks distribute the shared memory among themselves dynamically.
→ The trick is that within the larger CUDA block it is possible to access more shared memory; your actual algorithm still uses 32 threads per (your) block.
→ Each SM runs just one CUDA block (optionally 2) at a time for maximum flexibility.

__global__ void kernel()
{
    __shared__ volatile float shdata[48 * 256]; // e.g. 48 KiB (statically declared shared memory is limited to 48 KiB per block; use dynamic shared memory for more)
    __shared__ volatile int shsize[4];          // size (in floats) needed by each of the 4 sub-blocks

    int block = blockIdx.x * 4 + threadIdx.y;   // global index of "your" block (= unit)
    bool active = (block < MAXNUM);             // inactive threads occur if MAXNUM is not divisible by 4
    int thread = threadIdx.x;                   // your 32 worker threads (x dimension)

    if (active)
        shsize[threadIdx.y] = calculateNeededSharedSizeInFloatsFor(block);
    __syncthreads();                            // make all sizes visible before computing offsets
    if (!active)
        return;

    int offset = 0;
    for (int i = 0; i < threadIdx.y; i++)       // sum the sizes of the preceding sub-blocks
        offset += shsize[i];
    volatile float* myshared = &shdata[offset]; // now you can access myshared[i] like a float array
}

Offset in block 0 is 0.
Offset in block 1 is 512 = 2048/4.
Offset in block 2 is 768 = 512 + 1024/4.
Offset in block 3 is 1792 = 768 + 4096/4.
(The divisions by 4 convert bytes into float elements.)

To choose an individual data type, you can

  • use a different one from the start instead of float or
  • use reinterpret_cast (which strictly would be UB in most cases, but is normally accepted and compiled correctly by nvcc) or
  • use a union type.

Take care of correct alignment for your new data type and whether you want/need a volatile type.
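As an illustration of the reinterpret_cast variant (doubleViewAt is just a made-up helper; the rounding takes care of the 8-byte alignment of double):

#include <cstdint>

// View part of the float pool as a double array.
// Strictly UB per the standard, but nvcc compiles it as expected (see above).
// Add volatile to the returned pointer type if you need it.
__device__ double* doubleViewAt(volatile float* pool, int offsetInFloats)
{
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(pool + offsetInFloats);
    std::uintptr_t mask = alignof(double) - 1;  // 8-byte alignment for double
    addr = (addr + mask) & ~mask;               // round the address up
    return reinterpret_cast<double*>(addr);
}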

If your kernel is very short and those offset calculations would be an undue burden, calculate those outside (in a separate kernel or on CPU) and just read the specific precomputed offset.
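For example (a host-side sketch with assumed names), the offsets are just an exclusive prefix sum over the sizes that restarts for every group of units packed into one CUDA block:

#include <vector>

// offsets[i] = sum of the sizes of the units that come before unit i
// within the same CUDA block (restarts at 0 for every group).
std::vector<int> computeOffsets(const std::vector<int>& sizeInFloats, int unitsPerCudaBlock)
{
    std::vector<int> offsets(sizeInFloats.size());
    for (size_t i = 0; i < sizeInFloats.size(); i++)
        offsets[i] = (i % unitsPerCudaBlock == 0) ? 0
                                                  : offsets[i - 1] + sizeInFloats[i - 1];
    return offsets;
}
// copy `offsets` to the device once (e.g. cudaMemcpy) and let each warp read its own entry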

How do you do this?
Each thread (and each block) of a kernel runs the same source code and uses the same types, just with different data.
You cannot have a different macro resolution depending on block number.
This would not work with CDP either.

You can have differently sized shared memory regions, but the source code has to be the same. If it is just a few variants, of course you can put in all source code variants with if or switch…case.

If you just have a different overall size of an array per block, it would not be a problem. That can be accessed by the same source code.
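A minimal sketch of that case (unitSizeInFloats is an assumed global-memory array holding each unit's demand):

__global__ void sameSourceKernel(const int* unitSizeInFloats)
{
    __shared__ float sh[256];                  // fixed upper bound, as with the macro approach
    int mySize = unitSizeInFloats[blockIdx.x]; // this block only touches the first mySize floats
    for (int i = threadIdx.x; i < mySize; i += blockDim.x)
        sh[i] = 0.0f;                          // ... real work restricted to sh[0 .. mySize-1] ...
}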

Thanks for the detailed explanation and the demo code. To make sure I get your idea right, let me summarize it:
I make each SM run one CUDA block, every CUDA block gets 4 units and dynamically calculates the offsets, and the 4 units are processed within the same CUDA block.

I'm wondering about divergence, because the units differ in how long they take to finish. My previous method may be more fine-grained, because once a block has finished a unit, it can exit immediately and another unit/block comes in. In your one-CUDA-block-for-4-units version, if unitA finishes early, does it have to wait until the other units are finished?

Yes, the macro is a fixed value; every block gets the same amount of shared memory. That's the crux: this value may be too big for some units/blocks.

Yes, you did.

Divergence is most critical within a warp. But you have a point about the CUDA blocks having to wait for the last of your blocks.

You have the following options

  • Simple solution: Try compromises, i.e. more than one CUDA block per SM, but still more than one of your blocks per CUDA block. E.g. if your architecture supports 1536 threads per SM, you can have 48 blocks with 32 threads per SM for maximum occupancy, e.g. 8 CUDA blocks with 6 of your blocks each (as 8 * 6 = 48). A launch sketch follows below.
  • Complicated solution: Instead of exiting, let the finished warps do another work package in a loop. That is complicated by the fact that the other work package perhaps needs a different amount of memory. You can take the maximum shared memory needed (by knowing which tasks each of your blocks will actually do in the loop), you can sort the blocks according to memory requirements to combine tasks with similar requirements, etc. You can round up the memory requirements to better fit another block.

The simple solution should be fine in this case.
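As a sketch of that 8 × 6 compromise (kernel and unitNum are the names from earlier; unitsPerCudaBlock and avgBytesPerUnit are assumed here, the hard-coded 4 inside the earlier kernel would become unitsPerCudaBlock, and the shared memory is passed dynamically instead of as a fixed array):

const int unitsPerCudaBlock = 6;                        // 8 CUDA blocks * 6 units = 48 warps per SM
const int avgBytesPerUnit   = 2048;                     // assumed 2 KiB average per unit
size_t smemBytes = unitsPerCudaBlock * avgBytesPerUnit; // 12 KiB per CUDA block -> 8 fit into 96 KiB

dim3 block(32, unitsPerCudaBlock);                      // x: your 32 threads, y: unit within the CUDA block
int  grid = (unitNum + unitsPerCudaBlock - 1) / unitsPerCudaBlock;
kernel<<<grid, block, smemBytes>>>();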

In general, it is okay for performance if some warps (groups of 32 threads) exit early, as long as enough other warps are still resident. A good minimum is 256 resident threads at any time (except at the very end). That fits with the 8 CUDA blocks per SM in the example of the compromise solution. After one CUDA block exits, another CUDA block starts. For each CUDA block at least one (of your) blocks of 32 threads is still active.

Also keep in mind what @Robert_Crovella said about 1 KiB not being a lot. If your architecture has 96 KiB per SM and supports 48 * 32 threads as a maximum, then each of your blocks has on average 2 KiB of space. If your maximum needed memory is below that number, then just always reserve 2 KiB.
Many architectures only support 16 resident blocks (instead of 48). So even when keeping a fixed shared memory size of 2 KiB, you should consider combining some of your blocks into one CUDA block to increase the number of resident threads.

Sorry, I used the wrong word; I do mean the early exit of my block within the CUDA block.

I'm using an RTX 4090 now. Thanks for your time, I think the simple solution will suffice.

Yes, I understood and answered in that regard.
Strictly speaking it is thread divergence (so your wording was correct). However, in the CUDA world divergence is mostly used with regard to intra-warp behavior.

The RTX 4090 supports (as maximum each)

  • 1536 threads per SM (32 * 48),
  • 1024 threads per CUDA block,
  • 24 CUDA blocks per SM, and
  • 99 KiB of shared memory per CUDA block.

So to use 1536 threads, your CUDA blocks should contain at least 64 threads (1536 / 24 = 64), and on average each of your 32-thread blocks should use no more than 2 KiB of shared memory (2 * 48 = 96 KiB) to keep 48 of them resident at the same time.

If the 2 KiB is unrealistic, use fewer threads per SM than 1536. Even with 256 threads per SM and 12 KiB on average (12 * 8 = 96) you should still be quite fine, keeping 8 blocks of 32 threads resident at the same time.

Give each CUDA block exactly the following amount of shared memory: the average per your block times the number of your blocks per CUDA block.
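One practical detail: if that product exceeds 48 KiB, the memory has to be requested as dynamic shared memory and the kernel has to opt in explicitly. A sketch with assumed numbers (32 of your blocks per CUDA block, 2 KiB average):

const int unitsPerCudaBlock = 32;                               // e.g. 32 of your blocks -> 1024 threads per CUDA block
const int avgBytesPerUnit   = 2048;                             // assumed 2 KiB average per unit
size_t smemBytes = (size_t)avgBytesPerUnit * unitsPerCudaBlock; // 64 KiB per CUDA block
if (smemBytes > 48 * 1024)                                      // above 48 KiB an explicit opt-in is required
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)smemBytes);
int grid = (unitNum + unitsPerCudaBlock - 1) / unitsPerCudaBlock;
kernel<<<grid, dim3(32, unitsPerCudaBlock), smemBytes>>>();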

BTW, could you give me some advice on how to learn to optimize CUDA code in a systematic way? I'm a postgraduate student and have only learned the basic usage of CUDA via blogs and the official docs. I've picked up some tricks like cooperative groups while using NVIDIA's cuCollections by reading its source code, but is there a more efficient way?

Sorry, I do not know of a systematic way.
There are the NVIDIA learning materials and some independent blog posts (as you mentioned).
I would read the Programming Guide, the Best Practices Guide and the PTX ISA. Try to use Nsight Compute and understand each section of its reports.

You may find this resource useful also.

thanks~

Thanks again, I really appreciate your time.