I’m relatively new to CUDA and I have a question about current best practices in using shared memory, in particular with the use of dynamic shared memory variables. The question has two parts:
Part 1:
Is it still best practice to define one shared memory variable and then point into that variable at different positions in its allocated memory to carve out multiple variable spaces?
I ask because most of the references I can find online use a pointer method to carve out variable spaces from a single larger shared memory allocation, BUT, the newest edition of Kirk and Hwu’s book (Programming Massively Parallel Processors - 3rd Ed 2017) makes a reference to multiple variables on page 98.
The Kirk and Hwu example is as follows (noting the missing variable data type in their example):
extern __shared__ Mds;
extern __shared__ Nds;
The size of the shared memory allocation for the kernel is specified in the usual way.
Part 2:
Using the pointer method is fine but it clearly limits the variable to a data type. Are there any methods available to mix types, or is it best practice to force variables through casting?
For example, define one shared variable with data type float and use the pointer method to carve out multiple spaces as needed. If one of the spaces is used to hold data of type int, simply cast the int to a float to store it.
You can have multiple extern __shared__ declarations, but they all alias the very same memory address. This is confusing, but sometimes useful: the declarations can have different types, so you avoid casting every time you access one type versus another. For example, you might use a temporary integer array in shared memory to set up the first stage of your compute and then call __syncthreads() for the block. If the second half of the block's compute no longer needs the shared integers but does need a shared float array, you can initialize and use that array after the __syncthreads().
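To make the aliasing concrete, here is a minimal sketch of that two-stage pattern. The names two_stage, run_two_stage, stage1_ints and stage2_floats are made up for illustration, the launch assumes a single block of n threads, and error checking is left out to keep it short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Two views of the SAME dynamic shared memory block. Both start at the
// same address; only the element type differs.
extern __shared__ int   stage1_ints[];
extern __shared__ float stage2_floats[];

__global__ void two_stage(const int *in, float *out)
{
    int tid = threadIdx.x;
    int n   = blockDim.x;

    // Stage 1: treat the buffer as integers.
    stage1_ints[tid] = in[tid];
    __syncthreads();

    // Pull the value needed later into a register.
    int neighbour = stage1_ints[(tid + 1) % n];
    __syncthreads();  // every thread is done with the integer view

    // Stage 2: reuse the same bytes as floats.
    stage2_floats[tid] = 0.5f * neighbour;
    __syncthreads();

    out[tid] = stage2_floats[n - 1 - tid];
}

// Hypothetical host helper: one block of n threads, input in[i] = 2 * i.
std::vector<float> run_two_stage(int n)
{
    assert(n > 0 && n <= 1024);
    std::vector<int> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = 2 * i;

    int   *d_in  = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // int and float have the same size here, so one size covers both views.
    size_t smem_bytes = n * sizeof(float);
    two_stage<<<1, n, smem_bytes>>>(d_in, d_out);

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling run_two_stage(8) from a host main() should give out[tid] = (8 - tid) % 8. Note the second __syncthreads(): without it, a fast thread could overwrite bytes as float while a slower thread is still reading them as int.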
But yes, if you need many dynamic shared variables of mixed types at runtime, you’ll end up casting.
And you can mix types just by casting with offsets. The programming guide gives a common example.
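A minimal sketch in the spirit of the programming guide example follows. It is not a copy of it: the names smem_pool, mixed_types and run_mixed_types are invented here, and it assumes one block of n threads with n floats followed by n ints in the same allocation.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One dynamic shared allocation, sized at launch time.
extern __shared__ float smem_pool[];

__global__ void mixed_types(float *out)
{
    int tid = threadIdx.x;
    int n   = blockDim.x;

    // Carve the pool: n floats first, then n ints right behind them.
    float *vals = smem_pool;
    int   *perm = reinterpret_cast<int *>(&vals[n]);

    vals[tid] = 1.5f * tid;   // float region
    perm[tid] = n - 1 - tid;  // int region: a reversed index table
    __syncthreads();

    out[tid] = vals[perm[tid]];
}

// Hypothetical host helper: launches one block of n threads.
std::vector<float> run_mixed_types(int n)
{
    assert(n > 0 && n <= 1024);
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    // The launch must request room for BOTH regions.
    size_t smem_bytes = n * sizeof(float) + n * sizeof(int);
    mixed_types<<<1, n, smem_bytes>>>(d_out);

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

The ints are stored as real ints here, not converted to float values; the cast only changes how the pointer into the pool is typed.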
This is only for dynamic shared memory. Statically defined shared memory variables don’t need to have such casting games.
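For comparison, a small sketch of the static case, where each array simply gets its own declaration and type. The names static_mixed, run_static_mixed and the size N = 64 are assumptions for illustration only.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 64;  // compile-time size, required for static shared arrays

__global__ void static_mixed(float *out)
{
    // Separate, separately typed static shared arrays: no casts, no offsets.
    __shared__ int   idx[N];
    __shared__ float vals[N];

    int tid = threadIdx.x;
    idx[tid]  = N - 1 - tid;
    vals[tid] = 2.0f * tid;
    __syncthreads();

    out[tid] = vals[idx[tid]];
}

// Hypothetical host helper: launches one block of N threads.
std::vector<float> run_static_mixed()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, N * sizeof(float));

    // No third launch parameter: nothing is allocated dynamically.
    static_mixed<<<1, N>>>(d_out);

    std::vector<float> h_out(N);
    cudaMemcpy(h_out.data(), d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    assert(h_out.size() == static_cast<size_t>(N));
    return h_out;
}
```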
I don’t know what that Kirk and Hwu example is trying to do. I don’t have the book so I can’t judge further, but as you point out the missing typename is already a mistake.
One thing that the example from the programming guide does not show, but I would consider a best practice, is to order the multiple allocations created from dynamic shared memory by element-type size (largest first), to avoid misalignment issues when casting pointers to create a base address for subsequent arrays (at least the Programming Guide points out that risk).
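As a rough sketch of that ordering rule, the kernel below carves a double, an int and an unsigned char array out of one pool, widest type first. The names ordered_pool, ordered_carve and run_ordered_carve are hypothetical, and the example assumes one block of n threads with n elements per array.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Declaring the pool with the widest element type keeps the base aligned for it.
extern __shared__ double ordered_pool[];

__global__ void ordered_carve(double *out)
{
    int tid = threadIdx.x;
    int n   = blockDim.x;

    // Largest element type first: every sub-array then starts at an offset
    // that is a multiple of its own element size, whatever n is.
    double        *d = ordered_pool;                                // 8-byte elements
    int           *i = reinterpret_cast<int *>(d + n);              // 4-byte elements
    unsigned char *c = reinterpret_cast<unsigned char *>(i + n);    // 1-byte elements
    // With the chars first and n = 3, the ints would start at byte offset 3,
    // which is not a multiple of 4 and would be a misaligned access.

    d[tid] = 0.25 * tid;
    i[tid] = 10 * tid;
    c[tid] = static_cast<unsigned char>(tid & 0xFF);
    __syncthreads();

    int j = n - 1 - tid;
    out[tid] = d[j] + i[j] + c[j];
}

// Hypothetical host helper: launches one block of n threads.
std::vector<double> run_ordered_carve(int n)
{
    assert(n > 0 && n <= 1024);
    double *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(double));

    size_t smem_bytes = n * (sizeof(double) + sizeof(int) + sizeof(unsigned char));
    ordered_carve<<<1, n, smem_bytes>>>(d_out);

    std::vector<double> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

An odd n such as 3 is the interesting case, since that is where a smallest-first layout would break.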
Register blocking is certainly eminently worthy of consideration with today’s register-rich GPU architectures.
I have not paid attention to it in a while, but it used to be the case that CUDA allowed programmers to re-configure the shared memory to a minimum size (though not to zero), making the balance of the memory available for L1 caching:
“Applications no longer need to select a preference of the L1/shared split for optimal performance. For purposes of backward compatibility with Fermi and Kepler, applications may optionally continue to specify such a preference, but the preference will be ignored on Maxwell, with the full 64 KB per SMM always going to shared memory.”
A question about register resources on a per-thread basis: when carving shared memory into smaller subregions to hold multiple shared variables, do the pointers to those regions count toward the thread's register count?
Yes, they do. They are normal address calculations like any other pointer arithmetic. There are no special mechanisms involved, other than the one that transparently allocates one contiguous shared memory region to each block.
Note that the compiler may optimize pointers to shared memory by using a memory-space specific pointer that can fit into a single 32-bit register, as the shared memory address space is known to be smaller than 4 GB. On 64-bit targets (used by the vast majority of CUDA apps at this point, I would think), a generic pointer requires 64 bits (= a register pair).
Pointer assignments to registers are optimized based on liveness just like all other data, so registers are only allocated while a pointer is in use.