Here are some examples that don’t use shared memory: 1 2 3
Those examples pass registers as the operands for mma. Of course, register data has to come from somewhere. So if you want to, you can load data into registers from global, local, or shared memory, and then pass those registers directly to the PTX mma instruction, more or less as the examples I linked indicate.
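To make the "global memory straight into registers" path concrete, here is a minimal sketch of my own (it is not taken from the linked examples). It assumes an sm_80 or newer GPU, the m16n8k16 f16 shape with f32 accumulation, A stored row-major and B stored column-major in global memory, and the per-thread fragment layout described in the PTX documentation for that shape. The kernel name `mma_from_global` and the test data are made up for illustration.

```cuda
// Assumed build line: nvcc -arch=sm_80 mma_global.cu -o mma_global
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>

// One warp computes D = A * B for a single tile (hypothetical kernel name).
// A: 16x16 row-major half, B: 16x8 column-major half, D: 16x8 row-major float.
__global__ void mma_from_global(const half *A, const half *B, float *D)
{
    int lane = threadIdx.x % 32;
    int g = lane >> 2;  // "groupID" in the PTX fragment description
    int t = lane & 3;   // "threadID_in_group"

    // Each 32-bit operand register holds two adjacent half values,
    // so view global memory as 32-bit words and load directly into registers.
    const uint32_t *A32 = reinterpret_cast<const uint32_t *>(A);
    const uint32_t *B32 = reinterpret_cast<const uint32_t *>(B);

    uint32_t a[4], b[2];
    a[0] = A32[g * 8 + t];            // A row g,   cols 2t, 2t+1
    a[1] = A32[(g + 8) * 8 + t];      // A row g+8, cols 2t, 2t+1
    a[2] = A32[g * 8 + t + 4];        // A row g,   cols 2t+8, 2t+9
    a[3] = A32[(g + 8) * 8 + t + 4];  // A row g+8, cols 2t+8, 2t+9
    b[0] = B32[g * 8 + t];            // B col g, rows 2t, 2t+1
    b[1] = B32[g * 8 + t + 4];        // B col g, rows 2t+8, 2t+9

    float c[4] = {0.f, 0.f, 0.f, 0.f};  // accumulator, also receives the result
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));

    // Write the accumulator registers straight back to global memory.
    D[g * 8 + 2 * t]           = c[0];
    D[g * 8 + 2 * t + 1]       = c[1];
    D[(g + 8) * 8 + 2 * t]     = c[2];
    D[(g + 8) * 8 + 2 * t + 1] = c[3];
}

int main()
{
    const int M = 16, N = 8, K = 16;
    float fA[M * K], fB[K * N], ref[M * N], hD[M * N];
    half hA[M * K], hB[K * N];

    // Small integer values so half inputs and float sums are exact.
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) {
            fA[i * K + k] = float((i + k) % 5);
            hA[i * K + k] = __float2half(fA[i * K + k]);      // row-major
        }
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n) {
            fB[k * N + n] = float((k * n) % 3) - 1.0f;
            hB[n * K + k] = __float2half(fB[k * N + n]);      // column-major
        }
    // CPU reference result.
    for (int i = 0; i < M; ++i)
        for (int n = 0; n < N; ++n) {
            ref[i * N + n] = 0.f;
            for (int k = 0; k < K; ++k)
                ref[i * N + n] += fA[i * K + k] * fB[k * N + n];
        }

    half *dA, *dB;
    float *dD;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dD, sizeof(hD));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    mma_from_global<<<1, 32>>>(dA, dB, dD);  // exactly one warp
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hD, dD, sizeof(hD), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < M * N; ++i)
        if (hD[i] != ref[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dA); cudaFree(dB); cudaFree(dD);
    return mismatches != 0;
}
```

No `__shared__` array appears anywhere in the kernel: every operand register is filled by an ordinary global load. Whether that is a good idea for performance is a separate question, since each element is loaded exactly once here and a real GEMM would normally want reuse across warps, which is where shared memory usually comes in.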
wmma is the earliest exposure of tensor core ops, e.g. in the V100 timeframe. As tensor core functionality added variety, a new instruction format (mma) was added. An example of a difference is given here