I noticed that:
Similar to the Volta architecture, the amount of the unified data cache reserved for shared memory is configurable on a per kernel basis. For the NVIDIA Ampere GPU architecture, the unified data cache has a size of 192 KB for devices of compute capability 8.0 and 128 KB for devices of compute capability 8.6. The shared memory capacity can be set to 0, 8, 16, 32, 64, 100, 132 or 164 KB for devices of compute capability 8.0, and to 0, 8, 16, 32, 64 or 100 KB for devices of compute capability 8.6.
So on an A100 (compute capability 8.0), if my thread blocks use 72 KB of shared memory in total, will 100 KB be allocated to shared memory, leaving 28 KB idle?
Correct. If the shared memory allocated by the thread blocks on the SM sums to 72 KiB and the SHMEM/L1 configuration is 100 KiB, then 28 KiB of shared memory is unallocated.
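To make the arithmetic concrete, here is a minimal host-side sketch in plain C++. It assumes the configuration chosen is the smallest listed capacity that fits the request, which matches the 72 KB case in this thread; in practice the driver makes the choice, and it may pick a larger capacity. The helper names `smallest_fitting_carveout` and `unallocated_kb` are hypothetical and are not part of the CUDA API.

```cpp
#include <array>
#include <cassert>

// Shared memory capacities (KB per SM) quoted above for compute capability 8.0.
constexpr std::array<int, 8> kCarveoutsCc80 = {0, 8, 16, 32, 64, 100, 132, 164};

// Hypothetical helper: smallest listed capacity that is >= requested_kb,
// or -1 if the request exceeds every listed capacity.
// Assumption: the smallest fitting configuration is the one selected.
int smallest_fitting_carveout(int requested_kb) {
    assert(requested_kb >= 0);
    for (int capacity : kCarveoutsCc80) {
        if (capacity >= requested_kb) {
            return capacity;
        }
    }
    return -1;
}

// Hypothetical helper: shared memory (KB) left unallocated on the SM under
// that configuration, or -1 if no listed capacity fits.
int unallocated_kb(int requested_kb) {
    const int capacity = smallest_fitting_carveout(requested_kb);
    return capacity < 0 ? -1 : capacity - requested_kb;
}
```

Under that assumption, `smallest_fitting_carveout(72)` returns 100 and `unallocated_kb(72)` returns 28, matching the answer above.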