Each kernel generates its own set of information. I am not sure why you have so many entries; I would say you have 7 kernels. Are you using some CUDA library in addition to your own code?
The number of registers is per thread, while the shared memory is per block. These numbers let you work out, for a given launch configuration, the resources needed and the maximum theoretical occupancy.
Libraries are usually composed of many kernels, and you get information for each individual kernel.
Shared memory is allocated per block, and an MP (multiprocessor) can have more than one active block. For compute capability 3.5 you can have 2048 active threads per MP (the maximum number of threads per block is 1024). If your block size is 512, you need 4 blocks resident on one MP for maximum occupancy, so the registers (allowing some spilling, maybe) and the shared memory used by those 4 blocks together have to fit within what one MP provides.
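To make that arithmetic concrete, here is a rough host-side sketch of the theoretical occupancy calculation. The names `SmLimits`, `activeBlocksPerSm` and `occupancy` are made up for this illustration. The limits are the ones I believe apply to compute capability 3.5 (2048 threads, 16 blocks, 65536 registers per MP), and the 48 KB shared memory figure assumes that cache configuration is selected. It also ignores the granularity with which the hardware allocates registers and shared memory, so treat the result as an upper estimate.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits, assumed here for compute capability 3.5.
struct SmLimits {
    int maxThreads = 2048;       // resident threads per MP
    int maxBlocks = 16;          // resident blocks per MP
    int registers = 65536;       // 32-bit registers per MP
    int sharedMemBytes = 49152;  // assumes the 48 KB shared memory configuration
};

// Number of blocks that can be resident on one MP: the tightest of the
// thread, register, shared memory and block-count limits.
// Simplification: allocation granularity is ignored.
int activeBlocksPerSm(const SmLimits& sm, int blockSize,
                      int regsPerThread, int smemPerBlock) {
    assert(blockSize > 0 && blockSize <= 1024);
    int byThreads = sm.maxThreads / blockSize;
    int byRegs = regsPerThread > 0
                     ? sm.registers / (regsPerThread * blockSize)
                     : sm.maxBlocks;
    int bySmem = smemPerBlock > 0
                     ? sm.sharedMemBytes / smemPerBlock
                     : sm.maxBlocks;
    return std::min({sm.maxBlocks, byThreads, byRegs, bySmem});
}

// Theoretical occupancy: resident threads divided by the MP maximum.
double occupancy(const SmLimits& sm, int blockSize,
                 int regsPerThread, int smemPerBlock) {
    int blocks = activeBlocksPerSm(sm, blockSize, regsPerThread, smemPerBlock);
    return static_cast<double>(blocks * blockSize) / sm.maxThreads;
}
```

With a block size of 512, a kernel using 32 registers per thread and 12 KB of shared memory per block fits 4 blocks (occupancy 1.0). At 40 registers per thread only 3 blocks fit, which gives 0.75. The per-thread and per-block numbers to plug in are the ones the compiler reports for each kernel. If your toolkit is recent enough, the runtime function cudaOccupancyMaxActiveBlocksPerMultiprocessor should give you the exact figure for a real kernel.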
For the ready-made libraries I would not worry. These should already be optimized.