Analyzing kernel performance: interpreting nvcc and profiler output

When I compile my kernels with --ptxas-options=-v, I get the following output:

ptxas info : Compiling entry function ‘_globfunc__Z32cuda_calculate_variable_messagesPfPvPjPiS2_S0
ptxas info : Used 15 registers, 1872+1868 bytes smem, 72 bytes cmem[1]
ptxas info : Compiling entry function ‘__globfunc__Z30cuda_calculate_factor_messagesPfS_PvS0_PjPiS0_ii’
ptxas info : Used 24 registers, 1116+1108 bytes smem, 32 bytes cmem[1]

Does this mean that my kernels use 1872 and 1116 bytes of shared memory respectively, or 3740 and 2224 bytes respectively (the sums of the two numbers)? What does the “+” mean?

Also, when I turn on CUDA_PROFILE and record gld_incoherent and gld_coherent, I get far more incoherent than coherent loads, whereas I would expect the opposite. Am I correct in interpreting these as the total number of loads in an average warp? If one thread performs a load and the rest do not, does this count as incoherent? What if half of the threads access consecutive addresses in global memory while the other half do nothing? The manual suggests that such an access pattern should mostly coalesce on all architectures (unless I’m misremembering). I don’t suppose there’s some clever way to figure out which loads are incoherent…
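
One possible way to narrow down which loads fail to coalesce is to replay the per-thread addresses of a suspect load on the host. The sketch below is host-side C++, not the profiler's actual counting logic. It assumes the strict rule described in the programming guide for compute capability 1.0/1.1 devices: within a half-warp of 16 threads, thread k must read the k-th word of a segment aligned to 16 times the word size, and threads that issue no load do not break coalescing. The names `half_warp_coalesces`, `make_pattern`, and `kInactive` are hypothetical helpers introduced for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for a thread that issues no load (e.g. masked off by a branch).
constexpr std::int64_t kInactive = -1;

// Returns true if the byte addresses issued by one half-warp (16 threads)
// satisfy the strict rule ASSUMED here for compute capability 1.0/1.1:
// thread k reads the k-th word of a segment aligned to 16 * word_size bytes.
// Inactive threads are allowed and do not prevent coalescing.
bool half_warp_coalesces(const std::vector<std::int64_t>& addr,
                         std::size_t word_size) {
    if (addr.size() != 16) return false;
    if (word_size != 4 && word_size != 8 && word_size != 16) return false;

    const std::int64_t ws = static_cast<std::int64_t>(word_size);
    const std::int64_t segment = 16 * ws;
    bool have_base = false;
    std::int64_t base = 0;

    for (std::size_t k = 0; k < 16; ++k) {
        if (addr[k] == kInactive) continue;
        // Segment start implied by thread k reading the k-th word.
        const std::int64_t implied = addr[k] - static_cast<std::int64_t>(k) * ws;
        if (implied < 0 || implied % segment != 0) return false;  // misaligned
        if (have_base && implied != base) return false;  // not sequential
        base = implied;
        have_base = true;
    }
    return true;  // a half-warp with no active threads trivially passes
}

// Builds a pattern in which threads 0..active_threads-1 read
// base + k * stride_bytes and the remaining threads do nothing.
std::vector<std::int64_t> make_pattern(std::int64_t base,
                                       std::int64_t stride_bytes,
                                       int active_threads) {
    std::vector<std::int64_t> addr(16, kInactive);
    for (int k = 0; k < active_threads && k < 16; ++k) {
        addr[k] = base + k * stride_bytes;
    }
    return addr;
}
```

Under this assumed rule, `make_pattern(256, 4, 8)` (half of the threads reading consecutive 4-byte words from an aligned base, the rest idle) passes the check, as does a lone thread 0 reading an aligned address. A base of 260, or an 8-byte stride with 4-byte words, fails. Feeding the index expressions from a real kernel into a check like this, one load at a time, may help identify which loads the gld_incoherent counter is picking up.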