The Fermi and Kepler architectures support counters for assessing the efficiency of your memory accesses to L1 and for detecting bank conflicts in shared memory. The relevant nvprof metrics are:
shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed
local_replay_overhead: Average number of replays due to local memory cache misses for each instruction executed
global_cache_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load
gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
local_load_transactions: Number of local memory load transactions
local_store_transactions: Number of local memory store transactions
shared_load_transactions: Number of shared memory load transactions
shared_store_transactions: Number of shared memory store transactions
gld_transactions: Number of global memory load transactions
gst_transactions: Number of global memory store transactions
If you run the Visual Profiler memory analysis and any of the transactions-per-request values are high, the analysis will give you a link to the source line responsible for the memory operation.
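To get a feel for what a "high" shared memory transactions-per-request value means, here is a rough sketch of how bank conflicts turn one warp-wide request into several transactions. It is a simplified model, not the profiler's actual counting logic: it assumes 32 banks that are each 4 bytes wide and a warp of 32 threads, and the function names are made up for illustration. Kepler's optional 8-byte bank mode is not modeled.

```python
from collections import defaultdict

# Assumptions for this simplified model (not taken from the profiler):
NUM_BANKS = 32    # shared memory banks
BANK_WIDTH = 4    # bytes per bank word
WARP_SIZE = 32    # threads issuing one request together


def shared_transactions_per_request(byte_addresses):
    """Estimate the transactions needed to serve one warp-wide request.

    Threads that touch the same 4-byte word are served together (broadcast),
    so only distinct words that fall into the same bank are serialized.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    # The busiest bank determines how many passes the request needs.
    return max(len(words) for words in words_per_bank.values())


def strided_addresses(stride_words):
    """Byte addresses for thread tid reading 4-byte element tid * stride_words."""
    return [tid * stride_words * BANK_WIDTH for tid in range(WARP_SIZE)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        n = shared_transactions_per_request(strided_addresses(stride))
        print(f"stride {stride:2d} words -> {n} transaction(s) per request")
```

Under these assumptions a stride of 1 or 33 words needs a single transaction, a stride of 2 needs two, and a stride of 32 serializes the whole warp into 32 transactions. The profiler metric is an average over all requests in the kernel, so a value close to 1 suggests the accesses are mostly conflict-free.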