If you are on Windows, the Nsight Visual Studio Edition CUDA profiler supports collection of source-correlated counters that accurately report inst_executed, thread_inst_executed, not_predicated_off_thread_inst_executed, branch_executed, branch_taken, divergent_branch_executed, and many memory statistics per SASS instruction. These counter values are rolled up to PTX and high-level source code.
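To make the difference between the warp-level and thread-level counters concrete, here is a rough host-side sketch in Python. It is a simplified model, not profiler code: it assumes inst_executed counts one per warp instruction, thread_inst_executed counts the active threads of each warp instruction, and not_predicated_off_thread_inst_executed additionally drops threads whose predicate is false. The function name and the mask representation are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp

def count_instructions(active_masks, predicate_masks):
    """Model three counters for a sequence of warp instructions.

    active_masks[i]    -- bitmask of threads active for instruction i
    predicate_masks[i] -- bitmask of threads whose predicate is true
    (hypothetical inputs; a real profiler collects these in hardware)
    """
    inst_executed = len(active_masks)
    thread_inst_executed = sum(bin(a).count("1") for a in active_masks)
    not_predicated_off = sum(
        bin(a & p).count("1") for a, p in zip(active_masks, predicate_masks)
    )
    return inst_executed, thread_inst_executed, not_predicated_off

FULL = (1 << WARP_SIZE) - 1   # all 32 threads
HALF = (1 << 16) - 1          # lower 16 threads (divergent path)
EIGHT = (1 << 8) - 1          # lower 8 threads

# Three warp instructions: full warp, divergent half warp,
# full warp where only 8 threads pass the predicate.
counts = count_instructions([FULL, HALF, FULL], [FULL, FULL, EIGHT])
efficiency = counts[1] / (counts[0] * WARP_SIZE)
print(counts, efficiency)
```

Under this model, a ratio of thread_inst_executed to inst_executed * 32 well below 1 on a given source line suggests divergence at that line.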
The Fermi and Kepler architectures support counters for assessing the efficiency of your memory accesses through L1 and for detecting bank conflicts in shared memory. The relevant nvprof metrics are:
shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed
global_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed
global_cache_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load
gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
local_load_transactions: Number of local memory load transactions
local_store_transactions: Number of local memory store transactions
shared_load_transactions: Number of shared memory load transactions
shared_store_transactions: Number of shared memory store transactions
gld_transactions: Number of global memory load transactions
gst_transactions: Number of global memory store transactions
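As a rough illustration of what shared_load_transactions_per_request measures, the following Python sketch models bank conflicts for one warp. It assumes 32 banks that are 4 bytes wide (the Fermi layout; Kepler can also use 8-byte banks) and that threads reading the same word are served by a broadcast. The function name is hypothetical and the model ignores everything else the hardware does.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumed bank count
BANK_WIDTH = 4   # assumed bank width in bytes

def shared_transactions(byte_addresses):
    """Estimate transactions for one warp-wide shared memory request.

    The request is replayed once per distinct word that maps to the
    most heavily used bank; identical words count once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

threads = range(32)
print(shared_transactions([4 * t for t in threads]))        # stride of 1 float
print(shared_transactions([8 * t for t in threads]))        # stride of 2 floats
print(shared_transactions([128 * t for t in threads]))      # stride of 32 floats
print(shared_transactions([0 for _ in threads]))            # same word for all
```

In this model a conflict-free request costs 1 transaction, and a value above 1 corresponds to the replays counted by shared_replay_overhead.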
If you run the Visual Profiler memory analysis and any of the transactions-per-request values are high, the analysis will give you a link to the source line responsible for the memory operation.
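To get a feel for what a "high" gld_transactions_per_request value looks like, here is a small Python model of global load coalescing for a single warp. It assumes every load is served in 128-byte cache-line transactions, as with L1-cached loads on Fermi; other configurations use different segment sizes. The function name and parameters are illustrative only.

```python
LINE_BYTES = 128  # assumed transaction (cache line) size

def global_transactions(byte_addresses, access_size=4):
    """Count the distinct 128-byte lines touched by one warp request."""
    lines = set()
    for addr in byte_addresses:
        lines.add(addr // LINE_BYTES)                      # first byte
        lines.add((addr + access_size - 1) // LINE_BYTES)  # last byte
    return len(lines)

threads = range(32)
print(global_transactions([4 * t for t in threads]))        # coalesced floats
print(global_transactions([4 + 4 * t for t in threads]))    # misaligned by 4 bytes
print(global_transactions([128 * t for t in threads]))      # stride of 32 floats
```

Under these assumptions a fully coalesced 4-byte load needs 1 transaction per request, a misaligned one needs 2, and a stride of 32 floats needs 32, which is the kind of value the memory analysis would flag.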