Which nvprof metrics related to memory loads are most useful in profiling?

I've been using nvprof to profile a mostly memory-bound application and am trying to determine the most relevant metrics:

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference

In general the application is limited by somewhat random reads from sections of a 2.4 GB input buffer, which I am attempting to reorder and force through the fast read-only cache.
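For reference, the two standard ways to route loads through the read-only (texture) cache on a Maxwell part like the TITAN X are the `const __restrict__` qualifiers and an explicit `__ldg()`. A minimal sketch (the kernel name and parameters here are hypothetical, not from the application above):

```cuda
// Hypothetical gather kernel: const + __restrict__ tell the compiler the
// buffer is read-only for the kernel's lifetime, allowing LDG loads.
__global__ void gather(const float* __restrict__ in,
                       const int*   __restrict__ idx,
                       float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg() forces this particular load through the
        // read-only data cache explicitly.
        out[i] = __ldg(&in[idx[i]]);
    }
}
```

With random indices the read-only cache helps mainly when nearby threads reuse the same cache lines; it does not by itself fix poor coalescing.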

There are so many metrics available that I am hoping to narrow them down to a small subset that gets at the heart of the non-coalesced read issue.

One metric I already ran was L2 Cache Utilization, which looked like this:

==6552== Profiling result:
==6552== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX TITAN X (0)"
        Kernel: compress_y_half(float const *, __half2*, int)
          8                            l2_utilization                      L2 Cache Utilization     Mid (5)     Mid (5)     Mid (5)
        Kernel: sum_buffers_512(float4 const *, float4*)
          1                            l2_utilization                      L2 Cache Utilization     Mid (4)     Mid (4)     Mid (4)
        Kernel: simple_back_512(float const *, __half2 const *, float2*, float, float, int, int)
         64                            l2_utilization                      L2 Cache Utilization     Mid (5)    High (8)     Mid (6)
Device "GeForce GTX TITAN X (1)"
        Kernel: compress_y_half(float const *, __half2*, int)
         15                            l2_utilization                      L2 Cache Utilization     Mid (5)     Mid (5)     Mid (5)
        Kernel: simple_back_512(float const *, __half2 const *, float2*, float, float, int, int)
         60                            l2_utilization                      L2 Cache Utilization     Mid (5)    High (8)     Mid (6)

Fortunately my main kernel has the best average result at 6, but there is still much room for improvement.

In general I would like to be able to profile new versions of the kernels using a small set of metrics which apply to optimizing loads, many of which are conditional loads.

My occupancy, compute throughput, and writes are as good as they can get (as far as I can tell), so I need to perform surgery on the disparate loads.

Any advice on which metrics to use and how to interpret the results would be appreciated!

If you want to assess non-coalesced reads, I would start with gld_efficiency for global loads, and shared_efficiency for shared loads. Texture accesses must be considered separately.
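To make concrete what gld_efficiency is reporting, here is a hypothetical pair of access patterns (not from the application above): the strided kernel would show low global load efficiency, while the contiguous one approaches 100%.

```cuda
// Hypothetical illustration of what gld_efficiency captures.
__global__ void strided_read(const float* in, float* out, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];  // neighbouring threads in a warp hit
                                  // distant addresses: many transactions
                                  // per warp load, low gld_efficiency
}

__global__ void coalesced_read(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];           // a warp touches one contiguous span:
                                  // gld_efficiency approaches 100%
}
```

Running both under `nvprof --metrics gld_efficiency` and comparing the numbers gives a quick feel for how far a given kernel's load pattern is from ideal.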