What does the "shared_efficiency" really mean?

I wrote a CUDA program to accelerate matrix multiplication.

Profiling with nvprof gave the following results:

gst_efficiency: 100%
gld_efficiency: 100%
shared_efficiency: 30.77%
shared_st_bank_conflicts: 0
shared_ld_bank_conflicts: 0

I wonder why shared_efficiency is not 100% when no bank conflicts exist.

Is it caused by broadcasts when threads access the same address in shared memory (SMEM)?

Does anyone know how to achieve 100% shared_efficiency?

I am sure you mean bank conflicts, not back conflicts ;)

To be honest, I have no idea how shared_efficiency can be low without any bank conflicts showing up.

Maybe the bank conflict performance counter was not available during the profiling run, or was not enabled?

Do any of these performance counters show > 0 values?

shared_replay_overhead
shared_load_replay
shared_store_replay
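On devices where these are exposed, you can list what the profiler supports and then collect the counters above. A sketch of the nvprof invocations (`./your_app` is a placeholder for your binary; depending on the device, shared_load_replay and shared_store_replay may be listed as events rather than metrics):

```shell
# See which shared-memory metrics/events exist on this device
nvprof --query-metrics | grep -i shared
nvprof --query-events  | grep -i shared

# Collect the replay counters mentioned above
nvprof --metrics shared_replay_overhead ./your_app
nvprof --events shared_load_replay,shared_store_replay ./your_app
```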

Thanks for the correction.

But those metrics do not exist on compute capability 6.x devices, as documented here:

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-6x

shared_efficiency = Ratio of requested shared memory throughput to required shared memory throughput expressed as percentage

The numerator is collected using a shader patch that determines the number of bytes requested. This takes into account whether threads are active (and, I believe, predicated true).

The denominator is the total number of cycles in which data is returned from shared memory, multiplied by the width of the return interface.

On the Kepler architecture the shared memory return path is 256B wide, which can only be fully utilized if the kernel runs in 8B bank mode and executes 64-bit or wider accesses. If the kernel runs in 4B bank mode, the maximum efficiency may be limited to 50%. On Maxwell through Turing architectures the bank mode is fixed and all instruction widths should be able to achieve full efficiency.
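A minimal sketch of what that looks like on Kepler, assuming a hypothetical kernel launched with 256 threads per block: the host opts into 8-byte banks via the CUDA runtime's `cudaDeviceSetSharedMemConfig`, and each thread issues 64-bit shared-memory accesses so a warp's 32 x 8B = 256B request can match the full return-path width.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores and loads one 64-bit word
// in shared memory. With 8-byte banks (Kepler), one warp's request
// covers 32 x 8B = 256B, the full shared-memory return-path width.
__global__ void wide_smem_read(const double *in, double *out, int n)
{
    __shared__ double tile[256];  // sized for 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];         // 64-bit store to SMEM
        __syncthreads();
        out[i] = tile[threadIdx.x] * 2.0;  // 64-bit load from SMEM
    }
}

int main(void)
{
    // Request 8-byte shared-memory banks. This takes effect on Kepler;
    // Maxwell and later use a fixed bank size and ignore the hint.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    // ... allocate device buffers and launch wide_smem_read here ...
    return 0;
}
```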