Why is my global load efficiency always 50%?

Well, I’m coding a copy function as follows:

__global__ void copy(float *A, float *B, const int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        B[i] = A[i];
}

There are 16777216 numbers; the gridDim is (32768,1) and the blockDim is (512,1).
My GPU is a GTX 1080.
I compiled it with -Xptxas -dlcm=ca, which means global loads go through the L1 cache.
I’m sure that my access to global memory is aligned.
Why is the global load efficiency always 50% when I run nvprof --devices 0 --metrics gld_efficiency?
Is it the compute capability, or something else?
My boss wants me to explain why.
Thanks to anyone who can help.

What is your bandwidth utilization? I.e. (loaded_data + written_data) / time; in this case (2 × 16777216 × 4) bytes / time.

If it’s 50% of peak, then it’s not an unreasonable number.

Here is an example where we employed some different optimization strategies to get a high utilization:
https://devtalk.nvidia.com/default/topic/520792/speedy-general-reduction-sum-code-88-5-of-peak-updated-for-kepler-33-__shfl-etc-/

Thank you for your reply, Jimmy my friend.

I’m confused about increasing the bandwidth utilization.

The 2 × 16777216 × 4 bytes can’t be changed, and I can’t change the blockDim or gridDim in the way you mentioned in your link.

So the only way to increase bandwidth is to shorten the time.

But when I compiled the code without L1 cache, the efficiency became 100%; however, the running time was not significantly shortened — certainly not halved.

Isn’t the global memory access pattern the only thing that significantly affects global load efficiency?

I don’t think it is the bandwidth utilization problem you mentioned.

nvvp also says that:

“if the code has surface loads then the metric will report lower values than the actual efficiency.”

What is this all about?

And what is meant by “surface loads”?

Does anyone know?

I’d appreciate any help.

Please let me know what your kernel run-time is as reported by nvprof.

Some general hints to improve bandwidth utilization (no promises):

-> Use the maximum load size per thread of 128 bits (i.e. 16 bytes/load; you’re currently doing 4 bytes/load).
-> Change float -> float4.
-> Try performing multiple loads per thread.

Here is a great NVIDIA resource:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
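The float4 and multiple-loads hints above might look roughly like this (a sketch only — the kernel name, pointer names, and launch configuration are assumptions, not code from this thread):

```cuda
// Vectorized copy: each thread moves 16 bytes (one float4) per load,
// and a grid-stride loop lets each thread perform multiple loads.
__global__ void copy_float4(const float4 *A, float4 *B, int n4)
{
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n4;
         i += gridDim.x * blockDim.x)
    {
        B[i] = A[i];  // one 128-bit load + one 128-bit store
    }
}

// Launch sketch: 16777216 floats = 4194304 float4 elements.
// Fewer blocks than elements, so each thread loops several times:
// copy_float4<<<8192, 256>>>((const float4 *)dA, (float4 *)dB, 16777216 / 4);
```

Note that this requires the underlying buffers to be 16-byte aligned, which cudaMalloc allocations are.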