Weird performance on the GPU

I made two programs for testing.

  1. Each thread reads 2048 bytes, adds them up, and returns the result.
  2. Each thread reads 1 byte if threadIdx % 2 == 0,
    or 2048 bytes if threadIdx % 2 == 1.

I calculate bandwidth as bytes read / elapsed time.
I use 768 * 16 threads (dimgrid = 16, dimblock = 768).
I thought that program 1's performance would be higher than program 2's,
because a thread that reads 1 byte has to wait until the threads that read 2048 bytes finish.
But the performance is similar, and sometimes program 2 is even better.
I cannot explain this with what I know. Can you help me?
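
To make the bandwidth figure concrete, here is a small Python sketch of the calculation described above (bytes read / elapsed time) for both programs. The thread counts and read sizes are the ones from this post; the elapsed time is a hypothetical placeholder, not a measurement, and the function names are made up for illustration.

```python
# Launch configuration and per-thread read sizes taken from the post.
BLOCKS = 16               # dimgrid
THREADS_PER_BLOCK = 768   # dimblock
LARGE_READ = 2048         # bytes read by a "heavy" thread
SMALL_READ = 1            # bytes read by a "light" thread


def total_bytes_program1():
    """Program 1: every thread reads 2048 bytes."""
    return BLOCKS * THREADS_PER_BLOCK * LARGE_READ


def total_bytes_program2():
    """Program 2: even threadIdx reads 1 byte, odd threadIdx reads 2048 bytes."""
    total = 0
    for _block in range(BLOCKS):
        for tid in range(THREADS_PER_BLOCK):
            total += SMALL_READ if tid % 2 == 0 else LARGE_READ
    return total


def bandwidth_gb_per_s(num_bytes, elapsed_seconds):
    """Bandwidth as bytes read / elapsed time, expressed in GB/s."""
    return num_bytes / elapsed_seconds / 1e9


if __name__ == "__main__":
    # Hypothetical elapsed time, used only to show the arithmetic.
    elapsed = 1e-3
    for name, nbytes in (("program 1", total_bytes_program1()),
                         ("program 2", total_bytes_program2())):
        print(name, nbytes, "bytes,", bandwidth_gb_per_s(nbytes, elapsed), "GB/s")
```

Note that program 2 reads only about half as many bytes in total, so the two bandwidth numbers are only comparable if each one is divided by its own byte count.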

If you are on a compute capability 2.x device, both kernels will be memory bandwidth limited. Thus it does not matter how the results are processed once they arrive on the SM.
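
A rough way to see this, as a sketch rather than a model of the real hardware: if a kernel is purely bandwidth limited, its elapsed time is just bytes moved divided by the sustained memory throughput, so bytes / time comes out the same no matter how the reads are split between threads. The sustained throughput below is a hypothetical number, and the model ignores launch overhead and memory transaction granularity.

```python
def elapsed_if_memory_bound(num_bytes, sustained_bytes_per_s):
    """Elapsed time when memory throughput is the only limit (idealized)."""
    return num_bytes / sustained_bytes_per_s


def measured_bandwidth(num_bytes, elapsed_seconds):
    """Bandwidth as the original post computes it: bytes read / elapsed time."""
    return num_bytes / elapsed_seconds


# Hypothetical sustained throughput in bytes per second (an assumption).
SUSTAINED = 100e9

# Byte counts for the two test programs (16 blocks of 768 threads).
bytes_program1 = 16 * 768 * 2048            # every thread reads 2048 bytes
bytes_program2 = 16 * 384 * (1 + 2048)      # half read 1 byte, half read 2048

if __name__ == "__main__":
    for name, nbytes in (("program 1", bytes_program1),
                         ("program 2", bytes_program2)):
        t = elapsed_if_memory_bound(nbytes, SUSTAINED)
        print(name, "time:", t, "s, bandwidth:", measured_bandwidth(nbytes, t) / 1e9, "GB/s")
```

Under this assumption program 2 finishes sooner because it moves fewer bytes, but the bandwidth computed from it is about the same as for program 1.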

Hmm, I don't know what you mean. Memory bandwidth is about 120 GB/s on a 2.x GPU.

But I cannot get anywhere near that bandwidth, so I don't think the problem is the memory bandwidth limit.