I wrote two test programs:
- Program 1: every thread reads 2048 bytes, adds them up, and writes back the result.
- Program 2: a thread reads 1 byte if threadIdx % 2 == 0,
  and 2048 bytes if threadIdx % 2 == 1.
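
The original kernels are not shown in the post, so the following is only a minimal sketch of what the two programs might look like. The names `bytesForThread`, `readKernel`, and `runProgram`, the `unsigned char` input buffer, and the layout where each thread owns its own 2048-byte slice are all assumptions made for illustration.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Values taken from the post; the names are hypothetical.
constexpr int kBytesPerThread = 2048;
constexpr int kGrid  = 16;   // dimgrid
constexpr int kBlock = 768;  // dimblock

// Number of bytes a thread reads.
// Program 1 (mixed == false): always 2048.
// Program 2 (mixed == true): 1 byte for even threadIdx.x, 2048 for odd.
__host__ __device__ inline int bytesForThread(int threadIdxX, bool mixed) {
    return (mixed && threadIdxX % 2 == 0) ? 1 : kBytesPerThread;
}

// Each thread sums the bytes of its own slice (assumed layout) and stores the sum.
__global__ void readKernel(const unsigned char* in, unsigned int* out, bool mixed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned char* slice = in + static_cast<size_t>(tid) * kBytesPerThread;
    int n = bytesForThread(threadIdx.x, mixed);
    unsigned int sum = 0;
    for (int i = 0; i < n; ++i) sum += slice[i];
    out[tid] = sum;
}

// Launches one of the two programs and returns the CUDA status after it finishes.
cudaError_t runProgram(const unsigned char* dIn, unsigned int* dOut, bool mixed) {
    readKernel<<<kGrid, kBlock>>>(dIn, dOut, mixed);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

In this sketch `dIn` must point to at least 16 * 768 * 2048 bytes of device memory and `dOut` to 16 * 768 `unsigned int` values; `runProgram(dIn, dOut, false)` would correspond to program 1 and `runProgram(dIn, dOut, true)` to program 2.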
I calculate bandwidth as bytes read / elapsed time.
I use 768 * 16 threads. (dimgrid = 16, dimblock = 768)
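
To make the measurement concrete, here is one possible way to compute "bytes read / elapsed time" for this launch configuration. The post does not say how the time was measured, so the CUDA event timing, the helper names `totalBytesRead`, `toGBps`, and `timeKernelMs`, and the stand-in `sumKernel` are assumptions. The sketch also assumes the numerator counts only the bytes each thread actually requests (1 per even thread and 2048 per odd thread in the mixed case).

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Bytes requested by one launch of blocks x threadsPerBlock threads.
// mixed == false: every thread reads 2048 bytes.
// mixed == true : even threads read 1 byte, odd threads read 2048 bytes.
inline size_t totalBytesRead(int blocks, int threadsPerBlock, bool mixed) {
    size_t perBlock = 0;
    for (int t = 0; t < threadsPerBlock; ++t)
        perBlock += (mixed && t % 2 == 0) ? 1 : 2048;
    return perBlock * blocks;
}

// Bandwidth in GB/s (1 GB = 1e9 bytes) from a byte count and a time in milliseconds.
inline double toGBps(size_t bytes, float elapsedMs) {
    return static_cast<double>(bytes) / (elapsedMs / 1000.0) / 1e9;
}

// Stand-in kernel (hypothetical): each thread sums its own 2048-byte slice,
// or only the first byte of it when mixed is set and threadIdx.x is even.
__global__ void sumKernel(const unsigned char* in, unsigned int* out, bool mixed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n = (mixed && threadIdx.x % 2 == 0) ? 1 : 2048;
    unsigned int sum = 0;
    for (int i = 0; i < n; ++i) sum += in[static_cast<size_t>(tid) * 2048 + i];
    out[tid] = sum;
}

// Times one launch with dimgrid = 16, dimblock = 768 and returns milliseconds.
float timeKernelMs(const unsigned char* dIn, unsigned int* dOut, bool mixed) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    sumKernel<<<16, 768>>>(dIn, dOut, mixed);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

With these helpers, the bandwidth of one run would be `toGBps(totalBytesRead(16, 768, mixed), timeKernelMs(dIn, dOut, mixed))`, where `dIn` and `dOut` are device buffers of at least 16 * 768 * 2048 bytes and 16 * 768 `unsigned int` values respectively.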
I thought program 1 would perform better than program 2,
because in program 2 a thread that reads 1 byte has to wait until the threads that read 2048 bytes finish.
But the performance is similar, and sometimes program 2 is even better.
I can't explain this with what I know. Can you help me?