Is it similar to the global memory one or not?
How should I set the number of threads and blocks? Previously I launched only one block with a single thread, and the result seemed unreasonable. But if I use a large block size, the latencies of different threads would overlap, and that would distort the result.
So I'm confused about the setup.
In someone's program for measuring global memory latency, the author used start_time ^= value[idx] and end_time = clock() in every loop iteration. How do these work?
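For reference, here is a minimal sketch of how I understand that kind of timing loop. It is my own reconstruction, not the original program: the names chase_kernel and measure_latency, the stride-based index array, and the acc ^= idx line (my guess at the purpose of the XOR, namely to consume the loaded value before the second clock() read) are all assumptions. The compiler may still reorder instructions, so the generated SASS would need checking.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: times a chain of dependent global loads with clock().
// Intended for a <<<1, 1>>> launch so that no other thread overlaps the latency.
__global__ void chase_kernel(const unsigned int *value, int iters,
                             unsigned int *cycles, unsigned int *sink)
{
    unsigned int idx = 0;    // current position in the index chain
    unsigned int acc = 0;    // XOR accumulator that consumes every loaded value
    unsigned int total = 0;  // summed clock ticks over all iterations

    for (int i = 0; i < iters; ++i) {
        unsigned int start_time = (unsigned int)clock();
        idx = value[idx];    // dependent load: the next address comes from this load
        acc ^= idx;          // assumed role of the XOR: use the value before the clock read
        unsigned int end_time = (unsigned int)clock();
        total += end_time - start_time;  // unsigned subtraction tolerates counter wraparound
    }

    *cycles = total;
    *sink = acc;             // written out so the loads are not optimized away
}

// Hypothetical host helper: returns average clock ticks per load, or -1.0 on a CUDA error.
double measure_latency(int n, int stride, int iters)
{
    // value[i] holds the index of the next element to visit.
    std::vector<unsigned int> h_value(n);
    for (int i = 0; i < n; ++i) h_value[i] = (unsigned int)((i + stride) % n);

    unsigned int *d_value = nullptr;
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_value, n * sizeof(unsigned int)) != cudaSuccess) return -1.0;
    if (cudaMalloc(&d_out, 2 * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_value);
        return -1.0;
    }

    cudaError_t err = cudaMemcpy(d_value, h_value.data(), n * sizeof(unsigned int),
                                 cudaMemcpyHostToDevice);
    unsigned int h_out[2] = {0, 0};
    if (err == cudaSuccess) {
        chase_kernel<<<1, 1>>>(d_value, iters, d_out, d_out + 1);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_value);
    cudaFree(d_out);
    return err == cudaSuccess ? (double)h_out[0] / iters : -1.0;
}
```

Calling measure_latency(1024, 32, 256) from main would report the average ticks per load for a 1024-element array visited with a stride of 32. The figure also includes the cost of the clock() reads themselves, so it is an upper estimate rather than the pure load latency.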