how to gather accessed memory addresses


I’m new to CUDA and to this forum, so please excuse my poor English.

I want to collect information about which addresses the threads access, so that I can analyze the memory access pattern from dynamic information.
To do that, I simply inserted a printf call that outputs the accessed addresses.

__global__ void func(float *A, float *B, float *C, const int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
  if (tid < N) {
    C[tid] = A[tid] + B[tid];

    printf("thread %d accessed %p.\n", tid, (void *)&C[tid]); // print addresses for the array C
  }
}
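For completeness, here is one way the kernel could be wrapped into a self-contained program (a sketch only; the grid/block sizes and the zero-initialization of the inputs are my assumptions, not part of the original code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void func(float *A, float *B, float *C, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        C[tid] = A[tid] + B[tid];
        // %p prints the address of the element this thread wrote to C
        printf("thread %d accessed %p.\n", tid, (void *)&C[tid]);
    }
}

int main() {
    const int N = 32; // one warp, to keep the trace small (assumed size)
    float *A, *B, *C;
    cudaMalloc(&A, N * sizeof(float));
    cudaMalloc(&B, N * sizeof(float));
    cudaMalloc(&C, N * sizeof(float));
    cudaMemset(A, 0, N * sizeof(float)); // values don't matter for the trace
    cudaMemset(B, 0, N * sizeof(float));

    func<<<1, N>>>(A, B, C, N);   // a single block of 32 threads (assumed launch)
    cudaDeviceSynchronize();      // needed to flush device-side printf output

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Note that without the cudaDeviceSynchronize() after the launch, the device-side printf buffer may never be flushed and no output appears.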

For example, suppose I obtained the data below.

thread 0 accessed addr0.
thread 1 accessed addr1.
thread 2 accessed addr2.
thread 3 accessed addr3.
...
thread 31 accessed addr31.

Then I compute the distance between the address accessed by thread 0 and the address accessed by each thread i (addri - addr0), and normalize the distances by the element size (sizeof(float)). This gives me an access-pattern vector for warp 0 (of block 0 of grid 0), for instance [0, 1, ..., 31].
Finally, from this vector I can recognize that the access pattern (at least for this warp) is a coalesced access, because the normalized offset increases by one from lane 0 to lane 31.

Is this a correct way to analyze the memory access pattern?
And are there other, more appropriate ways to do it?