Questions about coalesced memory access

When I access memory in my kernel like this:
__global__ void mykernel( int *A ) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int a = A[tid];
    //…
}
It will access memory in a coalesced way, meaning that each warp sends a single access request and 128 bytes of consecutive data (32 threads × one 4-byte int each) are read in one transaction.
But if I access memory in my kernel like this:
__global__ void mykernel1( int *A ) {
    int a = A[0];
    //…
}
How many access requests will be sent per warp? One or 32?
If all the threads in my kernel access the same memory address (the data is not constant) at the same time, what's the best way to do this?
Thanks!

When all threads in a warp access the same memory address, this is called uniform access. All GPUs supported by CUDA 7.0 or 7.5 have a broadcast mechanism, so such accesses require only one memory transaction. The result of that transaction is then “broadcast” to all threads in the warp that need the data.

So this requires 1 memory transaction (per warp):

int a = A[tid];

And this requires 1 memory transaction (per warp):

int a = A[0];

In this case:

int a = A[0];

the caches will tend to benefit other warps (L1/L2) or other thread blocks (L2) when they execute the same instruction. Since you say the data is not constant, that is probably the best scenario. If the data were constant, then constant memory, or perhaps the read-only cache (const __restrict__ on cc 3.5 and above), might be useful; a sketch of both options follows below.
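
For illustration, here is a minimal sketch of both constant-data options, assuming a single int is being broadcast to every thread; the kernel and symbol names here are hypothetical, not from your code:

#include <cuda_runtime.h>

// Option 1: constant memory. The value is set from the host with
// cudaMemcpyToSymbol before launch and served through the constant cache,
// which broadcasts it to every thread in a warp.
__constant__ int c_val;

__global__ void readConstant( int *out ) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = c_val;
}

// Option 2: the read-only data cache (cc 3.5 and above). Marking the
// pointer const __restrict__ tells the compiler the data is read-only
// for the kernel's duration, so the load can be routed through the
// read-only cache.
__global__ void readOnlyCache( const int * __restrict__ A, int *out ) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = A[0];
}

// Host-side setup for option 1 (hypothetical value):
// int h_val = 42;
// cudaMemcpyToSymbol( c_val, &h_val, sizeof(int) );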

Thank you very much!