Dependent global memory reads

Hello all,

I have two questions about global memory reads (apologies if these have been answered before; feel free to point me to any existing thread):

1] Considering uncoalesced global memory reads, is there any reason why dependent memory accesses ( v0 = G[i]; v1 = G[v0]; ) should perform slower than independent ones ( v0 = G[i]; v1 = G[j]; )?
In other words, can the hardware somehow overlap two successive independent (uncoalesced) global memory reads?

2] If all 32 threads of a warp perform uncoalesced reads (i.e. completely random accesses), what happens? Is it as slow as a single thread performing 32 reads in sequence?


  1. Performance with the dependent reads will be significantly lower. The reason is that the second read cannot start until the first one completes, whereas two independent reads are overlapped (pipelined).

  2. For a detailed explanation of what happens with uncoalesced memory reads, please refer to the relevant section of the Programming Guide.

An entire warp performing an uncoalesced read should be faster than a single thread performing 32 separate reads, for a number of reasons. First, it’s one assembly instruction rather than multiple ones, so the serialization is done by the memory hardware and not by your code. Second, you should get better overlap of the memory transactions, as there is a limit on how many independent reads from a single thread will be pipelined. This limit depends on the compiler, because the more independent reads you have in flight, the more registers your code requires.


Thanks for your answer. What I don’t understand is why I cannot reproduce this behaviour (which I need :-) ).

Here is the code I am using to compare both:

// CPU side:

#define TABLE_SIZE 8

int values[TABLE_SIZE] = { 1, 6, 7, 2, 5, 3, 4, 0 };

// GPU side:

__device__ int non_dependent(int *_values)
{
  int i = 0;
  int v = 0;

  i = _values[0]; v += i;
  i = _values[1]; v += i;
  i = _values[6]; v += i;
  i = _values[4]; v += i;
  i = _values[5]; v += i;
  i = _values[3]; v += i;
  i = _values[2]; v += i;
  i = _values[7]; v += i;

  return v;
}

__device__ int dependent(int *_values)
{
  int i = 0;
  int v = 0;

  i = _values[i]; v += i;
  i = _values[i]; v += i;
  i = _values[i]; v += i;
  i = _values[i]; v += i;
  i = _values[i]; v += i;
  i = _values[i]; v += i;
  i = _values[i]; v += i;
  i = _values[i]; v += i;

  return v;
}

Notice that both functions perform the exact same access pattern (the indices 0, 1, 6, 4, 5, 3, 2, 7, in the same order), and that each entry of the array is visited exactly once.

On a GeForce 8800 GTX, CUDA 2.0, with the following kernel:

__global__ void kernel_mem_perf_dependent(int *gpu_result, int *gpu_memtest)
{
  int i   = blockIdx.x * blockDim.x + threadIdx.x;
  int j   = blockIdx.y * blockDim.y + threadIdx.y;
  int pix = i + j * Width;

  gpu_result[pix] = dependent(gpu_memtest);
}

(there is another kernel that calls non_dependent instead)

Both kernels take exactly the same time with 128 threads per block: 1.5 ms on a 256x256 array, 6.0 ms on 512x512, and 24.1 ms on 1024x1024.

I tried larger tables (e.g. 128 accesses) and various block sizes, but I get the same behaviour every time.

The compiler does use multiple registers (as far as I can tell from the .ptx and .cubin).

What am I missing?