global memory latency

According to my measurements, global memory latency is about 400-600 cycles. Testing on a GeForce 9600GT, I get about 572 cycles (about 342 ns).

A CPU with a comparable memory configuration needs only about 90+ ns, so I am very confused about the latency. Why does it take so long to load data?

Also, the shared memory latency I measure is about 57.5 cycles, but the official documentation says it needs only 1 cycle. Why?

If anyone knows, please tell me. Thanks in advance.

I used 1 block with only 1 thread in that block, loading data from a buffer in a dependent chain.

That’s a good question. I think part of it is that GDDR has higher latency than DDR. But mostly I think it is because the chip hasn’t been optimized for low latency.

I think it may have to do with the texture fetch process, which has many stages and therefore high latency. It was probably a design decision to route regular global memory fetches through the same pipeline, so they incur the same latency.

The latency could probably be improved in the future if the graphics-centric texture functionality were dropped. But keep in mind that low latency has never been a priority in GPU design, because GPUs hide latency by running many threads in parallel. Still, lower latency would let you conserve transistors by not requiring so many registers to keep all those threads resident.
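To see the latency-hiding effect concretely, here is a rough sketch of my own (kernel name and launch sizes are made up, not from any official example): the same streaming read launched with one thread is latency-bound, while launched with many threads the loads overlap and only bandwidth matters.

// Sketch only: each thread sums a strided slice of the input.
// "out" needs one slot per launched thread.
__global__ void stream_read(const unsigned int * in, unsigned int * out, int n)
{
    unsigned int sum = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];                                   // independent loads can overlap
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;   // keep loads from being optimized away
}

// stream_read<<<1, 1>>>(d_in, d_out, n);      // one thread: runtime ~ n * latency
// stream_read<<<120, 256>>>(d_in, d_out, n);  // many threads: latency hidden, runtime ~ n / bandwidth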

Btw, take a look at this thread: http://forums.nvidia.com/index.php?showtopic=80451

Must be an error in how you’re measuring. I’ve found, though, that shared memory usually takes 1.5 cycles, not 1.

Thank you for your reply!
Your measured result for shared memory is 1.5 cycles?
Oh, my god!
Below is my testing code for global memory and shared memory:

/************************************************************
 *  Micro-workload.cu
 *  This is developed for testing latency.
 ************************************************************/

#include <stdlib.h>
#include <stdio.h>
#include <cutil.h>

#define ELENUM (1024)
#define DATA_SIZE (4*ELENUM)
#define ITER 40000L

// for global memory latency
__global__ static void Global_L(unsigned long * Buffer, long * time, long * result);

// for shared memory latency
__global__ static void Share_L(unsigned long * Buffer, long * time, long * result);

/************************************************************
 *  Program main
 *  Test memory latency
 ************************************************************/

int main(int argc, char** argv)
{
    unsigned long * Buf_Glo, * Buf_Cpu;
    long * time;
    long * result;
    char in;
    int i = 0;

    CUT_DEVICE_INIT();

    // Build a chained index buffer on the host: element i points to i+4,
    // and the last initialized element points back to 0.
    Buf_Cpu = (unsigned long *)malloc(DATA_SIZE);
    for(i = 0; i < DATA_SIZE/sizeof(unsigned long); i += 4)
        Buf_Cpu[i] = i + 4;
    Buf_Cpu[i-4] = 0;

    cudaMalloc((void**) &Buf_Glo, DATA_SIZE);
    cudaMalloc((void**) &time, sizeof(long));
    cudaMalloc((void**) &result, sizeof(long));

    long time_used = 0;
    cudaMemcpy(Buf_Glo, Buf_Cpu, DATA_SIZE, cudaMemcpyHostToDevice);

    // shared memory latency test segment
    printf("shared memory latency test \n");
    Share_L<<<1, 1, DATA_SIZE>>>(Buf_Glo, time, result);
    cudaMemcpy(&time_used, time, sizeof(long), cudaMemcpyDeviceToHost);
    printf("Iteration times: %ld, each latency: %f clock\n\n", ITER*16, (float)time_used/(ITER*16));

    // global memory latency test segment
    printf("global memory latency test \n");
    Global_L<<<1, 1, 0>>>(Buf_Glo, time, result);
    cudaMemcpy(&time_used, time, sizeof(long), cudaMemcpyDeviceToHost);
    printf("Iteration times: %ld, each latency: %f clock\n\n", ITER*16, (float)time_used/(ITER*16));

    cudaFree(time);
    cudaFree(Buf_Glo);
    cudaFree(result);
    free(Buf_Cpu);

    scanf("%c", &in);
    return 0;
}

//for global memory latency
__global__ static void Global_L(unsigned long * Buffer, long * time, long * result)
{
    unsigned long i = 0, index = 0, iteration = ITER;
    unsigned long mask = DATA_SIZE/sizeof(unsigned long) - 1;
    *time = 0;

    clock_t start = clock();
    for(i = 0; i < iteration; i++)
    {
        // 16 dependent loads per iteration: each index comes from the
        // previous load, so successive accesses cannot overlap.
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
        index = Buffer[index] & mask;
    }
    *time = (clock() - start);
    *result = index;    // keep the chain from being optimized away
}

// for shared memory latency
__global__ static void Share_L(unsigned long * Buffer, long * time, long * result)
{
    unsigned long i = 0, index = 0, iteration = ITER;
    extern __shared__ unsigned long shared[];
    unsigned long mask = DATA_SIZE/sizeof(unsigned long) - 1;
    *time = 0;

    // load the global memory into shared memory
    for(i = 0; i < DATA_SIZE/sizeof(unsigned long); ++i)
    {
        shared[i] = Buffer[i];
    }

    clock_t start = clock();
    for(i = 0; i < iteration; i++)
    {
        // same 16-step dependent chain, but on shared memory
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
        index = shared[index] & mask;
    }
    *time += (clock() - start);
    *result = index;    // keep the chain from being optimized away
}

Are there any problems with it?
Can you share your testing code with me?

A couple things to keep in mind.

  1. Performing one step of your code requires multiple instructions (fetch, mask, assign), each of which has a latency that you're counting as part of your total (see the sketch after this list).

  2. Measuring latency of a shared mem fetch is not useful. In fact, latency as a concept is not important on GPUs. Latency can be hidden completely, so it only shows up when you’re running an insufficient number of threads. What counts is throughput. For shared memory, you can do about two fetches in three cycles (at least when doing matrix-multiply) when running at least 64 or 128 threads per SM.
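For point 1, one way to separate the memory latency from that instruction overhead (my own suggestion, untested; it reuses the poster's ITER and DATA_SIZE macros, and the kernel name is made up) is to time the identical dependent chain with no load at all and subtract its per-step cost:

// Hypothetical calibration kernel: the same mask/assign chain, but with no
// memory access. Its per-step clock count approximates the loop and ALU
// overhead, which can be subtracted from the measured global/shared figures.
__global__ static void Overhead_L(long * time, long * result)
{
    unsigned long i, index = 0, iteration = ITER;
    unsigned long mask = DATA_SIZE/sizeof(unsigned long) - 1;

    clock_t start = clock();
    for(i = 0; i < iteration; i++)
    {
        index = (index + 4) & mask;   // same dependent arithmetic, no load
        index = (index + 4) & mask;
        index = (index + 4) & mask;
        index = (index + 4) & mask;   // unroll to 16 steps to match the original
    }
    *time = (clock() - start);
    *result = index;                  // keep the chain live
}

After unrolling it to the same 16 steps, divide its clock count by ITER*16 and subtract that per-step overhead from the 57.5-cycle and 572-cycle figures.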

There seem to be register read-after-write dependencies in your chain, which take about 24 cycles to resolve. Did you take that into account?
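If it helps, a quick (untested) way to check how much of the 57.5 cycles comes from that register dependency rather than from the shared memory access itself is to replace the timed loop inside Share_L with two interleaved, independent chains; the variable names here are mine:

    // Sketch: two independent index chains. If the per-step time drops a lot,
    // the register read-after-write stall was dominating; if it stays about
    // the same, it really is access latency.
    unsigned long a = 0, b = (mask + 1) / 2;   // both start on initialized elements
    clock_t start = clock();
    for(i = 0; i < iteration; i++)
    {
        a = shared[a] & mask;   // chain A
        b = shared[b] & mask;   // chain B, does not depend on A
        a = shared[a] & mask;
        b = shared[b] & mask;
        // ... unrolled as in the original
    }
    *time = (clock() - start);
    *result = a ^ b;            // keep both chains live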