How to calculate memory latency (in clock cycles)?

I tried to use the CUDA timer outside the kernel function, but it's difficult to isolate the memory latency, which is only one part of what the whole kernel does.

And using cudaThreadSynchronize() doesn't help time the cost of just one thread.

So how do I make an accurate timer for a single thread, and how do I extract the global memory latency, coalesced/uncoalesced shared memory latency, and texture memory latency?

The standard trick is to build a long dependent linked list, so a thread following the chain has to wait for each memory access to complete before it can move to the next link.

You might do this artificially by, say, initializing 10000 words of device memory to 0 and then, with a single thread, doing:

int index = 0;
while (index < 10000) index = index + 1 + mem[index];

Now all the mem[] values are identically 0, but the compiler and runtime don't know that, so the loop really performs 10000 dependent memory accesses.

Take your kernel runtime and divide by 10000.
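Put together, a minimal version of the whole measurement might look like the sketch below. The names and sizes are illustrative, error checking is omitted, and the event timer measures wall time for the whole kernel, so launch overhead adds a small constant you may want to subtract.

```cuda
#include <cstdio>

#define N 10000

// Each load's address depends on the previous load's value,
// so the N reads cannot overlap: the kernel time is dominated
// by N back-to-back memory latencies.
__global__ void chase(const int *mem, int *out)
{
    int index = 0;
    while (index < N)
        index = index + 1 + mem[index];
    *out = index;          // keep the loop from being optimized away
}

int main()
{
    int *d_mem, *d_out;
    cudaMalloc(&d_mem, N * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_mem, 0, N * sizeof(int));   // every link is 0

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    chase<<<1, 1>>>(d_mem, d_out);           // one block, one thread
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("approx. latency: %f ns per access\n", ms * 1e6f / N);
    return 0;
}
```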

You could use the CPU timer, but the device-side clock() function is usually better for very high precision. (Note that clock() is not wallclock based, though; it's a shader clock counter.)
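With clock() you read the SM's cycle counter inside the kernel itself, so the answer comes out directly in shader clocks rather than wall time. A hedged sketch wrapping the same chain loop (names are illustrative):

```cuda
// Sketch: time the dependent chain with the per-SM cycle counter.
// mem[] is assumed zero-initialized on the host side.
__global__ void chase_clocked(const int *mem, int *cycles, int *out)
{
    int index = 0;
    clock_t t0 = clock();
    while (index < 10000)
        index = index + 1 + mem[index];
    clock_t t1 = clock();
    *cycles = (int)(t1 - t0);   // total cycles for 10000 dependent reads
    *out = index;               // keep the loop live
}
// On the host, divide *cycles by 10000 to get cycles per access.
```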

Of course latency is complex and multiflavored, since there are undocumented banks and caches and TLBs, each with subtle nuances, but this link following method will give you a quick general answer.

You mean that in the kernel function, only those two lines you mentioned above are enough?

How do I set the block size and grid size?

In this case it seems that all threads are accessing the same memory. Will there be any conflict that could increase the latency? And does 'mem' here mean global memory or shared memory?


uboat, here you’d likely just use one block with one thread.

Though you could argue it would be interesting to see how one thread per block across 30 blocks (or however many SMs you have) behaves.
It gets complicated. But read that linked paper if you want to get into the details, especially Figure 1, which will show you there is no such thing as a single number that defines "latency".

But if there is only one thread per block, not all the SPs can work at the same time and most of them will be idle. Could this situation affect the performance and make the latency measurement inaccurate?

You want the SPs idle when you are measuring latency. You want the timing to depend on the memory controller’s latency, without any extra processing delay, so the SP (even just one thread) should be doing nothing but waiting for that memory read to finish.

If you want to measure throughput/bandwidth, that’s an entirely different issue and pretty much independent of latency.

Read vvolkov's paper; it's got a ton of great detail.