How to calculate memory latency (in clock cycles)?

I tried to use the CUDA timer outside the kernel function, but it's difficult to isolate the memory latency, which is only one part of what the whole kernel does.

And using cudaThreadSynchronize() doesn't help time the cost of just one thread.

So how do I make an accurate timer for a single thread, and how do I extract the global memory latency, coalesced/uncoalesced shared memory latency, and texture memory latency?

The standard trick is to build a long dependent linked list, so a thread following the chain has to wait for each memory access to complete before it can move to the next link.

You might do this artificially by, say, initializing 10000 words of device memory to 0 and then, with a single thread, doing:

int index = 0;
while (index < 10000) index = index + 1 + mem[index];

Now all the mem[] values are identically 0, but the compiler and runtime don't know that, so the loop really performs 10000 dependent memory accesses.

Take your kernel runtime and divide by 10000.
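Put together, a minimal version of the whole measurement might look like the sketch below. The names and sizes are illustrative, error checking is omitted, and the event timer measures wall time for the whole kernel, so launch overhead adds a small constant you may want to subtract.

```cuda
#include <cstdio>

#define N 10000

// Each load's address depends on the previous load's value,
// so the N reads cannot overlap: the kernel time is dominated
// by N back-to-back memory latencies.
__global__ void chase(const int *mem, int *out)
{
    int index = 0;
    while (index < N)
        index = index + 1 + mem[index];
    *out = index;          // keep the loop from being optimized away
}

int main()
{
    int *d_mem, *d_out;
    cudaMalloc(&d_mem, N * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_mem, 0, N * sizeof(int));   // every link is 0

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    chase<<<1, 1>>>(d_mem, d_out);           // one block, one thread
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("approx. latency: %f ns per access\n", ms * 1e6f / N);
    return 0;
}
```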

You could use the CPU timer, but the device-side clock() function is usually better for very high precision. (Note that clock() is not wallclock based, though; it's a shader clock counter.)
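With clock() you read the SM's cycle counter inside the kernel itself, so the answer comes out directly in shader clocks rather than wall time. A hedged sketch wrapping the same chain loop (names are illustrative):

```cuda
// Sketch: time the dependent chain with the per-SM cycle counter.
// mem[] is assumed zero-initialized on the host side.
__global__ void chase_clocked(const int *mem, int *cycles, int *out)
{
    int index = 0;
    clock_t t0 = clock();
    while (index < 10000)
        index = index + 1 + mem[index];
    clock_t t1 = clock();
    *cycles = (int)(t1 - t0);   // total cycles for 10000 dependent reads
    *out = index;               // keep the loop live
}
// On the host, divide *cycles by 10000 to get cycles per access.
```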

Of course latency is complex and multiflavored, since there are undocumented banks and caches and TLBs, each with subtle nuances, but this link following method will give you a quick general answer.

You mean that in the kernel function, only those two lines you mentioned above are enough?

How do I set the block size and grid size?

In this case it seems that all threads are accessing the same memory. Will there be any conflict that could increase the latency? And does 'mem' here mean global memory or shared memory?


uboat, here you’d likely just use one block with one thread.

Though you could argue it would be interesting to see how one thread per block across 30 blocks (or however many SMs you have) behaves.
It gets complicated. But read that linked paper if you want to get into the details, especially Figure 1, which will show you there is no such thing as a single number that defines "latency".

But if there is only one thread per block, not all the SPs can work at the same time and most of them will be idle. Could this situation affect the performance and make the latency measurement inaccurate?

You want the SPs idle when you are measuring latency. You want the timing to depend on the memory controller’s latency, without any extra processing delay, so the SP (even just one thread) should be doing nothing but waiting for that memory read to finish.

If you want to measure throughput/bandwidth, that’s an entirely different issue and pretty much independent of latency.

Read vvolkov's paper; it's got a ton of great detail.