Delay between two (uncoalesced/coalesced) memory transactions

I have a machine with an Intel Xeon E5620 CPU @ 2.40 GHz and an NVIDIA K40 GPU (with 6 GB of RAM), which has been used for the GPU computation. I want to know how I can find these variables:

Delay between two uncoalesced memory transactions.
Delay between two coalesced memory transactions.
DRAM access latency.


I don’t actually know what is meant by “delay between two (un)coalesced memory transactions”. If you want to know how to perform microbenchmarks of the GPU memory subsystem, have a look at this paper:

There are other such papers, so a literature search would seem like a good idea. It is not clear for what purpose you need this information, but any questions about latencies in the context of GPUs always raise a red flag with me, because GPUs are, by design, throughput-optimized processors, not latency-optimized like most CPUs. So questions about latency often hint at an approach to a computational problem that is not a good match for a GPU.

This paper is also a good one, “Demystifying GPU Microarchitecture through Microbenchmarking”.

Even though it does not cover all the details of measuring DRAM access latency, you could follow the same “pointer-chasing dependent reads” strategy to measure it (see page 7, the section on global memory).
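To make that concrete, here is a minimal sketch of what such a pointer-chasing microbenchmark might look like. All names, the buffer size, the stride, and the iteration count are my own choices, not from the paper; a single thread is used so that nothing overlaps the dependent loads, and the stride should be tuned so successive accesses miss in the caches:

```cuda
// Sketch of a pointer-chasing "dependent reads" latency microbenchmark.
// Each load's address depends on the previous load's result, so the loads
// serialize and (elapsed cycles / ITERS) approximates one load's latency.
#include <cstdio>
#include <cuda_runtime.h>

#define N     (64 * 1024 * 1024 / sizeof(unsigned int))  // 64 MB buffer
#define ITERS 4096

__global__ void chase(const unsigned int *buf, unsigned int *out,
                      long long *cycles)
{
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        idx = buf[idx];           // dependent load: cannot be overlapped
    long long stop = clock64();
    *out = idx;                   // keep the loop from being optimized away
    *cycles = stop - start;
}

int main()
{
    unsigned int *h = (unsigned int *)malloc(N * sizeof(unsigned int));
    // Build the chase chain with a large stride so accesses defeat caching;
    // the stride here is an arbitrary choice, tune it for the GPU under test.
    const size_t stride = 32 * 1057;
    for (size_t i = 0; i < N; ++i)
        h[i] = (unsigned int)((i + stride) % N);

    unsigned int *d_buf, *d_out;
    long long *d_cycles;
    cudaMalloc(&d_buf, N * sizeof(unsigned int));
    cudaMalloc(&d_out, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_buf, h, N * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_buf, d_out, d_cycles);   // one thread: pure latency

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("avg latency: %.1f cycles per dependent load\n",
           (double)cycles / ITERS);

    cudaFree(d_buf); cudaFree(d_out); cudaFree(d_cycles);
    free(h);
    return 0;
}
```

Note that clock64() counts SM clock cycles, so converting to nanoseconds requires dividing by the core clock frequency of the GPU being measured.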

By “delay between two (un)coalesced memory transactions”, my understanding is: when warp A executes an (un)coalesced memory transaction, what is the delay before the following warp can execute its (un)coalesced memory transaction?

I need this information to apply the analytical model of Sunpyo Hong and Hyesoon Kim (ISCA 2010) to my algorithm, tested on a Tesla K40.

Your opinion, please.


Do you have documentation about other works related to “An Analytical Model for a GPU Architecture”?


From Sunpyo Hong and Hyesoon Kim, “An Integrated GPU Power and Performance Model”:

That clarifies what motivated your question. Unfortunately, I don’t know what the authors might mean by that metric (they don’t seem to mean plain old memory latency), and I could not find in the paper what value they assumed for this metric in the case of the GTX 280, which would give us a better idea of what needs to be measured in a microbenchmark when moving the analysis to the K40.

The value the author assumed for this metric in the case of the GTX 280 is summarized in table 6 on page 9 of the paper.
hong_isca09(2).pdf (613 KB)

There is no table 6 in my copy of that paper; the last table is table 4. Either you are looking at a different version of the paper (mine may be a pre-print, I can’t tell), or a different paper altogether. Side remark: in computer science, it is customary to cite the full title of a paper to avoid confusion; this differs markedly from the cryptic (author, year) references used in physics or chemistry.

The title is:
" An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness"

For the GTX 280, the entries in table 6 read:

Departure_del_uncoal 40
Departure_del_coal 4

I have no idea what that means. Somewhere else in the paper they state “The latency of each memory warp is at least Mem_L cycles. Departure_delay is the minimum departure distance between two consecutive memory warps.”, so presumably the unit of the values is cycles. It is not clear whether these are cycles with respect to the core clock, the memory clock, or the hot clock, and it is not clear how this would be measured. Maybe someone else can make heads or tails of this; I don’t have the time to read the entire paper (assuming it is explained in there).
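For what it’s worth, if I remember the model correctly (this is my recollection, not verified against your copy of the paper), Departure_delay enters the analysis through the bound on memory warp parallelism, roughly:

```
MWP_without_BW = Mem_L / Departure_delay
```

i.e., how many memory warps can be in flight before the next one must wait. If that is right, then only the ratio matters, so it would suffice to measure both quantities consistently in the same clock domain.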

Mem_LD in the same table presumably gives the memory latency in nanoseconds, and the stated value of 450 is in the range I would expect, as memory latency was typically in the 400 to 600 ns range for GPUs of that generation. This contrasts with a system memory latency of 70-100 ns in x86-based systems of the time.