Clock() and Clock64() Functions

Hello community,

I would like to understand how the clock() and clock64() functions work. In the CUDA Programming Guide it is stated that

7.13. Time Function
clock_t clock(); long long int clock64();
when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is greater than the latter since threads are time sliced.

What does time-slicing of threads mean with respect to execution of instructions?

To be clearer, I want to get the exact execution time of instructions. I am going to use this technique to classify memory read operations, for example as a row buffer hit or a row buffer conflict.
When I use the clock() or clock64() functions, I don’t see a clear difference in memory access times. All the memory access times I see are ~600 clock cycles, which is in the range of global memory access latency. However, I want to capture even a tiny difference in execution time so that I can classify the memory access.
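
For reference, the measurement I am doing is roughly along these lines (a minimal sketch, with placeholder names; the volatile shared-memory store is my attempt to keep the second clock sample from issuing before the load result arrives):

#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: a single thread times one global load with clock64().
// Storing the loaded value to volatile shared memory before the second
// clock sample is intended to keep that sample from issuing before the
// load result has arrived; inspecting the generated SASS is still advisable.
__global__ void timeLoad(const unsigned* data, long long* cycles, unsigned* sink)
{
    __shared__ volatile unsigned s;
    long long start = clock64();
    unsigned v = data[0];        // the access being timed
    s = v;                       // dependency: cannot issue until v is ready
    long long stop = clock64();
    *cycles = stop - start;
    *sink = s;                   // keep everything from being optimized away
}

int main()
{
    unsigned *d_data, *d_sink;
    long long *d_cycles;
    cudaMalloc((void**)&d_data, sizeof(unsigned));
    cudaMalloc((void**)&d_sink, sizeof(unsigned));
    cudaMalloc((void**)&d_cycles, sizeof(long long));
    cudaMemset(d_data, 0, sizeof(unsigned));

    timeLoad<<<1, 1>>>(d_data, d_cycles, d_sink);

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("measured: %lld cycles\n", cycles);
    return 0;
}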

Thanks in advance.

The excerpted text says:

The former number is greater than the latter since threads are time sliced.

where the former number is:

the number of clock cycles taken by the device to completely execute the thread

and the latter is:

the number of clock cycles the device actually spent executing thread instructions

Suppose we have a sub-sequence of SASS instructions like this:

LD R0, [a]          // load first operand a
LD R3, clock        // first clock sample
LD R1, [b]          // load second operand b
FMUL R2, R0, R1     // multiply a * b
LD R4, clock        // second clock sample
ST [c], R2          // store the product
IADD R5, R4, -R3    // compute the clock difference
ST [diff], R5       // store the measured duration

We are multiplying 2 numbers which must be fetched into registers, and based on the placement of the instructions that load from the clock register, we wish to time the duration of the 2nd load instruction and the multiply instruction (or something like that).

Let’s also assume that there are many warps executing this instruction stream.

The warp scheduler issues the first instruction for the first warp. There are no dependencies, and a load operation does not by itself ever cause a stall, so the warp scheduler in the next cycle issues the 2nd instruction for the first warp. Again, no dependencies, no stalls, so the warp scheduler in the 3rd cycle issues the 3rd instruction for warp 0. The clock register has already been sampled by this point (in the second instruction).

Now, the warp scheduler would like to proceed, but the FMUL instruction depends on previous register activity, and so we could imagine or posit a warp stall at this point for warp 0. So the warp scheduler then goes back to the first instruction and issues it for warp 1, and repeats the sequence, until a stall, for warp 1. Likewise for warp 2, and so on. Somewhere after the 8th warp, but before the last warp, the stall on warp 0 unblocks/disappears.

The warp scheduler could go on to the 9th warp, issuing the first instruction, or it could go back to warp 0 and issue the FMUL instruction. What will it do? We don’t know; it’s not specified. Either option is possible. Let’s say it goes on to the 9th warp. That means the instructions for warp 0 that have already been issued have finished their work, but the next instruction is not issued yet. However, the clock is still running, cycle by cycle.

The warp scheduler goes on to finish issuing the first 3 instructions for each of the 16 warps, and now it goes back to warp 0 and picks up where it left off. It issues the FMUL instruction, and, noticing that there are no dependencies, in the very next cycle it goes ahead and issues the next LD instruction, which samples the clock again.

We now have our two clock samples. But the difference between these clock samples is not purely due to activity in warp 0: many other warps, and many other instructions, were issued, cycle by cycle, by that warp scheduler in between.
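
To put some made-up numbers on it: if issuing each instruction takes one cycle and the scheduler works through roughly 3 instructions for each of 16 warps between our two clock samples, the measured difference is already on the order of 48 cycles, even though only a handful of those cycles were spent issuing warp 0’s own instructions. (These numbers are purely illustrative.)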

This demonstrates the idea of time-slicing between threads. The warp scheduler has effectively time-sliced the resources of the SM across threads belonging to 16 warps in this time frame. The total time measured between the two clock samples does indeed account for the time it took to finish the work associated with those two instructions, but other work also got included, such as the first instruction for other warps, which is not part of the instruction stream we delineated when we carefully placed our clock sampling point after the first instruction.

In this way, the measured time is greater than, or equal to, the time it took to process just the two instructions we had delineated - the second load instruction and the FMUL instruction.

This also highlights the difficulty of trying to use these facilities to measure instruction latency and nothing else. This is a difficult task on a GPU, and there are no built-in hardware monitors that tell you exactly how many cycles a particular instruction spent in a pipeline, or which clock cycles those were. To approach such measurements, it’s usually necessary to construct careful benchmarks that let the GPU operate as a whole, while inferring the per-instruction behavior without actually directly measuring it.

An example of such careful benchmarking design is the work done by the Citadel group. It’s a non-trivial task.
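
To give a flavor of what such a benchmark tends to look like (this is only a sketch of a common building block, not the Citadel group’s actual methodology): a pointer chase makes each load’s address depend on the previous load’s value, so the loads cannot overlap and the average cycles per iteration approaches the dependent load latency.

// Sketch of a pointer-chase latency loop. chain[] must be initialized on the
// host so that chain[i] holds the index of the next element to visit, in a
// randomized order over a footprint larger than the caches. Loop overhead is
// included in the measurement; careful benchmarks subtract or amortize it.
__global__ void pointerChase(const unsigned* chain, long long* cycles,
                             unsigned* sink, int iters)
{
    unsigned j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i) {
        j = chain[j];                 // next address depends on this load
    }
    long long stop = clock64();
    *sink = j;                        // keep the chase from being optimized away
    *cycles = (stop - start) / iters; // approximate cycles per dependent load
}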


Dear Robert,

Thank you so much for the great, lecture-quality explanation. Now I understand how the warp scheduler works and how it actually affects the architectural flow of execution.

Since warp scheduling creates such an overhead, let’s say the code uses only a single thread and there is no other process running on the GPU. In that case, we wouldn’t expect any overhead from warp scheduling.

However, in this case as well, I don’t see a difference. All I see, as I mentioned above, is almost the same ~600 clock cycles.

What could be the issue now?

Thanks,
Best regards.

I was mostly trying to address the question of yours that I excerpted. Regarding the rest of your post, I’m not sure what you are asking. Global load latency could easily be on the order of 600 cycles, and you stated this yourself.

You seem to be expecting something different. I’m not sure why. The single thread case makes it especially unambiguous.

I don’t think so. If you only have a single warp (or a single thread), there is only one option in any given cycle for the warp scheduler to choose from. All instructions are issued warp-wide, even when there is only 1 thread, and for any number of threads up to 32. There is no such concept as warp scheduler “contention” between threads within the same warp; they are scheduled together. Some instructions that target functional units in the SM that have fewer than 32 lanes may stretch out over multiple back-to-back cycles to complete the issue process, but this does not apply to the LSU, AFAIK, for global memory accesses.

Furthermore, the global memory pipe, serviced by the LSU, is considered a variable-latency pipe. Therefore it’s not reasonable to assume that you will get the exact same latency every time you measure it. However, if you are issuing a request to global memory that does not hit in the L2, and you measure ~600 cycles, my guess is that you will measure ~600 cycles for every such attempt. But if you measured e.g. 632 cycles on one attempt, I would not assume you will measure 632 cycles on every such attempt. There may be some variability.
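
If you want to see whether there is any variability, one option is to record every individual sample rather than a single number or an average, roughly like this (a sketch; the stride and the shared-memory trick are arbitrary choices on my part):

// Sketch: one thread records each latency sample into samples[] so the
// spread (min/median/max) can be inspected on the host, not just the mean.
__global__ void sampleLatencies(const unsigned* data, long long* samples,
                                unsigned* sink, int n)
{
    __shared__ volatile unsigned s;
    unsigned acc = 0;
    for (int i = 0; i < n; ++i) {
        long long start = clock64();
        unsigned v = data[i * 32];   // 32 unsigned = 128 bytes, a new cache line each time
        s = v;                       // force a wait for the load result
        long long stop = clock64();
        samples[i] = stop - start;
        acc += s;
    }
    *sink = acc;                     // keep the loads from being optimized away
}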


Again thanks a lot for the detailed answer.

What I am trying to do, for example in the SASS code you provided, is to check whether the access to a has caused any row buffer contention for the access to b. The row buffer contention should be visible, I believe; in other words, the clock cycle counts should differ.
Also, I am not just running it once. I run this code multiple times, each time bypassing the cache. The average time still does not give any clue.
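
(For reference, the cache bypassing I mention is along these lines, using the cache-global load path, which as far as I understand skips the L1 but still goes through the L2:)

// Sketch of the load I use for “cache bypassing”: __ldcg() (equivalently,
// inline PTX ld.global.cg) caches only at the L2 level, skipping the L1.
// As far as I know there is no corresponding way to skip the L2 for
// ordinary global loads.
__device__ long long timeOneAccess(const unsigned* addr, volatile unsigned* s)
{
    long long start = clock64();
    unsigned v = __ldcg(addr);   // L1-bypassing load
    *s = v;                      // force a wait for the load result
    return clock64() - start;
}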

I don’t know what “row buffer contention” is.

To facilitate further discussion, it might be better if you provide a short complete example.

I am sorry for the inconvenience.

Sure. Briefly, DRAM (and likewise GDDR) is organized as memory arrays/banks, where the data is laid out as a 2D array with rows and columns. On commodity GPU boards, the GDDR is installed as separate chips. Each chip has 16 memory arrays, or banks.

When we access an address in DRAM, the memory controller decides which chip it is located in, then which bank/memory array. Then the row containing the accessed address is buffered in the so-called “row buffer”, since the data cannot be read directly out of a row of a bank (this is down to the physics of DRAM technology). In this regard, the row buffer acts like a cache memory inside the DRAM. The required column of data is then read from the row buffer.
Now, let’s say we access another address in DRAM, and this address is located in the same chip and the same bank, but in a different row. What happens is that the DRAM controller first closes the row buffer, meaning the data is written back to the previous row (this happens for both reads and writes), then activates the row for the currently required address, and then reads from the row buffer. This process of closing the previous row and activating the new row is what row buffer contention is.

If the second access is from the same row, then the access would be directly served from the row buffer itself.

The row buffer acts like a small cache inside a DRAM bank. Missing this cache versus hitting it should make a difference in access times.
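
Conceptually, what I have in mind is something like the purely hypothetical address decomposition below; the real GDDR mapping is vendor-specific and undocumented, so the field positions and widths here are invented only to illustrate the idea.

// Purely hypothetical physical-address decomposition, only to illustrate the
// row-buffer-hit vs. row-buffer-conflict distinction. Real GDDR controllers
// use a different (and typically swizzled, undocumented) mapping.
struct DramCoord { unsigned chip, bank, row, col; };

__host__ __device__ DramCoord decode(unsigned long long paddr)
{
    DramCoord c;
    c.col  = (paddr >> 2)  & 0x3FF;   // assume 1024 columns of 4 bytes each
    c.chip = (paddr >> 12) & 0x7;     // assume 8 chips
    c.bank = (paddr >> 15) & 0xF;     // assume 16 banks per chip
    c.row  = (paddr >> 19) & 0x3FFF;  // remaining bits select the row
    return c;
}

// Two accesses conflict in the row buffer if they fall in the same chip and
// bank but in different rows (under this made-up mapping).
__host__ __device__ bool rowConflict(unsigned long long a, unsigned long long b)
{
    DramCoord x = decode(a), y = decode(b);
    return x.chip == y.chip && x.bank == y.bank && x.row != y.row;
}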

I hope my point is clear now.

That sort of analysis means you have to know (or are attempting to determine, which is even harder) the virtual address to physical DRAM cell mapping. I don’t happen to know that mapping, but my understanding is that naive ideas might not describe the situation. Long ago there was an observable phenomenon called partition camping, and AFAIK NVIDIA worked around that at least partially by swizzling the virtual-to-physical cell mapping.

Furthermore, it might be difficult to assess such things, because you have the L2 cache (at least) in the way. AFAIK there is no way to bypass the L2, and the L2 will populate at least one 32-byte sector/line on any given access (or perhaps multiple 32-byte sectors/lines). So this is a grille you would also have to look through in order to see any underlying pattern such as you are describing.

To give an example, I think it is possible that the four bytes that comprise a single float quantity could be spread across 4 different DRAM chips, depending on the GPU design. Or, alternatively, an entire row from a single DRAM chip might constitute one sector and one 32-byte cache line in the L2.

Sounds difficult to observe (and so identify) that pattern.

Good luck!

Dear Robert,

Thanks for all of your help.

Best.
