Does %clock measure actual GPU cycles, or what?

I have a short device program which loads %clock before and after some code and reports the difference.
Here’s the program:
if ( num )
{
	asm volatile (
		".reg .pred %p;\n\t"
		"mov.u32 %0, %%clock;\n\t"
		"loop1:\n\t"
		"sub.u32 %2, %2, 1;\n\t"
		//"sub.u32 %2, %2, 0;\n\t"
		//"sub.u32 %2, %2, 0;\n\t"
		//"sub.u32 %2, %2, 0;\n\t"
		"setp.ne.s32 %p, %2, 0;\n\t"
		"@%p bra.uni loop1;\n\t"

		"mov.u32 %1, %%clock;"
		:
		"=r"(start),
		"=r"(stop)
		:
		"r" (num));
}
else
{
	asm volatile (
		""
		"mov.u32 %0, %%clock;\n\t"
		"mov.u32 %1, %%clock;"
		:
		"=r"(start),
		"=r"(stop)
		:
		"r" (n));
}

Results:
Timer 0: 51 (cycles)
Timer 10: 890 (cycles)
Timer 20: 1726 (cycles)
Timer 30: 2546 (cycles)
Timer 40: 3375 (cycles)
Timer 50: 4219 (cycles)
Timer 60: 5040 (cycles)
Timer 70: 5866 (cycles)
Timer 80: 6706 (cycles)
Timer 90: 7527 (cycles)
Timer 100: 8356 (cycles)
Results with 1, 2, or 3 of the sub instructions uncommented:
Timer 0: 51 (cycles)
Timer 10: 1060 (cycles)
Timer 20: 2060 (cycles)
Timer 30: 3063 (cycles)
Timer 40: 4060 (cycles)
Timer 50: 5060 (cycles)
Timer 60: 6056 (cycles)
Timer 70: 7063 (cycles)
Timer 80: 8060 (cycles)
Timer 90: 9060 (cycles)
Timer 100: 10060 (cycles)

Timer 0: 51 (cycles)
Timer 10: 1226 (cycles)
Timer 20: 2405 (cycles)
Timer 30: 3569 (cycles)
Timer 40: 4731 (cycles)
Timer 50: 5905 (cycles)
Timer 60: 7076 (cycles)
Timer 70: 8246 (cycles)
Timer 80: 9415 (cycles)
Timer 90: 10596 (cycles)
Timer 100: 11768 (cycles)

Timer 0: 51 (cycles)
Timer 10: 1406 (cycles)
Timer 20: 2737 (cycles)
Timer 30: 4089 (cycles)
Timer 40: 5419 (cycles)
Timer 50: 6759 (cycles)
Timer 60: 8111 (cycles)
Timer 70: 9442 (cycles)
Timer 80: 10792 (cycles)
Timer 90: 12119 (cycles)
Timer 100: 13467 (cycles)

What I read from this is:

  1. Two %clock reads in a row consistently differ by 51.
  2. Executing the loop takes about 83 cycles per iteration.
  3. Each extra sub.u32 adds about 17 cycles per iteration.

I have a GeForce GT 710; its core clock idles at 135 MHz and quickly ramps up to 953 MHz under load. If I run the test while the clock is already at 953 MHz, the results are the same.

Is %clock actually counting real GPU cycles, and if so, why does the program take so long to run? Or is it counting in faster units than the clock cycle?

What I want to be able to do is determine exactly (or on average, at least) the difference in cycles from one point to another in the program. Any suggestions for a different mechanism would be welcome.

Thanks.

The clock register (%clock) reads core clock cycles. That is referring to the GPU core clock, i.e. the thing that varies between 135 and 953 MHz on your GPU.

It’s often hard for me to understand why people would write source code in PTX. Of course it can be done, but:

  • many of the things you can do in PTX are also doable in C. This appears to be in that bucket. You can read the clock registers directly from CUDA C/C++ source code, no PTX required.
  • PTX is not what the machine executes, the machine executes SASS, and PTX is compiled to SASS code by an optimizing compiler (ptxas, or the GPU driver) which means that the SASS code may or may not resemble the PTX code. This means that trying to understand low-level execution behavior by staring at PTX can be frustrating.
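For instance, here is a minimal sketch of timing a code section from plain CUDA C++ via the clock64() intrinsic, no inline PTX required (the kernel name, the sink parameter, and the loop body are made up for illustration):

```cuda
#include <cstdio>

// Sketch: time a loop with clock64() (which reads the clock register)
// directly from CUDA C++.  timeLoop and sink are illustrative names only.
__global__ void timeLoop(int num, long long *elapsed, int *sink)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        long long start = clock64();
        int count = num;
        while (count > 0)          // stand-in for the PTX loop body
            --count;
        long long stop = clock64();
        *elapsed = stop - start;   // core-clock cycles between the two reads
        *sink = count;             // consume count so the loop isn't deleted
    }
}
```

Even here, the compiler may still reorder or remove work it can prove unused, so the generated SASS should be checked, just as with the PTX version.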

This isn’t a complete program. It is at best a GPU kernel, which, by itself, is not a complete program. Execution behavior (e.g. low-level instruction scheduling) can certainly be influenced by how many warps are in each block you issue, and to a lesser extent, how many blocks, for example.

You may also be operating under some false assumptions:

  • instructions can be issued as soon as they can be fetched - not true. An instruction can be issued only when all of its dependencies have been met.
  • gpu instructions execute in one cycle - also not true. Most execution units are pipelined, and the pipeline depth of an execution unit can easily show up as a scheduling delay for the next instruction, if that instruction depends on the previous one.

Anyway, the upshot of all this is that you're a long way from being able to understand what is going on here. Furthermore, without a complete code, I doubt anyone can concisely walk you through it.

Start by providing a short complete program, and start by studying the SASS. That is what the machine executes.

It’s also nice if you properly format code when you post it here. Start by selecting your code in the edit window, then click the code button (</>) at the top toolbar of the edit window, to mark that section as code.

Without a minimal reproducible example and the target GPU, it is hard to provide more detailed information on your specific case.

  1. Only review SASS. The compiler may move %clock instructions, as it historically has not treated %clock as an instruction fence.

  2. On CC < 7.0, %clock translates into S2R SR_ClockLo. The S2R instruction has variable latency and is issued into an instruction FIFO along with other MIO instructions such as shared memory loads and stores. Because the instruction is not treated as a fence, it may issue before the body yet actually read the clock after the instructions in the body retire. For CC >= 7.0 the compiler should generate CS2R SR_ClockLo, which has a fixed latency.

  3. If the compiler generates code in the sequence listed in your program, the two clock reads measure the issue latency (the time between issuing the two instructions). This can be as low as 1 clock cycle, so the 51 cycles includes other latency.

  4. Measuring and understanding the execution latency of shader code is very tricky once a warp scheduler has more than 1 warp, since the measured time then also reflects decisions made by the warp scheduler regarding instruction fetch and issue. If memory operations are involved, misses also come into play: memory operations to texture and global/local memory complete in order on misses, so another warp's miss can increase the latency seen by all other warps.

My recommendation is to focus on the kernel's throughput as opposed to average warp latency.

The warp stall reasons reported both as kernel metrics and via program counter sampling can help explain why the execution latency between two instructions is higher than expected.

Thanks, Robert, for taking the time to answer in such detail.

  1. Regarding my use of inline PTX, I have two reasons. (a) As far as I know, I cannot directly access the %clock register from C code. The clock () function generates a call to that function on the device, which is not the same as loading the register directly. (b) I am looking to write an optimized kernel, and so I want the control over the instructions that PTX gives me. In fact, I expect to assemble the PTX directly using a KeplerAs program I found on GitHub, rather than CUDA asm () constructs, in order to have control over register assignments.

  2. Here’s the complete program, host and device code (see at the bottom). Sorry for not including this originally; I figured that what portion I supplied would provide enough information to answer my queries. You will see that it is running with just 32 threads and one block. The %clock code runs only in thread 0, to avoid interference between the %clock references in all threads at the same time. By the way, setting num_threads = 1 gives the same timing results.

  3. I tried running my kernel for a longer time period by putting a loop around the timeIt body and doing the copy of the result to g_odata [bid] only the first time. Then I timed the 10 invocations of timeIt () on the host side. Briefly, the host times amount to about 1.2 ns per unit of %clock difference reported (which is most of the time the kernel is running). This agrees with the idea that %clock is running at 953 MHz.

  4. So I hope I can get a comment on my original question, which is why consecutive sub.u32 %count, %count, 0 instructions are being separated by about 16 clock cycles, as well as the other two numbers cited in my OP?

  5. I understand that some instructions are heavily pipelined, but I didn't expect that a sub.u32 would take so long to complete. Does anyone have a reference to a table of measured latencies and throughputs of the various instructions on Kepler?

  6. As another experiment, I replaced the sub sequence with

...
				"sub.u32 %count, %count, 1;\n\t"
				"mov.u32 %dummy, 0;\n\t"
				"sub.u32 %count, %count, 0;\n\t"
				"sub.u32 %dummy, %dummy, 1;\n\t"
				"sub.u32 %count, %count, 0;\n\t"
				"sub.u32 %dummy, %dummy, 1;\n\t"
				"sub.u32 %count, %count, 0;\n\t"
				"sub.u32 %dummy, %dummy, 1;\n\t"

with no extra time being taken. That shows that the processor doesn’t actually stall until the next dependent instruction occurs.

Device and host programs…

__global__ void
timeIt (int num, int *g_odata)
{
	// access thread id
	const unsigned int tid = threadIdx.x;
	// access number of threads in this block
	const int bid = blockIdx.x;
	//const unsigned int num_threads = blockDim.x;

	if ( tid == 0 )
	{
		int n = 0;
		int start, stop;
		if ( num )
		{
			asm volatile (
				".reg .pred %p;\n\t"
				".reg .u32 %count;\n\t"
				"mov.u32 %count, %2;\n\t"
				"mov.u32 %0, %%clock;\n\t"
				"loop1:\n\t"
				"sub.u32 %count, %count, 1;\n\t"
				"sub.u32 %count, %count, 0;\n\t"
				"sub.u32 %count, %count, 0;\n\t"
				"sub.u32 %count, %count, 0;\n\t"
				"setp.ne.u32  %p, %count, 0;\n\t"
				"@%p   bra.uni  loop1;\n\t"

				"mov.u32 %1, %%clock;"
				:
				"=r"(start),
				"=r"(stop)
				:
				"r" (num));
		}
		else
		{
			asm volatile (
				""
				"mov.u32 %0, %%clock;\n\t"
				"mov.u32 %1, %%clock;"
				:
				"=r"(start),
				"=r"(stop)
				:
				"r" (n));
		}
		g_odata [bid] = stop - start;
	}
}

int
main(int argc, char **argv)
{
    int devID = findCudaDevice(argc, (const char **)argv);

    const unsigned int num_threads = 32;
    unsigned int mem_size = sizeof(int) * num_threads;

    // allocate device memory for result
	int *d_odata;
    checkCudaErrors(cudaMalloc((void **) &d_odata, mem_size));

    // setup execution parameters
    dim3  grid(1, 1, 1);
    dim3  threads(num_threads, 1);

	// allocate mem for the result on host side
	int h_odata [num_threads];

	for ( int num = 0 ; num <= 100 ; num += 10 )
	{
		// execute the kernel
		timeIt<<< grid, threads, mem_size >>>(num, d_odata);

		// check if kernel execution generated an error
		getLastCudaError("Kernel execution failed");

		// copy result from device to host
		checkCudaErrors(cudaMemcpy(h_odata, d_odata, sizeof(int) * num_threads,
								   cudaMemcpyDeviceToHost));

		printf ("Timer %3d: %5d (cycles)\n", num, h_odata [0]);
	}

    // cleanup memory
    checkCudaErrors(cudaFree(d_odata));

    return 0;
}

There are whitepapers on microbenchmarking GK110 that put dependent latency for SUB at 9 cycles.

http://lpgpu.org/wp/wp-content/uploads/2013/05/poster_andresch_acaces2014.pdf

Measuring a single instruction (or small set) with clock is not advised on older architectures. It would be really helpful if you would post the SASS. There is nothing wrong with writing in PTX. In order to understand what is executed you should use --keep options and disassemble the SASS.
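For reference, the usual way to get at the SASS (file and kernel names here are assumed; adjust for your build):

```shell
# Keep intermediate files (including the generated PTX) next to the build
nvcc -arch=sm_35 --keep -o timeit timeit.cu

# Disassemble the SASS actually embedded in the executable
cuobjdump -sass timeit
```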

If you are trying to calculate dependent execution latency I would recommend increasing the loop body by 10-20x to remove the impact of the setp and bra. Both setp and bra have a longer pipeline delay than sub. The Nsight VSE profiler and CUDA profiler for GK110 should have warp stall reasons. The raw counts should be available in Nsight VSE profiler. If you divide the count by inst_executed you can get an idea of the number of cycles the warp stalled for each of the reasons. If the loop1 body is increased 10-20x to remove impact of setp and bra I expect you will converge on 9 cycles between dependent sub instructions.

Actually, I don’t have a GK110 after all; I was mistaken. All I know is that it is compute capability 3.5 and has only one group of 192 cores, not 15.

Question answered.
Looking at the control codes in the ISA binary, I see that all instructions carry a 16-cycle (sometimes slightly fewer) stall control code.
It turns out this happens with the -G switch to nvcc. Without it, even with -g, it's fine: the control codes are more reasonable, including the 9 cycles for an integer subtract.