# How does clock() work?

I am trying to understand how clock() works on a TITAN (CC 3.5). Here is the CUDA code, modified from this benchmark: http://www.stuffedcow.net/research/cudabmk

```
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define repeat2(S) S S
#define repeat3(S) repeat2(S) S
#define repeat4(S) repeat2(S S)
#define repeat5(S) repeat4(S) S

__global__ void kclock_test2 (unsigned int *ts, unsigned int *out, int p1, int p2, unsigned int its)
{
unsigned int t1 = p1;
unsigned int t2 = p2;
unsigned int start_time = 0, stop_time = 0;
unsigned int pid = blockIdx.x*blockDim.x + threadIdx.x;

for (int i = 0; i < its; i++)
{
start_time = clock();
repeat5(t1+=t2;t2+=t1;);
stop_time = clock();
}

out[0] = t1+t2;
ts[pid*2] = start_time;
ts[pid*2+1] = stop_time;
}

int main()
{
//measure_clock();
unsigned int ts[1024];
unsigned int *d_ts;
unsigned int *d_out;

// Allocate device array.
cudaError_t errcode;
if (cudaSuccess != (errcode = cudaMalloc((void**)&d_ts, sizeof(ts))))
{
printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
printf ("   %s\n", cudaGetErrorString(errcode));
return 0;
}
if (cudaSuccess != (errcode = cudaMalloc((void**)&d_out, 4)))
{
printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
printf ("   %s\n", cudaGetErrorString(errcode));
return 0;
}
cudaGetLastError();

dim3 Db = dim3(1);
dim3 Dg = dim3(1,1,1);
Dg.x = 14; //1 block per SM
Db.x = 1;  //1 thread per block

kclock_test2 <<<Dg, Db>>>(d_ts, d_out, 4, 6, 2);
cudaMemcpy(ts, d_ts, sizeof(ts), cudaMemcpyDeviceToHost);
for (int i=0; i < 14; i++)
printf ("  Block %02d: start: %08u, stop: %08u, diff: %u\n", i, ts[i*2], ts[i*2+1], ts[i*2+1]-ts[i*2]);
cudaFree (d_ts);
cudaFree (d_out);
return 0;
}
```

When I disassembled the code using cuobjdump (9.0) here is the relevant part of the SASS code:

```
S2R R8, SR_CLOCKLO;                        //SCHI: 0x2f  16 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x2f  16 cycles
S2R R7, SR_CLOCKLO;                        //SCHI: 0x2f  16 cycles
```

From the stall counts of each instruction, the difference between R7 and R8 should theoretically be 104 = 16 + 9*8 + 16. But when I run the code on Ubuntu 18.04 with the TITAN, here is the result:

```
Block 00: start: 01816203, stop: 01816307, diff: 104
Block 01: start: 01815641, stop: 01815745, diff: 104
Block 02: start: 01815448, stop: 01815552, diff: 104
Block 03: start: 01815143, stop: 01815247, diff: 104
Block 04: start: 01816237, stop: 01816341, diff: 104
Block 05: start: 01815643, stop: 01815747, diff: 104
Block 06: start: 01815446, stop: 01815550, diff: 104
Block 07: start: 01815134, stop: 01815238, diff: 104
Block 08: start: 01814428, stop: 01814570, diff: 142
Block 09: start: 01816237, stop: 01816341, diff: 104
Block 10: start: 01815644, stop: 01815791, diff: 147
Block 11: start: 01815446, stop: 01815550, diff: 104
Block 12: start: 01815143, stop: 01815247, diff: 104
Block 13: start: 01814425, stop: 01814529, diff: 104
```

Here are my questions, along with my assumptions:

1. I am launching 14 blocks because I assume each block will run on a separate SMX. From the result, the diff is 104 as expected on some SMs, but not on others. Why do different SMs behave differently?

2. Every time I run the code, I get different numbers for each SM. Does the clock() value depend on some runtime environment? Or maybe the hardware does not respect the stall counts completely?

3. I learned that the clock-reading instruction (S2R R1, SR_CLOCKLO) is a varying-latency instruction for cc < 7.0. However, when I put two clock-reading instructions back to back, I always get the same diff (16 cycles) between the two readings. Does that mean the latency of the clock-reading instruction is always 16 cycles?

Starting with your first question, I don't believe your assumption that each block will run on a separate SMX is necessarily true. If you check out https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities, you'll see that a single CC 3.5 SM can hold 16 resident blocks, 64 resident warps, and 2048 resident threads. Given that you're launching a single thread per block and only 14 blocks, I don't see a reason why all blocks couldn't be run on a single SM. I know blocks are distributed by a hardware scheduler, but I'm not sure about the logic behind the mechanism.

Also, you might want to take a look at the clock example in the CUDA Toolkit samples. There is some rationale provided there as well.

For number 2, I don’t believe there is a way to tell which SM a block was actually launched on. (Please see Bob’s comment below.)

The description in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#time-function is quite simple, leading me to believe there is no runtime dependency.

For number 3, where did you read that clock() has varying latency? I'm not arguing with the statement, just curious. I think it is quite possible the GPU is doing so little work that you're not seeing a difference.

You can tell which SM a block was launched on. Each SM has a unique id that is readable from a hardware special-purpose register.

I tried to get the SM id for each block, and here is the output:

```
block 9 --> smid 4
block 4 --> smid 9
block 0 --> smid 13
block 10 --> smid 3
block 5 --> smid 8
block 1 --> smid 12
block 7 --> smid 6
block 12 --> smid 1
block 3 --> smid 10
block 13 --> smid 0
block 8 --> smid 5
block 6 --> smid 7
block 11 --> smid 2
block 2 --> smid 11
```

I launched 14 blocks because the TITAN has 14 SMs, and I expected each block to be scheduled on its own SM. This is confirmed by the output above.

1. Why does the time difference between two clock() calls differ from SM to SM?

I found the latency of clock() in the following post:
https://devtalk.nvidia.com/default/topic/1056495/cuda-programming-and-performance/does-clock-measure-actual-gpu-cycles-or-what-/post/5356523/#5356523

There may possibly be things like instruction cache effects. I can’t be specific, but I found that if I launched the kernel many times, eventually the behavior across SMs became more comparable:

```
$ cat t1650.cu
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define repeat2(S) S S
#define repeat3(S) repeat2(S) S
#define repeat4(S) repeat2(S S)
#define repeat5(S) repeat4(S) S

static __device__ __inline__ int __mysmid(){
int smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}

__global__ void kclock_test2 (unsigned int *ts, unsigned int *out, unsigned int *sm_id, int p1, int p2, unsigned int its)
{
unsigned int t1 = p1;
unsigned int t2 = p2;
unsigned int start_time = 0, stop_time = 0;
unsigned int pid = blockIdx.x*blockDim.x + threadIdx.x;

for (int i = 0; i < its; i++)
{
start_time = clock();
repeat5(t1+=t2;t2+=t1;);
stop_time = clock();
}

out[0] = t1+t2;
ts[pid*2] = start_time;
ts[pid*2+1] = stop_time;
sm_id[blockIdx.x] = __mysmid();
}
int mpc(){

cudaDeviceProp prop;  // declaration was missing in the original listing
cudaGetDeviceProperties(&prop, 0);

printf(" Name: %s\n",prop.name );
printf(" Compute capability: %d.%d\n", prop.major, prop.minor );
printf(" Clock rate: %d\n",prop.clockRate );
printf(" Total global memory: %ld (%d MB)\n", prop.totalGlobalMem, int(prop.totalGlobalMem*9.5367e-7));
printf(" Multiprocessor count: %d\n", prop.multiProcessorCount);
return prop.multiProcessorCount;
}
int main()
{
//measure_clock();
unsigned int ts[1024];
unsigned int *d_ts;
unsigned int *d_out;
unsigned int *d_sm_id;
int my_mpc = mpc();
cudaMalloc(&d_sm_id, my_mpc*sizeof(d_sm_id[0]));
// Allocate device array.
cudaError_t errcode;
if (cudaSuccess != (errcode = cudaMalloc((void**)&d_ts, sizeof(ts))))
{
printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
printf ("   %s\n", cudaGetErrorString(errcode));
return 0;
}
if (cudaSuccess != (errcode = cudaMalloc((void**)&d_out, 4)))
{
printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
printf ("   %s\n", cudaGetErrorString(errcode));
return 0;
}
cudaGetLastError();

dim3 Db = dim3(1);
dim3 Dg = dim3(1,1,1);
Dg.x = my_mpc; //1 block per SM
Db.x = 1;  //1 thread per block

kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
unsigned int *h_sm_id = new unsigned int[my_mpc]; // match the unsigned int device array
cudaMemcpy(ts, d_ts, sizeof(ts), cudaMemcpyDeviceToHost);
cudaMemcpy(h_sm_id, d_sm_id, my_mpc*sizeof(d_sm_id[0]), cudaMemcpyDeviceToHost);
for (int i=0; i < my_mpc; i++)
printf ("  Block %02d: start: %08u, stop: %08u, diff: %u, smid: %d\n", i, ts[i*2], ts[i*2+1], ts[i*2+1]-ts[i*2], h_sm_id[i]);
cudaFree (d_ts);
cudaFree (d_out);
return 0;
}
$ nvcc -o t1650 t1650.cu -arch=sm_35
$ CUDA_VISIBLE_DEVICES="1" ./t1650
Name: Tesla K20Xm
Compute capability: 3.5
Clock rate: 732000
Total global memory: 5977800704 (5700 MB)
Multiprocessor count: 14
Block 00: start: 2521671771, stop: 2521671875, diff: 104, smid: 13
Block 01: start: 204850824, stop: 204850928, diff: 104, smid: 12
Block 02: start: 530193055, stop: 530193159, diff: 104, smid: 11
Block 03: start: 4087280488, stop: 4087280592, diff: 104, smid: 10
Block 04: start: 2521671768, stop: 2521671872, diff: 104, smid: 9
Block 05: start: 204850858, stop: 204850962, diff: 104, smid: 8
Block 06: start: 530193035, stop: 530193139, diff: 104, smid: 7
Block 07: start: 4087280442, stop: 4087280546, diff: 104, smid: 6
Block 08: start: 1560231018, stop: 1560231122, diff: 104, smid: 5
Block 09: start: 2521671757, stop: 2521671861, diff: 104, smid: 4
Block 10: start: 204850859, stop: 204850963, diff: 104, smid: 3
Block 11: start: 530193035, stop: 530193139, diff: 104, smid: 2
Block 12: start: 4087280486, stop: 4087280590, diff: 104, smid: 1
Block 13: start: 1560231017, stop: 1560231121, diff: 104, smid: 0
```

It’s still not always like the above output; initially my output was quite chaotic, but as I added more repeated kernel launches, it became more predictable.

It also seems the variable latency you already mentioned may be a plausible reason for some of the variation.

When I run the above code on my Tesla V100, with or without the extra kernel launches, I get 56 across the board for diff. Therefore I think variable latency is a likely explanation.

Thanks Robert. It helps a lot.

I compiled the same code with -arch=sm_61 and checked the SASS with cuobjdump (10.0). I got several pairs of curly braces in the output SASS. The following is the beginning of the SASS code, and it shows the braces at the end. Do you know what they mean? I do not see braces with -arch=sm_35, but I do see them with -arch=sm_50 and -arch=sm_60.

```
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

code for sm_61
Function : _Z12kclock_test2PjS_iijS_