how does clock() work

I am trying to understand how clock() works on TITAN (cc3.5). Here is the CUDA code modified from this benchmark: Demystifying GPU Microarchitecture through Microbenchmarking | stuffedcow

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define repeat2(S) S S
#define repeat3(S) repeat2(S) S
#define repeat4(S) repeat2(S S)
#define repeat5(S) repeat4(S) S

__global__ void kclock_test2 (unsigned int *ts, unsigned int *out, int p1, int p2, unsigned int its)
{
    unsigned int t1 = p1;
    unsigned int t2 = p2;
    unsigned int start_time = 0, stop_time = 0;
    unsigned int pid = blockIdx.x*blockDim.x + threadIdx.x;

    for (int i = 0; i < its; i++)
    {
        start_time = clock();
        repeat5(t1+=t2;t2+=t1;);
        stop_time = clock();
    }

    out[0] = t1+t2;
    ts[pid*2] = start_time;
    ts[pid*2+1] = stop_time;
}

int main()
{
    //measure_clock();
    unsigned int ts[1024];
    unsigned int *d_ts;
    unsigned int *d_out;

    // Allocate device array.
    cudaError_t errcode;
    if (cudaSuccess != (errcode = cudaMalloc((void**)&d_ts, sizeof(ts))))
    {
        printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
        printf ("   %s\n", cudaGetErrorString(errcode));
        return 0;
    }
    if (cudaSuccess != (errcode = cudaMalloc((void**)&d_out, 4)))
    {
        printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
        printf ("   %s\n", cudaGetErrorString(errcode));
        return 0;
    }
    cudaGetLastError();

    dim3 Db = dim3(1);
    dim3 Dg = dim3(1,1,1);
    Dg.x = 14; //1 block per SM
    Db.x = 1;  //1 thread per block

    kclock_test2 <<<Dg, Db>>>(d_ts, d_out, 4, 6, 2);
    cudaThreadSynchronize();
    cudaMemcpy(ts, d_ts, sizeof(ts), cudaMemcpyDeviceToHost);
    for (int i=0; i < 14; i++)
        printf ("  Block %02d: start: %08u, stop: %08u, diff: %u\n", i, ts[i*2], ts[i*2+1], ts[i*2+1]-ts[i*2]);
    cudaFree (d_ts);
    cudaFree (d_out);
    return 0;
}

When I disassembled the code using cuobjdump (9.0), here is the relevant part of the SASS code:

S2R R8, SR_CLOCKLO;                        //SCHI: 0x2f  16 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x28  9 cycles
IADD R3, R3, R2;                           //SCHI: 0x28  9 cycles
IADD R2, R2, R3;                           //SCHI: 0x2f  16 cycles
S2R R7, SR_CLOCKLO;                        //SCHI: 0x2f  16 cycles

From the stall counts of each instruction, the difference between R7 and R8 should theoretically be 104 = 16 + 9*8 + 16. But when I run the code on Ubuntu 18.04 with a TITAN, here is the result:

  Block 00: start: 01816203, stop: 01816307, diff: 104
  Block 01: start: 01815641, stop: 01815745, diff: 104
  Block 02: start: 01815448, stop: 01815552, diff: 104
  Block 03: start: 01815143, stop: 01815247, diff: 104
  Block 04: start: 01816237, stop: 01816341, diff: 104
  Block 05: start: 01815643, stop: 01815747, diff: 104
  Block 06: start: 01815446, stop: 01815550, diff: 104
  Block 07: start: 01815134, stop: 01815238, diff: 104
  Block 08: start: 01814428, stop: 01814570, diff: 142
  Block 09: start: 01816237, stop: 01816341, diff: 104
  Block 10: start: 01815644, stop: 01815791, diff: 147
  Block 11: start: 01815446, stop: 01815550, diff: 104
  Block 12: start: 01815143, stop: 01815247, diff: 104
  Block 13: start: 01814425, stop: 01814529, diff: 104

Here are my questions, along with my assumptions:

  1. I am launching 14 blocks because I assume each block will run on its own SMX. From the results, the diff is 104 as expected on some SMs, but not on others. Why do different SMs behave differently?

  2. Every time I run the code, I get different numbers for each SM. Does the clock() value depend on the runtime environment? Or does the hardware perhaps not fully respect the stall counts?

  3. I learned that the clock-reading instruction (S2R R1, SR_CLOCKLO) is a variable-latency instruction for cc < 7.0. However, when I put two clock-reading instructions back to back, I always get the same diff (16 cycles) between the two readings. Does that mean the latency of the clock-reading instruction is always 16 cycles? (A minimal sketch of what I mean is right below.)
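
For reference, here is a minimal sketch of the back-to-back measurement I mean (the kernel and variable names are purely illustrative):

#include <stdio.h>

// Illustrative only: two clock() reads with nothing in between.
__global__ void clock_back2back(unsigned int *d_diff)
{
    unsigned int a = clock();
    unsigned int b = clock();   // read the counter again immediately
    d_diff[0] = b - a;          // store the diff so the reads are not optimized away
}

int main()
{
    unsigned int *d_diff, h_diff = 0;
    cudaMalloc(&d_diff, sizeof(unsigned int));
    clock_back2back<<<1, 1>>>(d_diff);
    cudaMemcpy(&h_diff, d_diff, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("back-to-back clock() diff: %u cycles\n", h_diff);
    cudaFree(d_diff);
    return 0;
}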

Many thanks in advance!!

Starting with your first question, I don’t believe the assumption that each block will run on its own SMX is necessarily true. If you check out https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities, you’ll see that a single CC 3.5 SM can hold 16 resident blocks, 64 resident warps, and 2048 resident threads. Given that you’re launching a single thread per block and only 14 blocks, I don’t see a reason why all of the blocks couldn’t be run on a single SM. I know blocks are distributed by a hardware scheduler, but I’m not sure about the logic behind the mechanism.
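
As a minimal sketch of how you could check the per-SM residency limit for this particular kernel (using the occupancy API; kclock_test2 and the block size of 1 are taken from your code above):

#include <stdio.h>

// Sketch only: assumes kclock_test2 (the kernel from the code above) is
// defined earlier in the same translation unit.
void print_residency_limit()
{
    int resident_blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&resident_blocks,
                                                  kclock_test2,
                                                  1,    // threads per block, matching the launch above
                                                  0);   // dynamic shared memory bytes
    printf("kclock_test2 blocks that can be resident on one SM: %d\n", resident_blocks);
}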

Also, you might want to take a look at the clock example in the CUDA Toolkit samples. There is some rationale provided in there as well.

Number 2, I don’t believe there is a way to tell which SM a block was actually launched on. Please see Bob’s comment below.

The description in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#time-function is quite simple, leading me to believe there is no runtime dependency.

Number 3, where did you read that clock() has variable latency? Not arguing with the statement, just curious. I think it is quite possible that the GPU is doing so little work that you’re not seeing a difference.

You can tell which SM a block was launched on. Each SM has a unique id that is readable from a hardware special-purpose register.

https://devtalk.nvidia.com/default/topic/481465/cuda-programming-and-performance/any-way-to-know-on-which-sm-a-thread-is-running-/1
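
For example, a minimal device-side sketch that reads that register via inline PTX (essentially the same helper that appears in the full code further down):

// Returns the id of the SM the calling thread is running on.
static __device__ __inline__ int my_smid(){
  int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;}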

I tried to get the SM id for each block, and here is the output:

block 9 --> smid 4
block 4 --> smid 9
block 0 --> smid 13
block 10 --> smid 3
block 5 --> smid 8
block 1 --> smid 12
block 7 --> smid 6
block 12 --> smid 1
block 3 --> smid 10
block 13 --> smid 0
block 8 --> smid 5
block 6 --> smid 7
block 11 --> smid 2
block 2 --> smid 11

I launched 14 blocks because the TITAN has 14 SMs, and I expect each block to be scheduled on its own SM. This is verified by the above output.
Sorry about the confusion. Let me ask my question again:

  1. Why does the time difference between two clock() calls differ from SM to SM?

I found the latency of clock() mentioned in the following post:
https://devtalk.nvidia.com/default/topic/1056495/cuda-programming-and-performance/does-clock-measure-actual-gpu-cycles-or-what-/post/5356523/#5356523

There may possibly be things like instruction cache effects. I can’t be specific, but I found that if I launched the kernel many times, eventually the behavior across SMs became more comparable:

$ cat t1650.cu
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define repeat2(S) S S
#define repeat3(S) repeat2(S) S
#define repeat4(S) repeat2(S S)
#define repeat5(S) repeat4(S) S

static __device__ __inline__ int __mysmid(){
  int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;}

__global__ void kclock_test2 (unsigned int *ts, unsigned int *out, unsigned int *sm_id, int p1, int p2, unsigned int its)
{
    unsigned int t1 = p1;
    unsigned int t2 = p2;
    unsigned int start_time = 0, stop_time = 0;
    unsigned int pid = blockIdx.x*blockDim.x + threadIdx.x;

    for (int i = 0; i < its; i++)
    {
        start_time = clock();
        repeat5(t1+=t2;t2+=t1;);
        stop_time = clock();
    }

    out[0] = t1+t2;
    ts[pid*2] = start_time;
    ts[pid*2+1] = stop_time;
    sm_id[blockIdx.x] = __mysmid();
}
int mpc(){
    // Query device 0, report a few properties, and return the SM count.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf(" Name: %s\n", prop.name);
    printf(" Compute capability: %d.%d\n", prop.major, prop.minor);
    printf(" Clock rate: %d\n", prop.clockRate);
    printf(" Total global memory: %ld (%d MB)\n", prop.totalGlobalMem, int(prop.totalGlobalMem*9.5367e-7));
    printf(" Multiprocessor count: %d\n", prop.multiProcessorCount);
    return prop.multiProcessorCount;
}
int main()
{
    //measure_clock();
    unsigned int ts[1024];
    unsigned int *d_ts;
    unsigned int *d_out;
    unsigned int *d_sm_id;
    int my_mpc = mpc();
    cudaMalloc(&d_sm_id, my_mpc*sizeof(d_sm_id[0]));
    // Allocate device array.
    cudaError_t errcode;
    if (cudaSuccess != (errcode = cudaMalloc((void**)&d_ts, sizeof(ts))))
    {
        printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
        printf ("   %s\n", cudaGetErrorString(errcode));
        return 0;
    }
    if (cudaSuccess != (errcode = cudaMalloc((void**)&d_out, 4)))
    {
        printf ("cudaMalloc failed %s:%d\n", __FILE__, __LINE__);
        printf ("   %s\n", cudaGetErrorString(errcode));
        return 0;
    }
    cudaGetLastError();

    dim3 Db = dim3(1);
    dim3 Dg = dim3(1,1,1);
    Dg.x = my_mpc; //1 block per SM
    Db.x = 1;  //1 thread per block

    kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
    kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
    kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
    kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
    kclock_test2 <<<Dg, Db>>>(d_ts, d_out, d_sm_id, 4, 6, 2);
    int *h_sm_id = new int[my_mpc];
    cudaMemcpy(ts, d_ts, sizeof(ts), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_sm_id, d_sm_id, my_mpc*sizeof(d_sm_id[0]), cudaMemcpyDeviceToHost);
    for (int i=0; i < my_mpc; i++)
        printf ("  Block %02d: start: %08u, stop: %08u, diff: %u, smid: %d\n", i, ts[i*2], ts[i*2+1], ts[i*2+1]-ts[i*2], h_sm_id[i]);
    cudaFree (d_ts);
    cudaFree (d_out);
    return 0;
}
$ nvcc -o t1650 t1650.cu -arch=sm_35
$ CUDA_VISIBLE_DEVICES="1" ./t1650
 Name: Tesla K20Xm
 Compute capability: 3.5
 Clock rate: 732000
 Total global memory: 5977800704 (5700 MB)
 Multiprocessor count: 14
  Block 00: start: 2521671771, stop: 2521671875, diff: 104, smid: 13
  Block 01: start: 204850824, stop: 204850928, diff: 104, smid: 12
  Block 02: start: 530193055, stop: 530193159, diff: 104, smid: 11
  Block 03: start: 4087280488, stop: 4087280592, diff: 104, smid: 10
  Block 04: start: 2521671768, stop: 2521671872, diff: 104, smid: 9
  Block 05: start: 204850858, stop: 204850962, diff: 104, smid: 8
  Block 06: start: 530193035, stop: 530193139, diff: 104, smid: 7
  Block 07: start: 4087280442, stop: 4087280546, diff: 104, smid: 6
  Block 08: start: 1560231018, stop: 1560231122, diff: 104, smid: 5
  Block 09: start: 2521671757, stop: 2521671861, diff: 104, smid: 4
  Block 10: start: 204850859, stop: 204850963, diff: 104, smid: 3
  Block 11: start: 530193035, stop: 530193139, diff: 104, smid: 2
  Block 12: start: 4087280486, stop: 4087280590, diff: 104, smid: 1
  Block 13: start: 1560231017, stop: 1560231121, diff: 104, smid: 0

It’s still not always exactly like the above output, but initially my output was quite chaotic, and as I added more repeated kernel launches it became more predictable.

And it also seems like the variable latency you already mentioned may be a plausible reason for some of the variation.

When I run the above code, with or without the extra kernel launches, on my Tesla V100 I get 56 across the board for diff. Therefore I think variable latency is a likely explanation.

Thanks, Robert. That helps a lot.

I compiled the same code with -arch=sm_61 and checked the SASS code with cuobjdump (10.0). I got several pairs of curly braces in the output SASS code. The following is the beginning of the SASS code, and it shows the braces at the end. Do you know what they mean? I do not see braces with -arch=sm_35, but I do see them with -arch=sm_50 and -arch=sm_60.

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_61
		Function : _Z12kclock_test2PjS_iijS_
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                                     /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                            /* 0x4c98078000870001 */
        /*0010*/                   ISETP.NE.AND P0, PT, RZ, c[0x0][0x158], PT ;      /* 0x4b6b03800567ff07 */
        /*0018*/                   MOV R4, c[0x0][0x154] ;                           /* 0x4c98078005570004 */
                                                                                     /* 0x001ff400fe0007fb */
        /*0028*/                   MOV R10, RZ ;                                     /* 0x5c9807800ff7000a */
        /*0030*/         {         IADD R4, R4, c[0x0][0x150] ;                      /* 0x4c10000005470404 */
        /*0038*/              @!P0 BRA 0x478         }