clock() function on titanX

The clock() function on the TITAN X is not working as expected. The code shown below uses two back-to-back clock() calls to capture latency. When I run the kernel, start_time is ahead of stop_time, which is not expected. If I add a printf after the clock() calls, the code works fine. I am not sure whether this is a known issue or whether it is caused by a very small kernel exiting too soon.

Code not working:

unsigned int start_time = 0, stop_time = 0;
start_time = clock();
stop_time = clock();

Code working:

unsigned int start_time = 0, stop_time = 0;

start_time = clock();
stop_time = clock();

printf("Clock %d %d\n", start_time, stop_time);

Please let me know if anyone has faced a similar issue or has any suggestions.

  1. clock() can roll over.

  2. You may also be experiencing a compiler optimization/reordering effect.

  3. Finally, it’s possible in some circumstances (dynamic parallelism) for the clock function to give spurious results, because threadblocks can be rescheduled from one SM to another.
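
For the roll-over point, a minimal sketch (illustrative code, not from the posts above) of how a 32-bit clock() difference stays correct across a single wrap, and of clock64() as the simpler alternative:

// Wrap-safe difference of two 32-bit clock() samples: unsigned subtraction is
// modulo 2^32, so it remains correct across a single roll-over as long as the
// measured interval is shorter than 2^32 cycles (roughly 4 seconds at ~1 GHz).
__device__ unsigned int elapsed32(unsigned int start, unsigned int stop)
{
    return stop - start;
}

__global__ void rollover_demo(unsigned int *d32, long long int *d64)
{
    unsigned int start32 = (unsigned int)clock();
    long long int start64 = clock64();   // 64-bit counter: no practical roll-over

    // ... timed work would go here (the reordering issue discussed in this
    //     thread still applies to an empty or fully independent region) ...

    unsigned int stop32 = (unsigned int)clock();
    long long int stop64 = clock64();

    *d32 = elapsed32(start32, stop32);
    *d64 = stop64 - start64;
}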

Thanks txbob. The kernel configuration is just a single thread, so I don’t think rescheduling/reordering matters; I checked the PTX and the assembly, and the order of the instructions looks good. I also ran this on the Fermi and Kepler architectures, and there the same code works fine without the printf.

Dynamic parallelism: I thought about it, but I checked the assembly and there is no function call that would indicate dynamic parallelism. It’s just a register move operation to read the clock().

Clock roll-over:
I tried with ‘long long int’ and the clock64() function. Even this gave me similar results; when I add the printf, the timings look good.

Please let me know if you think something else could be causing the problem.

just one possible trick:

start = clock();
x = computation();
if (x!=42) stop = clock();

BulatZiganshin, this is exactly why I was using clock(): to estimate the latency of some computation. In my case, there are 256 dependent add instructions, but the reported latency for these 256 instructions is just 8 cycles, which is not what I expect. I tried printing the start and stop times, but that doesn’t help this time; it still reports 8 cycles.

Or maybe you’ve just made a mistake in your code in the “failing” case. It’s possible that the problem is outside of the code you have shown. You might get better responses if you provide a short, complete code that demonstrates the issue.

__global__ void kclock(unsigned int *ts)
{
	unsigned int start_time = 0, stop_time = 0;

	start_time = clock();
	
	stop_time = clock();

    //printf("Clock %d %d\n", start_time, stop_time);
	
	ts[(blockIdx.x*blockDim.x + threadIdx.x)*2] = start_time;
	ts[(blockIdx.x*blockDim.x + threadIdx.x)*2+1] = stop_time;
}

This is the kernel. I also added the kernel call below.

kclock <<<Dg, Db>>> (d_ts);
	cudaThreadSynchronize();
	cudaMemcpy(ts, d_ts, sizeof(ts), cudaMemcpyDeviceToHost);

The values printed from the data copied back by cudaMemcpy are not correct.

“if (x!=42)” is the key part; it ensures that the second assignment will be performed after the computation of x.

It’s due to instruction reordering by the compiler. If you think it is not, please present your analysis of the SASS code.

If you want further help from me, you’ll need to provide a complete example:

  1. A complete code that I can copy, paste, compile, and run, without having to add anything or change anything.
  2. Your actual output data.
  3. A description of your platform: GPU, OS, CUDA version, compile command

Here is my test case, on CUDA 7.5, Fedora 20, GTX960, borrowing the technique proposed by BulatZiganshin:

$ cat t952.cu
#include <stdio.h>


__global__ void kclock(unsigned int *ts)
{
  unsigned int start_time = 0, stop_time = 0;

  start_time = clock();
#ifdef FIX
  if (start_time != 42)
#endif
  stop_time = clock();

    //printf("Clock %d %d\n", start_time, stop_time);

  ts[(blockIdx.x*blockDim.x + threadIdx.x)*2] = start_time;
  ts[(blockIdx.x*blockDim.x + threadIdx.x)*2+1] = stop_time;
}

int main(){

        int Dg = 1, Db = 1;
        unsigned int *ts, *d_ts;
        ts=(unsigned int *)malloc(Dg*Db*2*sizeof(unsigned int));
        cudaMalloc(&d_ts, Dg*Db*2*sizeof(unsigned int));
        kclock <<<Dg, Db>>> (d_ts);
  cudaDeviceSynchronize();
  cudaMemcpy(ts, d_ts, sizeof(ts), cudaMemcpyDeviceToHost);
        printf("%u, %u\n", ts[0], ts[1]);
        return 0;
}
$ nvcc -arch=sm_52 -o t952 t952.cu
$ ./t952
419170408, 419169995
$ nvcc -DFIX -arch=sm_52 -o t952 t952.cu
$ ./t952
420292229, 420292254
$

With the code modification, the second value is greater than the first.

I am sorry txbob for not providing all the code.
I will try the “if (x!=42)” part BulatZiganshin.

Thanks

  1. Even better, you should include a dependency on both ends of the chain:
start = clock();
if (start != 42) x = 1;
x += y;
if (x != 42) stop = clock();

This ensures that the first clock() is read before the additions and the second one after them. But even in this sequence the additions may be reordered, because the compiler can exploit the mathematical properties of addition.

  2. The optimizer can replace 256 additions with a single multiplication; you need to take special care to prevent that (see the sketch after this post).

Overall, correct and meaningful benchmarking is a knowledge area of its own; it’s full of traps… :(

At least start by providing all the info listed by txbob, and try to look at the generated SASS yourself.
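
Putting the two points together, a sketch (my wording of the suggestion above, with illustrative names) that ties both clock64() reads to a chain of 256 serially dependent adds and consumes the result so the optimizer cannot delete or collapse the chain; float is used because, without fast-math, the compiler is generally not allowed to reassociate the additions:

#include <cstdio>

__global__ void time_add_chain(long long int *cycles, float *result, float seed)
{
    float x = seed;                      // runtime value, so the chain cannot be constant-folded

    long long int start = clock64();
    if (start != 42) x += 1.0f;          // ties the adds to the first clock read

    for (int i = 0; i < 256; i++)        // 256 serially dependent adds
        x += 1.0f;

    long long int stop = 0;
    if (x != 42.0f) stop = clock64();    // ties the second clock read to the adds

    *cycles = stop - start;
    *result = x;                         // consume x so the chain is not dead code
}

int main()
{
    long long int *d_cycles, cycles = 0;
    float *d_result, result = 0.0f;
    cudaMalloc(&d_cycles, sizeof(long long int));
    cudaMalloc(&d_result, sizeof(float));
    time_add_chain<<<1, 1>>>(d_cycles, d_result, 1.0f);
    cudaMemcpy(&cycles, d_cycles, sizeof(long long int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%lld cycles, result %f\n", cycles, result);
    return 0;
}

Checking the generated SASS should then show the two clock reads bracketing the chain of dependent FADDs.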


Adding the condition worked.
Thanks, BulatZiganshin and txbob.

Just a late reply to say thanks; this thread helped me get on the right track.

In my case, I observed the following. I had:

start clock
big body of code (sprinkled with middle clock timings)
end clock

I observed timings that didn’t make much sense: all the clocks were within a few ticks of each other. So I compiled it to a cubin and used the SASS flag to look at the assembly, and the code matched the source order. I was baffled. Finally, after compiling the cubin with the parameter for compute capability 5.2, I observed different assembly; it was doing this:

big body of code
sprinkled middle clock timing attempts, with start clock mixed in there
end clock

It seems that, due to a lack of dependencies, nvcc simply reordered the non-dependent code and threw all the clock reads to the bottom.

I had to force a dependency at the top and the bottom, as BulatZiganshin suggested. My dependent computation was j, so I had:

if (j)
start clock
big body of code computing j, sprinkled with (if j) run middle clocks
if (j)
end clock

The assembly showed it finally kept the clock reads from being reordered. A downside of all this is that it’s harder to measure a single instruction. In my case I’m measuring 1024 instructions, so the overhead of the timing is minimal. I suppose the only way around this is compiling in debug mode, which usually keeps the existing code order? Anyway, I’m back on track, and I don’t think I would have got here without this thread’s help. Thanks again!

An even simpler way is to add

__syncthreads() after the start clock()
__syncthreads() before the stop clock()

This should make sure the clock() reads are not reordered.
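
A minimal sketch of the pattern being suggested here (my code, with some arbitrary work between the two reads; whether the barriers alone are enough to pin the clock reads is the claim above, not something this sketch establishes):

__global__ void time_region_with_barriers(long long int *cycles, float *out, float seed)
{
    float x = seed;

    long long int start = clock64();
    __syncthreads();                 // barrier after the start clock()

    for (int i = 0; i < 256; i++)    // region being timed
        x += 1.0f;

    __syncthreads();                 // barrier before the stop clock()
    long long int stop = clock64();

    if (threadIdx.x == 0) {
        *cycles = stop - start;
        *out = x;                    // keep the work live
    }
}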

What do you want to measure with a single instruction? Delay? Throughput? The GPU executes instructions the way a conveyor builds cars, so before measuring the “car assembly time” you need to decide what exactly you are trying to measure.

Can you expand a bit on thread-block rescheduling? This other answer gives me the impression it is not a thing. What am I getting wrong?

It’s a fairly reliable mental model that threadblocks, once resident on a SM, will stay there until they retire.

However there are a few circumstances that I know of where a threadblock may be removed from a SM, and then rescheduled later, perhaps on another SM.

The general possibility of the CUDA runtime to do this is mentioned in the programming guide here

Possible scenarios include debugging (e.g. under cuda-gdb), CUDA dynamic parallelism, running under MPS on newer (cc6.0+) GPUs, and a GPU that is also servicing a graphical desktop.

This mechanism isn’t documented in any great detail that I am aware of, so I’m unlikely to be able to respond to further questions about it.

Another experimental example of threadblock preemption is given here relative to CUDA MPS, when the experiment is run on newer (cc6.0 and beyond) GPUs.

There is probably a distinction between a threadblock getting preempted and restored on the same SM it was on, and a threadblock getting preempted and restored on a different SM compared to the one it was previously on. I’m not differentiating between these two cases. The programming guide reference I previously gave with respect to dynamic parallelism suggests that at least in the CDP case, it is possible for threadblocks to be suspended and then restored to different SMs.

I wonder if I just hit one of those circumstances.

Here is my story…


I have a dim3 grid, and I am recording the clock for threadIdx.x == 0 of each block, like this:

namespace {

[...]

template<typename scalar_t>
__global__ void calculate_my_operation_kernel (
    const scalar_t* __restrict__ x, const scalar_t* __restrict__ h,
    scalar_t* __restrict__ output, const int method_idx,
    long int* __restrict__ block_clock,
    const int THREADS, const int C, const int PLANE_SIZE, const int PLANE_STEP,
    const int INPUT_STEP, const int OUTPUT_STEP, const int CELL_STEP
    ) {

    const int b = blockIdx.x;
    const int f = blockIdx.y;
    const int cell_idx = blockIdx.z * THREADS + threadIdx.x; // Position in PLANE

    if(cell_idx >= PLANE_SIZE) return; // Ensures thread inside of PLANE_SIZE

    // Indexes for memory addresses
    const int x_idx_base = b*INPUT_STEP + cell_idx*CELL_STEP;
    const int h_idx_base = f*INPUT_STEP + cell_idx*CELL_STEP;

    // Here we create a dependency for calling the clock function before and
    // after the operations we want to time. Having both of these ties is recommended.
    // See https://forums.developer.nvidia.com/t/clock-function-on-titanx/44807/11
    scalar_t out_; int c;
    long int start = clock();
    if (start != 0) out_ = 0.0; // Dependency on start

    for (c = 0; c < C; c++) {
        const int x_idx = x_idx_base + c*PLANE_STEP;
        const scalar_t x_ = x[x_idx];

        const int h_idx = h_idx_base + c*PLANE_STEP;
        const scalar_t h_ = h[h_idx];

        my_operation<scalar_t>[method_idx](x_, h_, &out_);
    }

    long int diff_;
    if (c == C) diff_ = clock() - start; // Dependency on stop

    const int output_idx = b*OUTPUT_STEP + f*PLANE_STEP + cell_idx*CELL_STEP;
    output[output_idx] = out_;

    // We just really care for the clock of one of the threads within a block.
    const auto step_row = gridDim.z;
    const auto step_plane = gridDim.y * gridDim.z;
    if (threadIdx.x == 0) {
        auto clock_idx = step_plane * blockIdx.x + step_row*blockIdx.y + blockIdx.z;
        block_clock[clock_idx] = diff_;
    }
}
} /** End of namespace */

Now, the code above is from a torch CUDA kernel, and the clock data is represented as torch::kLong, as defined in the host function:

[...]

    const auto Z = (H*W + THREADS - 1)/THREADS;
    const dim3 GRID_SIZE(B, F, Z);

    auto block_clock = torch::zeros({B, F, Z},
        torch::TensorOptions().dtype(torch::kLong).device(
            x.device().type(), x.device().index()
            )
    );

[...]

When visualizing the clock per block, I first get all clocks as a single numpy array:

 # Clock shape is [128, 128, 16]  for the picture below.
 data = clock.cpu().numpy().view(np.uint64).reshape(-1,)

Then, when visualizing it as a box plot, I see a maximum clock value that is an extreme outlier, several orders of magnitude larger than the rest.

[Screenshot, 2022-03-10: box plot of per-block clock values]


So, could this discrepancy in clock values be due to one of those circumstances? Or am I casting the unsigned int values improperly somewhere?

Thank you for your time.

Let’s simplify:

  1. are you debugging (that is, running this code under cuda-gdb) ?
  2. are you running this code on a machine that is also running a graphical desktop display using the same GPU you are running your code on?
  3. Does your code employ CUDA dynamic parallelism?
  4. Are you running this code on a single GPU, while employing MPS, with other clients running CUDA codes on the same GPU?

My guess is the answer to all these is “no”. So they are unlikely to be an explanation for what you are seeing.

clock() returns a 32-bit type, which can roll over. The base clock of the GPU is on the order of 1 GHz. A 32-bit counter tracking a ~1 GHz clock will roll over in roughly 5 seconds or less (2^32 cycles at 10^9 cycles per second is about 4.3 s). We haven’t gone into the details of exactly when this window begins, but it should immediately give cause for interest/concern.

To rule out rollover, I would immediately switch to the clock64() function given at the previous link, and do all my accounting using 64-bit integer arithmetic, rather than 32-bit.

To account for the possibility that threadblocks can be preempted (although, as I’ve already stated, I don’t think that is a likely explanation here), one could switch to the more recently available PTX %globaltimer special register. This timer is coherent across all SMs. (Using this timer doesn’t sort out preemption for you; it simply means you’re not going to get weird results like negative times or rollover.)
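
For reference, a sketch (not code from this thread) of reading that timer from device code with inline PTX; %globaltimer is a 64-bit nanosecond counter shared across the whole GPU, so readings taken on different SMs are directly comparable:

__device__ __forceinline__ unsigned long long globaltimer_ns()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));   // read the PTX %globaltimer special register
    return t;
}

__global__ void time_with_globaltimer(unsigned long long *elapsed_ns)
{
    unsigned long long start = globaltimer_ns();
    // ... timed work (the same dependency/ordering caveats discussed earlier apply) ...
    unsigned long long stop = globaltimer_ns();
    *elapsed_ns = stop - start;                              // wall-clock nanoseconds, not SM cycles
}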

If switching to clock64() doesn’t help, then I would investigate your code carefully. Nothing jumps out at me, but I haven’t studied it carefully, and anyway I wouldn’t try to analyze code that has torch wrapped around it.

Thank you so much for pointing that out! I am now using clock64(), as you said.

[Screenshot, 2022-03-11: updated plot of per-block clock values after switching to clock64()]

A couple things got my attention.

  • Type of the data structure to store the clock.

    • I changed these lines in my code to ensure I am using 64-bit data types.
      [...]
      long long int start = clock64();
      [...]
      long long int diff_;
      if (c == C) diff_ = clock64() - start; // Dependency on stop
      [...]
      
    • Since I store diff_ into an array (block_clock), I tried declaring that array’s pointer as long long int*. However, despite a successful compilation, I then got an import error
      ImportError: /home/edreis/miniconda3/lib/python3.7/site-packages/my_extension-0.0.0-py3.7-linux-x86_64.egg/my_extension_cuda.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK2at10TensorBase8data_ptrIxEEPT_v

      which seems to be an issue related to the PyTorch wrapper. Since block_clock only holds the difference in clocks, I kept it as long int *. That raised the following question: what are the sizes of long int and long long int in CUDA device code? Would they be the same, by any chance? (A quick check is sketched after this list.)
  • Given the plot above, my biggest question now is: my code is capturing the clock difference for threadIdx.x == 0 of each block. Why would the blocks differ that much? I was expecting pretty much the same value for each. Could that be due to memory access?
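
On the size question, a quick check (hypothetical sketch, illustrative names) that prints what nvcc actually uses; device code follows the host ABI, so on 64-bit Linux both long int and long long int are 8 bytes, while on Windows long int is 4 bytes:

#include <cstdio>

// Device code follows the host ABI: on 64-bit Linux both sizes below print 8,
// whereas on Windows sizeof(long int) would print 4.
__global__ void print_sizes()
{
    printf("sizeof(long int)      = %llu\n", (unsigned long long)sizeof(long int));
    printf("sizeof(long long int) = %llu\n", (unsigned long long)sizeof(long long int));
}

int main()
{
    print_sizes<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}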

Thank you so much for your attention and help.