How to get the execution time inside a kernel function?

I want to measure the execution time of different parts (individual code lines) of a kernel function. Are there any good ways to do this?

I have tried using the clock64() function in the kernel: I add timestamps and compute the duration for only one specific thread.

__global__ void reduce2(int* d_in, int* d_out) {
    __shared__ int sdata[THREAD_PER_BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    long long int t1, t2; // clock64() returns long long int
    if(i==TID) {
        t1 = clock64();
    }

    sdata[tid] = d_in[i];
    __syncthreads();
    // reduce

    // eliminate bank conflicts
    // when s=128: thread 0 reads sdata[0] & sdata[128], thread 1 reads sdata[1] & sdata[129], ..., thread 31 reads sdata[31] & sdata[31+128]
    for(unsigned int s=blockDim.x/2; s>0; s>>=1){
        if(tid < s){
            sdata[tid]+=sdata[tid+s];
        }
        __syncthreads();
    }

    if(tid==0) {
        d_out[blockIdx.x] = sdata[0];
    }

    if(i==TID) {
        t2 = clock64();
        float time_cost = (float)(t2 - t1)/CLOCK_RATE_KHZ;
        printf("time cost=%fms\n", time_cost);
    }
}

CLOCK_RATE_KHZ is obtained as follows:

    // A100, GPU Boost Clock=1410MHz
    cudaDeviceProp prop;
    CHECK_ERROR(cudaGetDeviceProperties(&prop, 1));
    clock_t clock_rate = prop.clockRate; // Clock frequency in kilohertz
    std::cout << "clock_rate(kHz): " << clock_rate << std::endl;

main function (GPUTimer is implemented with cudaEvent):

    GPUTimer gputimer;
    gputimer.start();
    reduce2<<<blocks_per_grid, threads_per_block>>>(d_a, d_out);
    CHECK_ERROR(cudaGetLastError());
    gputimer.stop();
    printf("GPU time: %fms\n", gputimer.elapsed_ms());

Here is my result:
clock_rate(kHz): 1410000
time cost=0.002268ms (inside kernel time, one thread)
GPU time: 11.432960ms (whole kernel dur time)

The 0.002268 ms is much less than the whole kernel duration. Is there anything wrong?

The timer code you have written would work for timing a single thread; however, the GPU execution engine may require multiple waves of threads to execute the code one after another if the hardware can’t do it all in one pass. Conceptually, with so much parallelism going on, including multiple waves of threads, it’s not straightforward to just time an arbitrary piece of code within a kernel.


As was stated, there is no good general method to time code segments in CUDA device code. Nsight Compute can collect statistical warp state data, which can be used as an approximation of which instructions or functions executed “the longest” (in terms of cycles rather than time). You can collect the full set, or just the SourceCounters section, and then inspect the “Warp Stall Sampling (All Cycles)” metric on the Source page in the UI. If your kernel uses multiple device functions, you can collapse this page to get a per-function aggregate, too.

These sets/sections also show various instruction execution metrics, which expose a similar aspect of your kernel execution and can be combined with the sampling data for additional insight. See the metric reference for details on all metrics shown on that page.


Thank you very much.

  1. The Source UI shows the CUDA-C and SASS source code and the “Warp Stall Sampling (All Cycles)” value for each line. I wonder how Nsight gets these cycles: is the value sampled from one random warp, or averaged across warps?

  2. “If your kernel uses multiple device functions, you can collapse this page to get a per-function aggregate, too.” Does this mean that even the device functions invoked from the kernel can be seen in the Source UI?

You should find this information in the documentation I linked earlier.

Does this mean that even the device functions invoked from the kernel can be seen in the Source UI?

You will be able to see both global and device functions in the SASS view.

Very nice, I see.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.