Metric for number of threads launched

Hi,
I want to get the exact number of launched threads from a profiler metric. However, with the following example I don't see that. Consider this vector addition code:

__global__ void vecAdd(double *a, double *b, double *c, int n) {
    // Get our global thread ID
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n) c[id] = a[id] + b[id];
}
...
    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n / blockSize);

    // Execute the kernel
    CUDA_SAFECALL((vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n)));

For ./vectorAdd 10000 the profiler reports:

    launch__block_size                    1,024
    launch__grid_size                     10
    launch__thread_count      thread      10,240
    sm__ctas_launched.sum                 10

Here the thread count is block size * grid size, which is correct. However, given the number of input elements, 240 of those threads are extra, and because of the “if” statement they do no work. So I want to know whether there is any profiler metric that shows 10000 as the real thread count.

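The profiler can only show what is submitted to the CUDA API. In this case the CUDA grid is (10,1,1) x (1024,1,1). How would the CUDA driver or the GPU know the number 10000? The value n is just a kernel argument: all 10,240 threads are launched, and the extra 240 simply fail the bounds check inside the kernel.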
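To make the distinction concrete, here is a minimal host-side sketch in plain C++ that walks the same 1-D launch configuration on the CPU and counts how many thread IDs pass the `id < n` guard. It is only an illustration of the arithmetic, not a profiler feature: the names `LaunchCounts`, `grid_size_for`, and `count_threads` are hypothetical and do not come from the CUDA toolkit or the original post.

```cpp
#include <cstddef>

// Hypothetical helper type: the two thread counts discussed in the thread.
struct LaunchCounts {
    std::size_t launched;  // gridSize * blockSize, what launch__thread_count reports
    std::size_t active;    // threads whose global ID passes the `id < n` guard
};

// Integer ceiling division for the grid size; an alternative to
// ceil((float)n / blockSize) that avoids float rounding for large n.
std::size_t grid_size_for(std::size_t n, std::size_t blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Emulate a 1-D grid on the host and count the threads that would do work.
LaunchCounts count_threads(std::size_t gridSize, std::size_t blockSize,
                           std::size_t n) {
    LaunchCounts counts{gridSize * blockSize, 0};
    for (std::size_t block = 0; block < gridSize; ++block) {
        for (std::size_t thread = 0; thread < blockSize; ++thread) {
            // Same formula as blockIdx.x * blockDim.x + threadIdx.x
            std::size_t id = block * blockSize + thread;
            if (id < n) {
                ++counts.active;  // this thread would execute the addition
            }
        }
    }
    return counts;
}
```

Calling `count_threads(grid_size_for(10000, 1024), 1024, 10000)` gives 10240 launched and 10000 active threads, which matches the numbers in the question. On the device itself, one possible approach (an assumption, not something the profiler provides) would be to have each thread that passes the guard increment a global counter with `atomicAdd` and read that counter back after the kernel finishes.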
