Hi
I want to get the exact number of launched threads from a profiler metric. However, with the following example, I don’t see that. Consider this vector addition code:
__global__ void vecAdd(double *a, double *b, double *c, int n) {
    // Get our global thread ID
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n) c[id] = a[id] + b[id];
}
...
// Number of threads in each thread block
blockSize = 1024;
// Number of thread blocks in grid
gridSize = (n + blockSize - 1) / blockSize;  // integer ceiling division, avoids float rounding for large n
// Execute the kernel
CUDA_SAFECALL((vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n)));
Running ./vectorAdd 10000, the profiler reports:
launch__block_size 1,024
launch__grid_size 10
launch__thread_count thread 10,240
sm__ctas_launched.sum 10
Here the thread count is block size × grid size (1,024 × 10 = 10,240), which is correct. However, because there are only 10,000 input elements, 240 of those launched threads fail the “if” check and do no work. So, is there any metric in the profiler that reports 10,000 as the real thread count?
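To make the arithmetic behind those numbers explicit, here is a small host-side sketch. The helper names (gridSizeFor, launchedThreads, idleThreads) are my own and purely illustrative; they are not part of the CUDA API or of any profiler. It only reproduces the launch configuration math, assuming a 1D launch like the one above, and does not measure anything on the GPU.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements,
// computed with integer ceiling division.
constexpr int gridSizeFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical helper: total threads launched (grid size * block size).
// This is the quantity the launch__thread_count line above shows.
constexpr long long launchedThreads(int n, int blockSize) {
    return static_cast<long long>(gridSizeFor(n, blockSize)) * blockSize;
}

// Hypothetical helper: launched threads whose global id is >= n,
// i.e. the ones that fail the bounds check and do no work.
constexpr long long idleThreads(int n, int blockSize) {
    return launchedThreads(n, blockSize) - n;
}
```

For n = 10000 and blockSize = 1024 this gives a grid size of 10, 10,240 launched threads, and 240 threads that fail the bounds check, which matches the stats above and leaves the 10,000 I am looking for.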