While profiling the CUTLASS batched GEMM example with Nsight Compute, I noticed that for

```
int const m = 4096;
int const n = 4096;
int const k = 4096;
int const batch_count = 1;
```

the number of thread instructions (`smsp__thread_inst_executed.sum`) is 86,827,335,680.

However, for

```
int const m = 1024;
int const n = 1024;
int const k = 1024;
int const batch_count = 16;
```

the number of thread instructions (`smsp__thread_inst_executed.sum`) is 21,899,509,760.

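As far as I can tell, both cases multiply two 4096x4096 matrices: the element count of each operand is 16,777,216 in both runs, which is 4096*4096 (and also 16*1024*1024). I therefore expected the instruction counts to be roughly equal.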

Is there an explanation for the difference, or am I misunderstanding the matrix sizes?