cudaEventElapsedTime() does not scale with repeated kernel calls

I want to measure the execution time of a loop that repeatedly launches a kernel:

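cudaEvent_t start, stop;
float time;  // elapsed time in milliseconds

cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// Launch the kernel 100 times back to back
for (int i=0; i<100; i++)
  kernel_call<<<dimGrid, dimBlock, 0>>>();

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait until the stop event has completed
cudaEventElapsedTime(&time, start, stop);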

I expected the reported time to scale with the iteration count (here 100), but it does not: it stays constant regardless of the count.

What adds to my confusion is that this behavior depends on the kernel code. For certain kernels the time does scale linearly, but for my particular kernel (matrixMultiply from the CUDA 7.0 Samples) it does not.

Can someone explain why? Thanks.

PS, I want to provide the complete source code, but I can’t seem to find a way to upload the file. Does anybody know how?
matrixMul.cu (10.9 KB)

I just uploaded the source code. It does matrix multiplication and reports the time using two methods: clock() on the CPU and cudaEventElapsedTime().

On my machine, if I change the number of iterations, cudaEventElapsedTime() always reports the same number, while the time reported by clock() scales with the iteration count.

How to run:

matrixMul.exe -w=640 -iter=10  // iterate 10 times
matrixMul.exe -w=640 -iter=100  // iterate 100 times
matrixMul.exe -w=640 -iter=200  // iterate 200 times

I’d really appreciate it if someone could try my code on their machine and tell me what they get.

You are printing out:

float msecTotal = 0.0f;
error = cudaEventElapsedTime(&msecTotal, start, stop);
....
float msecPerMatrixMul = msecTotal / nIter;
printf("Time based on cudaEventElapsedTime() = %.3f msec\n", msecPerMatrixMul);

That calculates the time per iteration. If you want to see the total time (which would increase with the number of iterations), then print out msecTotal.

It’s no surprise that msecPerMatrixMul doesn’t change.
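To make the difference concrete, here is a minimal sketch that prints both numbers side by side. It is not the code from matrixMul.cu: busyKernel, msecPerLaunch, timeLaunches and report are made-up names, and the kernel is just a hypothetical stand-in that keeps the GPU busy. The total from cudaEventElapsedTime() should grow with the launch count, while the per-launch average should stay roughly flat.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being timed (the thread uses matrixMul).
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k)
            x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Average time per launch; returns 0 for a non-positive launch count.
float msecPerLaunch(float msecTotal, int nIter)
{
    return nIter > 0 ? msecTotal / nIter : 0.0f;
}

// Times nIter back-to-back launches with CUDA events and returns the total
// elapsed time in milliseconds, or a negative value on a CUDA error.
float timeLaunches(int nIter)
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaEvent_t start, stop;
    float msecTotal = -1.0f;

    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < nIter; ++i)
        busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block until the stop event has completed

    if (cudaGetLastError() != cudaSuccess ||
        cudaEventElapsedTime(&msecTotal, start, stop) != cudaSuccess)
        msecTotal = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return msecTotal;
}

// Prints the total and the per-launch average for one launch count.
void report(int nIter)
{
    float msecTotal = timeLaunches(nIter);
    if (msecTotal < 0.0f) { printf("CUDA error\n"); return; }
    printf("iter=%d  total=%.3f msec  per launch=%.3f msec\n",
           nIter, msecTotal, msecPerLaunch(msecTotal, nIter));
}
```

Calling report(10), report(100) and report(200) from main() mirrors the three runs above. Only the "total" column is expected to scale with the iteration count; the very first call may also include one-time startup overhead, so a warm-up launch before timing is a reasonable addition.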

OMG, I can’t believe I made such a naive mistake. Thanks, txbob.