I want to measure the execution time of a loop that repeatedly launches a kernel:
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
for (int i = 0; i < 100; i++)
    kernel_call<<<dimGrid, dimBlock, 0>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
I expect the reported time to scale linearly with the iteration count (here 100), but it does not: it stays constant regardless of the count.
What adds to my confusion is that this behavior depends on the kernel code. For certain kernels the time does scale linearly, but for my particular kernel (matrixMultiply from CUDA 7.0 Samples) it does not.
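For reference, here is a minimal sketch of the same measurement with every runtime call and launch checked for errors, in case the constant time comes from launches that fail silently. That cause is only a guess and has not been confirmed. `dummy_kernel`, the buffer size, and the launch dimensions are placeholders made up for illustration, not the matrixMul code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (an assumption); the real code launches the matrixMul kernel.
__global__ void dummy_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = 0.0f;
        for (int k = 0; k < 1000; k++) v += k * 0.5f;
        out[i] = v;
    }
}

// Print the error and leave main() if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main()
{
    const int n = 1 << 20;              // placeholder problem size
    float *d_out = NULL;
    CHECK(cudaMalloc(&d_out, n * sizeof(float)));

    dim3 dimBlock(256);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Time the loop for several iteration counts to see how the time scales.
    const int counts[] = {1, 10, 100};
    for (int c = 0; c < 3; c++) {
        CHECK(cudaEventRecord(start, 0));
        for (int i = 0; i < counts[c]; i++) {
            dummy_kernel<<<dimGrid, dimBlock, 0>>>(d_out, n);
            CHECK(cudaGetLastError());      // reports launch errors
        }
        CHECK(cudaEventRecord(stop, 0));
        CHECK(cudaEventSynchronize(stop));  // may also report errors from kernel execution

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("%3d launches: %.3f ms\n", counts[c], ms);
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_out));
    return 0;
}
```

If one of the checks prints an error for the real kernel, the launches are not actually running and the measured time would not be expected to grow with the count. If no error appears, the cause lies elsewhere.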
Can someone explain why? Thanks.
P.S. I would like to provide the complete source code, but I can’t find a way to upload the file. Does anybody know how?
matrixMul.cu (10.9 KB)