Getting Different Execution Times of Running Same Kernel Twice

I am getting different timings (shown in bold in the results below) for the same kernel when it is executed twice. Any idea why this happens?

Results:

hepta iterations = 172, err = 774.110537
CPU Execution time: 4943.710000 ms
cuda hepta iterations = 172, err = 774.110537
GPU Execution time: 200.432632 ms
Correct!
cuda hepta iterations = 172, err = 774.110537
GPU second Execution time: 170.358490 ms
Correct!

Following is the code snippet for kernel execution:

TYPE *b, *A_hepta, *x_hepta, *x_hepta_cuda;
cudaCheck(cudaMallocManaged(&A_hepta, n * 7 * sizeof(TYPE)), __LINE__);
cudaCheck(cudaMallocManaged(&b, n * sizeof(TYPE)), __LINE__);
x_hepta = (TYPE*) malloc(n * sizeof(TYPE));
cudaCheck(cudaMallocManaged(&x_hepta_cuda, n * sizeof(TYPE)), __LINE__);

init(A_hepta, b, x_hepta, x_hepta_cuda, n);

int deviceID;
cudaGetDevice(&deviceID);

cudaCheck(cudaMemPrefetchAsync(A_hepta, n * 7 * sizeof(TYPE), deviceID), __LINE__);
cudaCheck(cudaMemPrefetchAsync(b, n * sizeof(TYPE), deviceID), __LINE__);
cudaCheck(cudaMemPrefetchAsync(x_hepta_cuda, n * sizeof(TYPE), deviceID), __LINE__);

struct timeval start, end;
nvtxRangePushA("bicgstab_hepta");
gettimeofday(&start, NULL);
bicgstab_hepta(A_hepta, b, x_hepta, n);
gettimeofday(&end, NULL);
double elapsedTime = (end.tv_sec - start.tv_sec) * 1000.0; // sec to ms
elapsedTime += (end.tv_usec - start.tv_usec) / 1000.0; // us to ms
nvtxRangePop();
printf("CPU Execution time: %.6f ms\n", elapsedTime);

cudaEvent_t cuda_start, cuda_stop;
cudaEventCreate(&cuda_start);
cudaEventCreate(&cuda_stop);
float milliseconds = 0;

nvtxRangePushA("bicgstab_hepta_cuda");
cudaEventRecord(cuda_start);
bicgstab_hepta_cuda(A_hepta, b, x_hepta_cuda, n);
cudaCheck(cudaDeviceSynchronize(), __LINE__);
cudaEventRecord(cuda_stop);
cudaEventSynchronize(cuda_stop);
milliseconds = 0;
cudaEventElapsedTime(&milliseconds, cuda_start, cuda_stop);
nvtxRangePop();
printf("GPU Execution time: %.6f ms\n", milliseconds);

check_result(x_hepta, x_hepta_cuda, n);

// Reallocate and re-initialize CUDA memory
cudaFree(A_hepta);
cudaFree(b);
cudaFree(x_hepta_cuda);

cudaCheck(cudaMallocManaged(&A_hepta, n * 7 * sizeof(TYPE)), __LINE__);
cudaCheck(cudaMallocManaged(&b, n * sizeof(TYPE)), __LINE__);
cudaCheck(cudaMallocManaged(&x_hepta_cuda, n * sizeof(TYPE)), __LINE__);

init(A_hepta, b, x_hepta, x_hepta_cuda, n, GPU_INIT);

cudaCheck(cudaMemPrefetchAsync(A_hepta, n * 7 * sizeof(TYPE), deviceID), __LINE__);
cudaCheck(cudaMemPrefetchAsync(b, n * sizeof(TYPE), deviceID), __LINE__);
cudaCheck(cudaMemPrefetchAsync(x_hepta_cuda, n * sizeof(TYPE), deviceID), __LINE__);

nvtxRangePushA("bicgstab_hepta_second");
cudaEventRecord(cuda_start);
bicgstab_hepta_cuda(A_hepta, b, x_hepta_cuda, n);
cudaCheck(cudaDeviceSynchronize(), __LINE__);
cudaEventRecord(cuda_stop);
cudaEventSynchronize(cuda_stop);
milliseconds = 0;
cudaEventElapsedTime(&milliseconds, cuda_start, cuda_stop);
nvtxRangePop();
printf("GPU second Execution time: %.6f ms\n", milliseconds);

cudaEventDestroy(cuda_start);
cudaEventDestroy(cuda_stop);

check_result(x_hepta, x_hepta_cuda, n);

free(x_hepta);
cudaFree(A_hepta);
cudaFree(b);
cudaFree(x_hepta_cuda);

This is a perfectly normal observation. A best practice of benchmarking is to never measure anything on the first pass, when hardware and software structures (e.g. caches and buffers of various sorts) are “cold”. Instead, run a few “warmup” passes first, then measure performance once the system has reached “steady state”.

Some benchmarks use a simple approach of running the code under test ten times and then reporting the timing of the fastest pass. In my experience, this is particularly useful for memory-intensive workloads, which tend to have the widest run-to-run variation. Another simple approach is to run three passes, with two serving as warmup while the timing of the third pass is reported. That can work well for simple codes, but in my experience steady state is not always reached after two warmup passes.

So you may need to experiment a bit to see how many passes are needed before steady state is reached.
