First kernel execution takes longer

Hi all

Going straight to my problem: when I run my program, the measured time for the first 1-3 kernel calls is noticeably longer than for the rest. Do you have any idea what causes this?

Main:

#include "stdafx.h"
#include <iostream>
#include <ctime>
#include <cuda.h>
#include <cuda_runtime.h>
#include "ClosestNeighbor1.h"
#include "ClosestNeighbor1.cu"

using namespace std;

int main()
{
	// Number of points
	const int count = 10000;
	
	// Arrays of points
	int *indexOfClosest = (int *)malloc(count * sizeof(int));
	float3 *points = (float3 *)malloc(count * sizeof(float3));

	float3 *d_points;
	int *d_indexOfClosest;

	cudaMalloc(&d_points, count * sizeof(float3));
	cudaMalloc(&d_indexOfClosest, count * sizeof(int));

	// Create a list of random points
	for(int i = 0; i < count; i++)
	{
		points[i].x = (float)((rand()%10000) - 5000);
		points[i].y = (float)((rand()%10000) - 5000);
		points[i].z = (float)((rand()%10000) - 5000);
	}

	cudaMemcpy(d_points, points, sizeof(float3) * count, cudaMemcpyHostToDevice);

	// This var is used to keep track of the fastest time so far
	long fastest = 1000000;

	// Run the algorithm 20 times
	for(int q = 0; q < 20; q++)
	{
		long startTime = clock();

		// Run the algorithm
		//FindClosestCPU(points, indexOfClosest, count);

		FindClosestGPU<<<(count / 1024) + 1, 1024>>>(d_points, d_indexOfClosest, count);
		cudaMemcpy(indexOfClosest, d_indexOfClosest, sizeof(int) * count, cudaMemcpyDeviceToHost);

		long finishTime = clock();

		cout << "Run " << q <<" took " << (finishTime - startTime) << " ms." << endl;

		// If that run was faster, update the fastest time so far
		if((finishTime - startTime) < fastest)
			fastest = (finishTime - startTime);
	}

	// Print out the fastest time 
	cout << "Fastest time: " << fastest << endl;

	// Print the final results to screen
	cout << "Final results:" << endl;
	for(int i = 0; i < 10; i++)
		cout << i << "." << indexOfClosest[i] << endl;

	// Deallocate RAM
	free(points);
	free(indexOfClosest);

	cudaFree(d_indexOfClosest);
	cudaFree(d_points);

	cudaDeviceReset();

	return 0;
}

Kernel:

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void FindClosestGPU(float3* points, int* indices, int count)
{
	if(count <= 1) return;

	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	if(idx < count)
	{
		float3 thisPoint = points[idx];
		float smallestSoFar = 3.40282e38f; // approximately FLT_MAX

		for(int i = 0; i < count; i++)
		{
			if(i == idx) continue;

			float dist = (thisPoint.x - points[i].x) * (thisPoint.x - points[i].x);
			dist += (thisPoint.y - points[i].y) * (thisPoint.y - points[i].y);
			dist += (thisPoint.z - points[i].z) * (thisPoint.z - points[i].z);

			if(dist < smallestSoFar)
			{
				smallestSoFar = dist;
				indices[idx] = i;
			}
		}
	}
}

If you are running this on Windows with the default WDDM driver, measurements may be erratic due to the inherent high overhead of the WDDM driver model and the launch batching the CUDA driver performs to work around that.

Your observation could also be evidence of artifacts caused by the measurement methodology. Have you compared kernel execution times as reported by the profiler? That is the best way to evaluate kernel execution times. If you need to measure from the host, I would suggest the following framework for measuring kernel execution times from the host:

dummy = timer(); // make sure timer code is “warmed up”
cudaDeviceSynchronize(); // ensure all previous GPU work has finished on the current device
start = timer();
kernel<<<…>>>(…);
cudaDeviceSynchronize(); // ensure that kernel has finished
stop = timer();
elapsed = stop - start;

timer() should be based on a high-resolution system timer, not clock(), which has low resolution. On Linux, gettimeofday() is recommended. Note that the above sequence will still include the overhead of the call to cudaDeviceSynchronize(), which is about 15 usec in recent versions of CUDA, if I recall correctly.

When I run your code on linux, modified to use gettimeofday() instead of clock(), these are the results I get:

$ ./t625
Run 0 took 8.951 ms.
Run 1 took 8.931 ms.
Run 2 took 8.93 ms.
Run 3 took 8.927 ms.
Run 4 took 8.929 ms.
Run 5 took 8.93 ms.
Run 6 took 8.927 ms.
Run 7 took 8.928 ms.
Run 8 took 8.928 ms.
Run 9 took 8.926 ms.
Run 10 took 8.929 ms.
Run 11 took 8.926 ms.
Run 12 took 8.93 ms.
Run 13 took 8.929 ms.
Run 14 took 8.929 ms.
Run 15 took 8.926 ms.
Run 16 took 8.928 ms.
Run 17 took 8.926 ms.
Run 18 took 8.927 ms.
Run 19 took 8.925 ms.
Fastest time: 8.925
Final results:
0.9976
1.7568
2.2987
3.3917
4.4618
5.9785
6.2094
7.9354
8.2397
9.6021
$

I thought the first call was supposed to be slower, because of JIT? (Except in this case, one is probably just measuring PCI bandwidth or latency)

BTW (from the example), how come nvcc doesn’t accept “using namespace std;” ?

Not getting the same problem with the Windows 7 64-bit WDDM driver using that exact code;

using CUDA 6.0, GTX 780 Ti, compute 3.5:

Run 0 took 4 ms.
Run 1 took 4 ms.
Run 2 took 4 ms.
Run 3 took 3 ms.
Run 4 took 4 ms.
Run 5 took 4 ms.
Run 6 took 4 ms.
Run 7 took 4 ms.
Run 8 took 3 ms.
Run 9 took 4 ms.
Run 10 took 4 ms.
Run 11 took 4 ms.
Run 12 took 4 ms.
Run 13 took 4 ms.
Run 14 took 4 ms.
Run 15 took 4 ms.
Run 16 took 4 ms.
Run 17 took 4 ms.
Run 18 took 4 ms.
Run 19 took 4 ms.
Fastest time: 3
Final results:
0.1982
1.8115
2.4738
3.1162
4.4517
5.3983
6.4263
7.3513
8.9489
9.7961

Same machine with CUDA 6.5, GTX 980, compute 5.2:

Run 0 took 9 ms.
Run 1 took 9 ms.
Run 2 took 9 ms.
Run 3 took 9 ms.
Run 4 took 9 ms.
Run 5 took 9 ms.
Run 6 took 9 ms.
Run 7 took 10 ms.
Run 8 took 8 ms.
Run 9 took 9 ms.
Run 10 took 9 ms.
Run 11 took 9 ms.
Run 12 took 9 ms.
Run 13 took 9 ms.
Run 14 took 9 ms.
Run 15 took 9 ms.
Run 16 took 9 ms.
Run 17 took 9 ms.
Run 18 took 9 ms.
Run 19 took 9 ms.
Fastest time: 8
Final results:
0.1982
1.8115
2.4738
3.1162
4.4517
5.3983
6.4263
7.3513
8.9489
9.7961

Those were both ‘cold starts’…

I wonder why there is such a big difference between those two GPUs compiled under different versions of CUDA?

Was anybody else able to reproduce this issue in Windows? I never get a huge difference between the first and subsequent runtimes unless I call via a MATLAB mex, and that is a MATLAB-specific problem.

In general, I have never had any WDDM-related issues with the GTX line when compared to Tesla (TCC), at least none that have made a significant difference. I work with both on a daily basis.

Was this an issue mainly with the older GPUs and older versions of CUDA?

The issue doesn’t repro for me (Windows 7, latest drivers). It may be specific to the original poster’s environment. I doubt it is GPU-specific, but there is no way of knowing for sure without trying.

Generally speaking, the CUDA driver has been optimized in recent years to reduce various overheads. There was a note regarding specific improvements in the CUDA 6.5 time frame, either in the release notes, the performance brief, or Mark Harris’s blog post. Despite these improvements, some basic overhead issues and performance artifacts due to launch batching under WDDM still seem to exist (e.g., launch times for an individual kernel still seem to vary fairly widely, say between 5 and 25 usec), while there is almost no variation seen on Linux.

[Later:]
As far as CUDA 6.5-specific launch overhead reduction goes, I was apparently thinking of the MPS improvements documented in the performance brief (obviously unrelated to WDDM):
http://developer.download.nvidia.com/compute/cuda/6_5/rel/docs/CUDA_6.5_Performance_Report.pdf

Surprisingly enough, when I run the code through the NVIDIA profiler, I get the following times:

And a subsequent run with Visual Studio produces nearly the same results:

But as soon as I close the profiler and Visual Studio and then start again, it produces the results from the original post.

The granularity you are seeing now is due to the ordinary timing granularity of the clock() function:

https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/

This can be overridden in any app using a function such as timeBeginPeriod().

Windows will use the minimum period requested by any currently running app (as low as 1 ms).

I’m not sure why you got 1 ms resolution prior to this; it may have something to do with your project setup. I’m also not sure why CudaaduC got 1 ms resolution either. But I’m not a Windows expert.

My guess is that in your current run, however, the lead-in variability is gone.

In retrospect, 30+ ms of variation is probably far too high to attribute to WDDM issues, so I withdraw that comment. I have no idea what the source of that variation is; it doesn’t seem to be repeatable by anyone.

When I run your code on a very old Windows platform (Vista 32-bit, CUDA 4.0, GeForce 8800M GTX, Dell XPS M1730 laptop) I get:

Run 0 took 309 ms.
Run 1 took 303 ms.
Run 2 took 312 ms.
Run 3 took 304 ms.
Run 4 took 312 ms.
Run 5 took 308 ms.
Run 6 took 308 ms.
Run 7 took 309 ms.
Run 8 took 309 ms.
Run 9 took 308 ms.
Run 10 took 307 ms.
Run 11 took 314 ms.
Run 12 took 312 ms.
Run 13 took 302 ms.
Run 14 took 307 ms.
Run 15 took 302 ms.
Run 16 took 311 ms.
Run 17 took 304 ms.
Run 18 took 310 ms.
Run 19 took 309 ms.
Fastest time: 302
Final results:
0.1982
1.8115
2.4738
3.1162
4.4517
5.3983
6.4263
7.3513
8.9489
9.7961

I think I win the prize for the slowest GPU here.