First kernel execution takes longer

Hi all

Going straight to my problem. When I run my program, the measured time for the first 1-3 kernel calls is noticeably longer than for the rest. Do you have any idea what’s the cause for this?

Main:

#include "stdafx.h"
#include <iostream>
#include <ctime>
#include <cuda.h>
#include <cuda_runtime.h>
#include "ClosestNeighbor1.h"
#include "ClosestNeighbor1.cu"

using namespace std;

int main()
{
	// Number of points
	const int count = 10000;
	
	// Arrays of points
	int *indexOfClosest = (int *)malloc(count * sizeof(int));
	float3 *points = (float3 *)malloc(count * sizeof(float3));

	float3 *d_points;
	int *d_indexOfClosest;

	cudaMalloc(&d_points, count * sizeof(float3));
	cudaMalloc(&d_indexOfClosest, count * sizeof(int));

	// Create a list of random points
	for(int i = 0; i < count; i++)
	{
		points[i].x = (float)((rand()%10000) - 5000);
		points[i].y = (float)((rand()%10000) - 5000);
		points[i].z = (float)((rand()%10000) - 5000);
	}

	cudaMemcpy(d_points, points, sizeof(float3) * count, cudaMemcpyHostToDevice);

	// This var is used to keep track of the fastest time so far
	long fastest = 1000000;

	// Run the algorithm 20 times
	for(int q = 0; q < 20; q++)
	{
		long startTime = clock();

		// Run the algorithm
		//FindClosestCPU(points, indexOfClosest, count);

		FindClosestGPU<<<(count / 1024) + 1, 1024>>>(d_points, d_indexOfClosest, count);
		cudaMemcpy(indexOfClosest, d_indexOfClosest, sizeof(int) * count, cudaMemcpyDeviceToHost);

		long finishTime = clock();

		cout << "Run " << q <<" took " << (finishTime - startTime) << " ms." << endl;

		// If that run was faster update the fastet time so far
		if((finishTime - startTime) < fastest)
			fastest = (finishTime - startTime);
	}

	// Print out the fastest time 
	cout << "Fastest time: " << fastest << endl;

	// Print the final results to screen
	cout << "Final results:" << endl;
	for(int i = 0; i < 10; i++)
		cout << i << "." << indexOfClosest[i] << endl;

	// Deallocate RAM
	free(points);
	free(indexOfClosest);

	cudaFree(d_indexOfClosest);
	cudaFree(d_points);

	cudaDeviceReset();

	return 0;
}

Kernel

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void FindClosestGPU(float3* points, int* indices, int count)
{
	if(count <= 1) return;

	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	if(idx < count)
	{
		float3 thisPoint = points[idx];
		float smallestSoFar = 3.40282e38f;

		for(int i = 0; i < count; i++)
		{
			if(i == idx) continue;

			float dist = (thisPoint.x - points[i].x) * (thisPoint.x - points[i].x);
			dist += (thisPoint.y - points[i].y) * (thisPoint.y - points[i].y);
			dist += (thisPoint.z - points[i].z) * (thisPoint.z - points[i].z);

			if(dist < smallestSoFar)
			{
				smallestSoFar = dist;
				indices[idx] = i;
			}
		}
	}
}

If you are running this on Windows with the default WDDM driver, measurements may be erratic due to the inherent high overhead of the WDDM driver model and the launch batching the CUDA driver performs to work around that.

Your observation could also be evidence of artifacts caused by the measurement methodology. Have you compared kernel execution times as reported by the profiler? That is the best way to evaluate kernel execution times. If you need to measure from the host, I would suggest the following framework for measuring kernel execution times from the host:

dummy = timer(); // make sure timer code is “warmed up”
cudaDeviceSynchronize(); // ensure all previous GPU work has finished on the current device
start = timer();
kernel<<<…>>>(…)
cudaDeviceSynchronize(); // ensure that kernel has finished
stop = timer();
elapsed = stop - start;

timer() should be based on a high-resolution system timer, not clock() which has low resolution. On Linux, gettimeofday() is recommended. Note that the above sequence will still include the overhead for the call to cudaDeviceSynchronize(), which is about 15 usec in recent versions of CUDA if I recall correctly.

When I run your code on linux, modified to use gettimeofday() instead of clock(), these are the results I get:

$ ./t625
Run 0 took 8.951 ms.
Run 1 took 8.931 ms.
Run 2 took 8.93 ms.
Run 3 took 8.927 ms.
Run 4 took 8.929 ms.
Run 5 took 8.93 ms.
Run 6 took 8.927 ms.
Run 7 took 8.928 ms.
Run 8 took 8.928 ms.
Run 9 took 8.926 ms.
Run 10 took 8.929 ms.
Run 11 took 8.926 ms.
Run 12 took 8.93 ms.
Run 13 took 8.929 ms.
Run 14 took 8.929 ms.
Run 15 took 8.926 ms.
Run 16 took 8.928 ms.
Run 17 took 8.926 ms.
Run 18 took 8.927 ms.
Run 19 took 8.925 ms.
Fastest time: 8.925
Final results:
0.9976
1.7568
2.2987
3.3917
4.4618
5.9785
6.2094
7.9354
8.2397
9.6021
$

I thought the first call was supposed to be slower, because of JIT? (Except in this case, one is probably just measuring PCI bandwidth or latency)

BTW (from the example), how come nvcc doesn’t accept “using namespace std;” ?

Not getting the same problem with Windows 7 64 bit WDDM driver using that exact code;

using CUDA 6.0 GTX 780ti, compute 3.5

Run 0 took 4 ms.
Run 1 took 4 ms.
Run 2 took 4 ms.
Run 3 took 3 ms.
Run 4 took 4 ms.
Run 5 took 4 ms.
Run 6 took 4 ms.
Run 7 took 4 ms.
Run 8 took 3 ms.
Run 9 took 4 ms.
Run 10 took 4 ms.
Run 11 took 4 ms.
Run 12 took 4 ms.
Run 13 took 4 ms.
Run 14 took 4 ms.
Run 15 took 4 ms.
Run 16 took 4 ms.
Run 17 took 4 ms.
Run 18 took 4 ms.
Run 19 took 4 ms.
Fastest time: 3
Final results:
0.1982
1.8115
2.4738
3.1162
4.4517
5.3983
6.4263
7.3513
8.9489
9.7961

same machine with CUDA 6.5, GTX 980 , compute 5.2

Run 0 took 9 ms.
Run 1 took 9 ms.
Run 2 took 9 ms.
Run 3 took 9 ms.
Run 4 took 9 ms.
Run 5 took 9 ms.
Run 6 took 9 ms.
Run 7 took 10 ms
Run 8 took 8 ms.
Run 9 took 9 ms.
Run 10 took 9 ms
Run 11 took 9 ms
Run 12 took 9 ms
Run 13 took 9 ms
Run 14 took 9 ms
Run 15 took 9 ms
Run 16 took 9 ms
Run 17 took 9 ms
Run 18 took 9 ms
Run 19 took 9 ms
Fastest time: 8
Final results:
0.1982
1.8115
2.4738
3.1162
4.4517
5.3983
6.4263
7.3513
8.9489
9.7961

Those were both ‘cold starts’…

I wonder why the big difference between those two GPU compiled under different versions of CUDA?

Was anybody else able to reproduce this issue in Windows? I never get a huge difference the first and subsequent runtime unless I call via a MATLAB mex, and that is a MATLAB specific problem.

In general I have never had any WDDM related issues with the GTX line when compared to Tesla(TCC)., at least none which have made a significant difference. I work with both of them on a daily basis.

Was this an issue mainly with the older GPUs and older versions of CUDA?

The issue doesn’t repro for me (Windows 7, latest drivers). It may be specific to the original posters environment. I doubt it is GPU specific, but there is no way of knowing for sure without trying.

Generally speaking, the CUDA driver has been optimized in recent years to reduce various overheads. There was a note regarding specific improvements in the CUDA 6.5 time frame: either in the release notes, the performance brief, or Mark Harris’s blog post. Despite these improvements, some basic overhead issues and performance artifacts due to launch batching under WDDM still seem to exist (e.g. launch times for an individual kernel still seems to vary fairly widely, say between 5 to 25 usec) while there is almost no variation seen on Linux.

[Later:]
As far as CUDA 6.5 soecific launch overhead reduction goes, I was apparently thinking of the MPS improvements documented in the performance brief (obviously unrelated to WDDM):
[url]http://developer.download.nvidia.com/compute/cuda/6_5/rel/docs/CUDA_6.5_Performance_Report.pdf[/url]

Surprisingly enough, when I run the code through the Nvidia Profiler, I get following times:

And a subsequent run with Visual Studio produces nearly the same results:

But as soon as I close the profiler and Visual Studio and then start it again, it produces the results from original post.

This granularity you are showing now is due to ordinary timing granularity of the clock() function:

https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/

This can be overridden in any app using a function such as timeBeginPeriod().

Windows will use the minimum period requested (as low as 1ms) by any currently running app.

I’m not sure why you got 1ms resolution prior to this - it may have something to do with your project setup. I’m also not sure why CudaaduC got 1ms resolution either. But I’m not a windows expert.

My guess is that in your current run, however, the lead-in variability is gone.

In retrospect, 30+ms variation is probably far too high to attribute to WDDM issues, so I withdraw that comment I made. I have no idea what the source of that variation is. It doesn’t seem to be repeatable by anyone.

When I run your code on a very old windows platform (Vista 32bit, CUDA 4.0, GF8800M GTX, Dell XPS M1730 laptop) I get:

Run 0 took 309 ms.
Run 1 took 303 ms.
Run 2 took 312 ms.
Run 3 took 304 ms.
Run 4 took 312 ms.
Run 5 took 308 ms.
Run 6 took 308 ms.
Run 7 took 309 ms.
Run 8 took 309 ms.
Run 9 took 308 ms.
Run 10 took 307 ms.
Run 11 took 314 ms.
Run 12 took 312 ms.
Run 13 took 302 ms.
Run 14 took 307 ms.
Run 15 took 302 ms.
Run 16 took 311 ms.
Run 17 took 304 ms.
Run 18 took 310 ms.
Run 19 took 309 ms.
Fastest time: 302
Final results:
0.1982
1.8115
2.4738
3.1162
4.4517
5.3983
6.4263
7.3513
8.9489
9.7961

I think I win the prize for the slowest GPU here.