Inconsistent kernel run times

I’ve run into an interesting problem while working on my newest CUDA kernel. The function accepts an array of image data, applies a Gaussian filter and a gradient-of-Gaussian filter, then writes the result of each calculation to separate output arrays. For some reason, the time the function takes to run varies between executions: it takes either ~0.3 seconds or ~0.03 seconds. Based on tests of previous algorithms, the expected run speed is the faster one, and I’d like to figure out how to get it to run at that speed consistently. Here is the kernel code:

__global__ static void gaussianGoGCUDA(float * pImageR, float * pImageG, float * pImageB,
									   float * pImageRGauout, float * pImageGGauout, float * pImageBGauout,
									   float * pImageRGradout, float * pImageGGradout, float * pImageBGradout,
									   float * gaus, float * ggx, float * ggy, int height, int width, int win) {

	__shared__ float dataR[I_SHARED_SIZE * I_SHARED_SIZE];
	__shared__ float dataG[I_SHARED_SIZE * I_SHARED_SIZE];
	__shared__ float dataB[I_SHARED_SIZE * I_SHARED_SIZE];

	int i, j, k, l, hWin;
	float Gau1 = 0, Gau2 = 0, Gau3 = 0;
	float Gx1 = 0, Gx2 = 0, Gx3 = 0, Gy1 = 0, Gy2 = 0, Gy3 = 0;

	i = blockIdx.x * blockDim.x + threadIdx.x;
	j = blockIdx.y * blockDim.y + threadIdx.y;
	int curidx = j * width + i;

	//Initialize shared data with corresponding data from the image.
	//We can use the thread idx numbers to map the image data into the
	//correct index of shared memory.
	if(i < width && j < height){
		dataR[threadIdx.y * I_SHARED_SIZE + threadIdx.x] = pImageR[curidx];
		dataG[threadIdx.y * I_SHARED_SIZE + threadIdx.x] = pImageG[curidx];
		dataB[threadIdx.y * I_SHARED_SIZE + threadIdx.x] = pImageB[curidx];

		//If the current thread is in the last column of the (16x16) block,
		//also populate the win-1 columns that lie outside the current block.
		if(threadIdx.x == 15){
			for(int ii = 1; ii < win; ii++){
				curidx = j * width + (i+ii);
				dataR[threadIdx.y * I_SHARED_SIZE + threadIdx.x+ii] = pImageR[curidx];
				dataG[threadIdx.y * I_SHARED_SIZE + threadIdx.x+ii] = pImageG[curidx];
				dataB[threadIdx.y * I_SHARED_SIZE + threadIdx.x+ii] = pImageB[curidx];
			}
		}

		//Likewise for the last row of the block.
		if(threadIdx.y == 15){
			for(int jj = 1; jj < win; jj++){
				curidx = (j+jj) * width + i;
				dataR[(threadIdx.y+jj) * I_SHARED_SIZE + threadIdx.x] = pImageR[curidx];
				dataG[(threadIdx.y+jj) * I_SHARED_SIZE + threadIdx.x] = pImageG[curidx];
				dataB[(threadIdx.y+jj) * I_SHARED_SIZE + threadIdx.x] = pImageB[curidx];
			}
		}

		//If we're in the last thread of the block, we also need to initialize the
		//lower corner of shared memory that lies outside the current block.
		if(threadIdx.y == 15 && threadIdx.x == 15){
			//for each row of j, add win columns to shared memory
			for(int jj = 1; jj < win; jj++){
				for(int ii = 1; ii < win; ii++){
					curidx = (j+jj) * width + (i+ii);
					dataR[(threadIdx.y+jj) * I_SHARED_SIZE + (threadIdx.x+ii)] = pImageR[curidx];
					dataG[(threadIdx.y+jj) * I_SHARED_SIZE + (threadIdx.x+ii)] = pImageG[curidx];
					dataB[(threadIdx.y+jj) * I_SHARED_SIZE + (threadIdx.x+ii)] = pImageB[curidx];
				}
			}
		}

		//Note: as posted, every in-bounds thread overwrites element 0 with -1.
		dataR[0] = -1;
		dataG[0] = -1;
		dataB[0] = -1;
	}

	//Sync threads to ensure all data is loaded.
	__syncthreads();

	//Indexing here needs to be with respect to our shared memory indexes,
	//using threadIdx and shared memory dimensions instead of image dimensions.
	curidx = threadIdx.y * I_SHARED_SIZE + threadIdx.x;
	if((i+win-1) < width && (j+win-1) < height){
		for(k = 0; k < win; k++){
			for(l = 0; l < win; l++){
				Gau1 += dataR[curidx + (k * I_SHARED_SIZE) + l]*gaus[k*win + l];
				Gx1  += dataR[curidx + (k * I_SHARED_SIZE) + l]*ggx[k*win + l];
				Gy1  += dataR[curidx + (k * I_SHARED_SIZE) + l]*ggy[k*win + l];
				Gau2 += dataG[curidx + (k * I_SHARED_SIZE) + l]*gaus[k*win + l];
				Gx2  += dataG[curidx + (k * I_SHARED_SIZE) + l]*ggx[k*win + l];
				Gy2  += dataG[curidx + (k * I_SHARED_SIZE) + l]*ggy[k*win + l];
				Gau3 += dataB[curidx + (k * I_SHARED_SIZE) + l]*gaus[k*win + l];
				Gx3  += dataB[curidx + (k * I_SHARED_SIZE) + l]*ggx[k*win + l];
				Gy3  += dataB[curidx + (k * I_SHARED_SIZE) + l]*ggy[k*win + l];
			}
		}

		hWin = win / 2;	//half-window offset (the initialization was cut off in the original listing)
		curidx = (j+hWin) * width + (i+hWin);

		//Store results into the output arrays. (The rest of the listing was
		//truncated; gradient magnitude is assumed for the gradient outputs.)
		pImageRGauout[curidx] = Gau1;
		pImageGGauout[curidx] = Gau2;
		pImageBGauout[curidx] = Gau3;
		pImageRGradout[curidx] = sqrtf(Gx1*Gx1 + Gy1*Gy1);
		pImageGGradout[curidx] = sqrtf(Gx2*Gx2 + Gy2*Gy2);
		pImageBGradout[curidx] = sqrtf(Gx3*Gx3 + Gy3*Gy3);
	}
}

Thanks for any advice.

Are you using cudaEvents for timing?
Is it just the first run that’s slower?


I’m using time.h for timing by calling clock() before and after the kernel call. Probably not perfect, but it’s close enough to give an accurate picture.

The slower run actually occurs more often than not, unfortunately. I’d say roughly one out of every seven runs or so gets the faster execution time.

I hope you’re iterating the kernel and then performing a cudaThreadSynchronize before you stop the timer.
Kernel calls are asynchronous and return control to the host immediately. If you stop the timer after a kernel call without performing an operation that requires the kernel to complete (such as a memcpy of the result), then you won’t be timing the kernel execution time but just the driver overhead of performing the kernel call. It could be that it stalls sometimes because the driver queue is full.
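A minimal timing sketch along these lines, using cudaEvents so the measurement brackets the kernel itself (the grid/block dimensions and kernel arguments are placeholders for whatever the original launch uses):

```cuda
cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
gaussianGoGCUDA<<<grid, block>>>(/* ...arguments as in the post... */);
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);              // block the host until the kernel has finished
cudaEventElapsedTime(&ms, start, stop);  // elapsed milliseconds, from GPU-side timestamps

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Since the events are recorded on the device, this also sidesteps any host-side timer resolution issues.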


I am performing a cudaThreadSynchronize before stopping the timer, so that shouldn’t be it. Forgive my ignorance, but could you elaborate a bit on the driver queue being full and how I could check this? I’m assuming that would be reflected in my timings if it were indeed the issue, due to the position of my clock calls. Thanks for all the help.

I got it from tmurray’s post in this thread.

I believe it means that if you fill up the driver queue, the call does not return control to the host; instead, it waits until there is room in the queue once again.

In other words, it will only return control to the host once the queue isn’t full.


What is CLOCKS_PER_SEC on your system? clock() is an extremely low resolution timer and it could very well be that the difference between 0.3 and 0.03 seconds is just 1 tick of the clock.

Use gettimeofday instead: it is good to 10 microseconds.

gettimeofday, according to my man pages, is of undefined resolution. It is usually 1–10 µs, but that’s not guaranteed.
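For what it’s worth, a drop-in replacement for the clock() calls might look like this (plain C on a POSIX system; wall_seconds is a helper name I’ve made up):

```c
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

/* Wall-clock seconds via gettimeofday: microsecond granularity, and it
 * measures elapsed (wall) time, whereas clock() advances only in
 * 1/CLOCKS_PER_SEC steps and measures CPU time. */
static double wall_seconds(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Usage around a kernel call:
 *   double t0 = wall_seconds();
 *   myKernel<<<grid, block>>>(...);
 *   cudaThreadSynchronize();
 *   double t1 = wall_seconds();
 *   printf("elapsed: %f s (clock ticks/s: %ld)\n", t1 - t0, (long)CLOCKS_PER_SEC);
 */
```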

I’m doubtful that the device queue is my problem. I only make two calls to the function in sequence, so it probably isn’t enough to fill up the device queue.

I’m not on the system that gets the unusual timing results, so I can’t tell you exactly what CLOCKS_PER_SEC is right now. However, there is an observable real-time difference between a fast run of the two function calls and a slow run (0.8 seconds for a slow run, 0.2 for a fast run), so I don’t believe it’s simply a timing issue.


Try running the kernel a few hundred times in a row and look at the gputime in the CUDA profiler. It would be interesting to see if the short/long run times persist after dozens of kernel runs.
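A sketch of that experiment (grid, block, and record_time are placeholder names), which also lets you separate any warm-up launches from steady-state ones:

```cuda
for (int r = 0; r < 400; r++) {
	clock_t t0 = clock();
	gaussianGoGCUDA<<<grid, block>>>(/* ...arguments as in the post... */);
	cudaThreadSynchronize();      // wait for the kernel before reading the timer
	clock_t t1 = clock();
	if (r >= 4)                   // discard the first few (warm-up) launches
		record_time((double)(t1 - t0) / CLOCKS_PER_SEC);
}
```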

I’m downloading the CUDA profiler now, but I went ahead and tried this experiment with the timing code I’d been using, and it returned some interesting results. I ran 400 executions of my kernel, and only the first 4 showed the longer 0.3-second execution time; all the rest, from the 5th to the end, showed the faster 0.03 time. It almost looks like the behavior the card displays when you ask it to run the first CUDA operation of a session, but this kernel call is far from the first that I execute.

Maybe it’s the GPU clocks and the screen-saver thing (discussed before)…

Did any screen saver run in the middle of your kernel runs?

Also, multi-core CPUs can cause timing craziness if you don’t tie your thread to a core…

I use “SetThreadAffinityMask(GetCurrentThread(), 1)” – on my Windows System on the thread that profiles. FYI

I had a similar problem, though not quite so drastic (times varied between 30 and 50+ ms). I tried all the ‘solutions’ suggested above, but they didn’t help. I even thought it might be the chip temperature, because the first run was usually the fastest.

Then I did something which I should have been doing anyway - at the end of the program, I did a ‘free’ on three large images I was using (9MB each) on the host side. For some reason, this solved the problem! It shouldn’t have made any difference, but it seemed to be that step which did the trick. Now the program runs consistently (within a few ms each time), at the fastest speed. Relief!

(Mind you, I have another problem. I’m using Visual Studio (C++). When I compile under Release, one of the algorithms is twice as slow as under Debug! If anyone could shed any light on that, I’d be glad to know. The algorithm in question doesn’t do anything exotic: it’s kmeans clustering.)