Difference between CPU and GPU kernel execution time

Hi Forum!

I used the CUDA profiler to take a closer look at the kernel runtime and discovered considerable fluctuations. What are the reasons for runtime differences of about 80 msec (340 - 260 = 80) in GPU time? Please take a look at the enclosed screenshot.

So far I also couldn't find an answer to the high CPU times. Why does a calculation of 270 msec on the GPU take 1662 msec of CPU time (e.g. line 27)? That's about 6 times slower! Also please take a look at line 28: there the GPU time is 304 msec while the CPU time is a very long 4984 msec, so the CPU time is about 16 times the actual GPU runtime. Is this a consequence of the latency of the memory chips on the video card? How can I minimize the difference between the GPU kernel execution time and the elapsed CPU time?

Maybe that's also the reason for the long GPU idle time between the upload to the device and the kernel execution (please take a look at the enclosed screenshot "idletime.png"). How can I minimize this gap? Should I (or rather: could I) make use of streams or asynchronous memcpys (see the sketch after my code below)? As far as I understand, that would only reduce the gap between upload and kernel execution, but how can I bring the CPU time close to the GPU kernel time? I can't see the forest for the trees. May I ask you for some help? Enclosed you find my entire code.

Thanks for your help.

Sandra

texture<unsigned char, 2, cudaReadModeElementType> imgTex;

__global__ void debayer_TM( uchar3 *res, int width, int height )
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	// initialise the output pixel so border threads don't write uninitialised data
	uchar3 c = make_uchar3(0, 0, 0);

	// pick the neighbouring texels according to the position in the Bayer pattern
	if ( x > 1 && y > 1 && x < width && y < height )
	{
		if ( (x&1)==0 )
		{
			if ( (y&1)==0 )
			{
				c.x = tex2D(imgTex, x-1, y-1); //blue
				c.y = tex2D(imgTex, x-1, y  ); //green
				c.z = tex2D(imgTex, x  , y  ); //red
			}
			else
			{
				c.x = tex2D(imgTex, x-1, y  ); //blue
				c.y = tex2D(imgTex, x  , y  ); //green
				c.z = tex2D(imgTex, x  , y-1); //red
			}
		}
		else
		{
			if ( (y&1)==0 )
			{
				c.x = tex2D(imgTex, x  , y-1); //blue
				c.y = tex2D(imgTex, x  , y  ); //green
				c.z = tex2D(imgTex, x-1, y  ); //red
			}
			else
			{
				c.x = tex2D(imgTex, x  , y  ); //blue
				c.y = tex2D(imgTex, x-1, y  ); //green
				c.z = tex2D(imgTex, x-1, y-1); //red
			}
		}
	}

	// write result (guarded so threads outside the image never write out of bounds)
	if ( x < width && y < height )
		res[y*width + x] = c;
}

extern "C" { void CudaDeBayerTM( IplImage *iplIn, IplImage *iplOut )  

{

	//declare device pointer

	uchar3 *DEVres;

	cudaArray *imgArray;

	// create channel descriptor for 2D cuda array

	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<unsigned char>();

	//malloc device memory

	int size = sizeof(unsigned char)*iplIn->width*iplIn->height;

	cudaMallocArray(&imgArray, &channelDesc, iplIn->width, iplIn->height);

	cudaMalloc((void**)&DEVres, sizeof(uchar3)*iplIn->width*iplIn->height);

	//copy host2device

	cudaMemcpy2DToArray(imgArray, 0, 0, (unsigned char*) iplIn->imageData, sizeof(unsigned char) * iplIn->widthStep, sizeof(unsigned char) * iplIn->width, iplIn->height, cudaMemcpyHostToDevice);

	// bind the array to the texture

	cudaBindTextureToArray(imgTex, imgArray, channelDesc);

	

	//launch kernel

	dim3 block(16, 16);

	dim3 grid(iplIn->width/block.x, iplIn->height/block.y);

	debayer_TM <<< grid,block >>> ( DEVres, iplIn->width, iplIn->height );

	CUDA_SAFE_CALL(cudaThreadSynchronize());

	

	//copy device2host

	cudaMemcpy(iplOut->imageData, DEVres, sizeof(uchar3)*iplIn->height*iplIn->width, cudaMemcpyDeviceToHost);

	//unsigned char *imgChar = (unsigned char*) &iplIn->imageData[0];

	//free memory on device and host

	cudaFreeArray(imgArray);

	cudaUnbindTexture(imgTex);

	cudaFree(DEVres);

}
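
For reference, this is roughly the kind of streamed, asynchronous upload I was thinking of. It is only a sketch: it reuses imgArray, DEVres, grid and block from the function above and assumes the frame is first staged in page-locked host memory.

// sketch: overlap the upload with other host work using a stream and pinned memory
cudaStream_t stream;
cudaStreamCreate(&stream);

// stage the frame in page-locked (pinned) memory so the copy can run asynchronously
unsigned char *pinnedSrc;
cudaMallocHost((void**)&pinnedSrc, iplIn->widthStep * iplIn->height);
memcpy(pinnedSrc, iplIn->imageData, iplIn->widthStep * iplIn->height);

// asynchronous upload into the cudaArray bound to imgTex
cudaMemcpy2DToArrayAsync(imgArray, 0, 0, pinnedSrc,
                         iplIn->widthStep, iplIn->width, iplIn->height,
                         cudaMemcpyHostToDevice, stream);

// the kernel goes into the same stream, so it starts as soon as the copy has finished
debayer_TM <<< grid, block, 0, stream >>> ( DEVres, iplIn->width, iplIn->height );

// ...other CPU work could run here...

cudaStreamSynchronize(stream);   // block only when the result is actually needed
cudaFreeHost(pinnedSrc);
cudaStreamDestroy(stream);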

Hi,
By “CPU kernel execution”, do you mean device emulation mode?

Hi Preetha!

By “CPU kernel execution” I mean the “CPU time” column that the CUDA profiler prints (please see the screenshot). I don’t use device emulation mode.

It looks like the reason for this may be that kernel launches are asynchronous: the CPU then waits in cudaThreadSynchronize() for the kernel execution to complete, and that waiting shows up in the CPU time.
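
One way to see this is to time the kernel with CUDA events instead of looking only at the profiler’s CPU column. A rough sketch, reusing the grid, block, DEVres and iplIn from your code (error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                  // queued on the GPU, returns immediately
debayer_TM <<< grid, block >>> ( DEVres, iplIn->width, iplIn->height );
cudaEventRecord(stop, 0);

// the launch itself is asynchronous: the CPU gets here almost instantly and then
// blocks in the next call until the kernel has actually finished on the device
cudaEventSynchronize(stop);

float gpu_ms = 0.0f;
cudaEventElapsedTime(&gpu_ms, start, stop); // time between the two events, in milliseconds
printf("kernel time: %f ms\n", gpu_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

The elapsed event time should come out close to the profiler’s GPU time, while the wall-clock time around the whole sequence also contains the launch overhead and the wait.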

All those times are in microseconds, so the variation is 1000 times less than you think it is.

You should consider doing the memory allocation and deallocation only once, and keeping the allocations persistent for the life of the application. On some platforms you might find your code spends as much time managing memory as actually running the kernel.
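
As a rough sketch of what that could look like (the init/process/shutdown split and the function names are made up; it reuses your imgTex texture and debayer_TM kernel):

// allocate the device resources once and reuse them for every frame
static cudaArray *imgArray = NULL;
static uchar3    *DEVres   = NULL;

extern "C" void CudaDeBayerInit( int width, int height )
{
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<unsigned char>();
	cudaMallocArray(&imgArray, &channelDesc, width, height);
	cudaMalloc((void**)&DEVres, sizeof(uchar3) * width * height);
	cudaBindTextureToArray(imgTex, imgArray, channelDesc);
}

extern "C" void CudaDeBayerProcess( IplImage *iplIn, IplImage *iplOut )
{
	// per-frame work only: upload, kernel, download
	cudaMemcpy2DToArray(imgArray, 0, 0, iplIn->imageData,
	                    iplIn->widthStep, iplIn->width, iplIn->height,
	                    cudaMemcpyHostToDevice);

	dim3 block(16, 16);
	dim3 grid(iplIn->width / block.x, iplIn->height / block.y);
	debayer_TM <<< grid, block >>> ( DEVres, iplIn->width, iplIn->height );

	cudaMemcpy(iplOut->imageData, DEVres,
	           sizeof(uchar3) * iplIn->width * iplIn->height,
	           cudaMemcpyDeviceToHost);
}

extern "C" void CudaDeBayerShutdown( void )
{
	cudaUnbindTexture(imgTex);
	cudaFreeArray(imgArray);
	cudaFree(DEVres);
}

This way the cudaMalloc/cudaFree calls (and the texture binding) are paid once per application run instead of once per frame.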