kernel launch overhead for GTX 280

Hi all,
I measured the kernel launch overhead for the GTX 280 at around 60 microseconds, but I’ve seen papers reporting this time as around 10 microseconds (including thread synchronization). Does this kernel launch overhead depend on the size of the grid and thread blocks? What is the standard way to measure this kernel launch overhead?
Thanks

No replies to this post?!! Please, guys…

A very long time ago, I looked at the launch overhead of the 8800 GTX as a function of the number of blocks and found that it increases linearly. I haven’t looked at it since then. It is a very simple benchmark to perform: just time an empty kernel with an increasing number of blocks and plot the results. I suggest you try it and post the plot!
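Something along these lines should do it; a minimal sketch using CUDA events (the kernel name, the block size of 128 and the repeat count are arbitrary choices):

#include <stdio.h>

__global__ void empty_kernel() {}

int main()
{
	const int repeats = 1000;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// One untimed launch so one-time driver/runtime setup is not counted.
	empty_kernel<<<1, 1>>>();
	cudaThreadSynchronize();

	for (int blocks = 1; blocks <= 16384; blocks *= 2) {
		cudaEventRecord(start, 0);
		for (int i = 0; i < repeats; ++i)
			empty_kernel<<<blocks, 128>>>();
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);

		float ms = 0.0f;
		cudaEventElapsedTime(&ms, start, stop);
		printf("%6d blocks: %f us per launch\n", blocks, 1000.0f * ms / repeats);
	}

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return 0;
}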

I found this paper, http://www.netlib.org/lapack/lawnspdf/lawn202.pdf, which says in section 3.1 that the kernel launch time is around 10 microseconds for older GPUs. I measured this time as 60 microseconds on a GTX 280. I assumed the launch time on the 280 must be lower than, or at least equal to, that of older GPUs; am I right?

The launch overhead (for small kernels) has steadily improved as CUDA has matured. Current hardware with CUDA 2.3 typically gives a launch overhead of around 7-8 microseconds.

You aren’t running on Windows Vista or Windows 7, by chance? The overhead on those platforms is much higher compared to Linux.

My lunch time is around 45 minutes to an hour, but then I don’t really like to eat kernels as I am not a vegetarian. ;-)

Christian

I’m running on Windows XP64 and CUDA 2.3! Could this be the reason? :blink:

Good one!

XP64 launch overhead is <10 microseconds, depending on your PCIe bandwidth. Vista/Win7 is… not.

I did this test a week ago; here are some numbers on a GTX 280 (ignore the data size column, it is an empty function call):

Empty kernel call only:
datasize(bytes) blockdim griddim time(s)
1 (1,1,1) (1,1,1) 0.000015
2 (2,1,1) (1,1,1) 0.000014
4 (4,1,1) (1,1,1) 0.000016
8 (8,1,1) (1,1,1) 0.000015
16 (16,1,1) (1,1,1) 0.000015
32 (32,1,1) (1,1,1) 0.000015
64 (64,1,1) (1,1,1) 0.000014
128 (128,1,1) (1,1,1) 0.000015
256 (256,1,1) (1,1,1) 0.000016
512 (512,1,1) (1,1,1) 0.000015
1024 (512,1,1) (2,1,1) 0.000015
2048 (512,1,1) (4,1,1) 0.000014
4096 (512,1,1) (8,1,1) 0.000015
8192 (512,1,1) (16,1,1) 0.000014
16384 (512,1,1) (32,1,1) 0.000017
32768 (512,1,1) (64,1,1) 0.000016
65536 (512,1,1) (128,1,1) 0.000017
131072 (512,1,1) (256,1,1) 0.000018
262144 (512,1,1) (512,1,1) 0.000022
524288 (512,1,1) (1024,1,1) 0.000029
1048576 (512,1,1) (2048,1,1) 0.000039
2097152 (512,1,1) (4096,1,1) 0.000064
4194304 (512,1,1) (8192,1,1) 0.000115
8388608 (512,1,1) (16384,1,1) 0.000212

Do these times include the cudaThreadSynchronize() too?

Would it be possible to share your code so I can do some comparison?

I inserted my code here. For a GTX 280 on Windows XP 64, the resulting time including cudaThreadSynchronize() is 0.48 ms, and excluding cudaThreadSynchronize() it is 0.039 ms!! Is that a lot?! :blink:

Thanks.

#include <stdio.h>
#include <cutil.h>

__global__ void HelloCUDA2(float* device_result1)
{
	// Intentionally empty: only the launch overhead is being measured.
}

int main(int argc, char* argv[])
{
	float *d_A;
	int N = 1500;
	CUDA_SAFE_CALL( cudaMalloc((void**) &d_A, sizeof(float) * N * N) );

	unsigned int timer = 0;
	CUT_SAFE_CALL( cutCreateTimer(&timer) );
	CUT_SAFE_CALL( cutStartTimer(timer) );

	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutDeleteTimer(timer) );

	CUDA_SAFE_CALL( cudaFree(d_A) );
	return 0;
}
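For comparison, the launch can also be timed with CUDA events, which have roughly half-microsecond resolution and measure the interval between the two events on the device. A minimal sketch, using the same kernel and sizes as above but the plain runtime API instead of the cutil helpers:

#include <stdio.h>

__global__ void HelloCUDA2(float* device_result1)
{
	// Empty kernel: only the launch is of interest.
}

int main(int argc, char* argv[])
{
	float *d_A;
	int N = 1500;
	cudaMalloc((void**)&d_A, sizeof(float) * N * N);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// Record an event, launch, record another event, then wait for the stop event.
	cudaEventRecord(start, 0);
	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("Time between events: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(d_A);
	return 0;
}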

In my case it is much more than 10 microseconds!! I inserted my code in reply to gshi's post… please take a look and see if I am doing something wrong!?!

Thank you.

First kernel launch takes longer.

Try cooking your kernel first. No, this is not another food joke.

Cooking means launching it once (e.g. after the cudaMalloc) and doing a cudaThreadSynchronize(). Only then do your timing.
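Applied to the code posted above, a cooked version would look roughly like this (a sketch with the same cutil timer; the only new part is the extra launch plus synchronize before the timer starts):

#include <stdio.h>
#include <cutil.h>

__global__ void HelloCUDA2(float* device_result1) {}

int main(int argc, char* argv[])
{
	float *d_A;
	int N = 1500;
	CUDA_SAFE_CALL( cudaMalloc((void**) &d_A, sizeof(float) * N * N) );

	// "Cook" the kernel: one untimed launch absorbs the one-time setup cost.
	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	unsigned int timer = 0;
	CUT_SAFE_CALL( cutCreateTimer(&timer) );
	CUT_SAFE_CALL( cutStartTimer(timer) );

	// Timed launch of the already-cooked kernel.
	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutDeleteTimer(timer) );

	CUDA_SAFE_CALL( cudaFree(d_A) );
	return 0;
}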

Yes, the first kernel launch needs to be outside the loop. Here is the code:

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <unistd.h>
#include <sys/time.h>
#include <sched.h>

/* Synchronize, then check for and report any pending CUDA error. */
#define CUERR  do{ cudaError_t err; \
		cudaThreadSynchronize(); \
		if ((err = cudaGetLastError()) != cudaSuccess) { \
			printf("ERROR: CUDA error: %s, line %d\n", cudaGetErrorString(err), __LINE__); \
			exit(-1); \
		}}while(0)

/* Wall-clock time in seconds. */
double
gettime(void)
{
	struct timeval t;
	gettimeofday(&t, NULL);
	return t.tv_sec + 0.000001 * t.tv_usec;
}

/* Empty kernel: only the launch overhead is measured. */
__global__ void
empty_kernel(void* p, int len)
{
}

int
kernel_launch_test()
{
	char* A = NULL;
	char* hostA = NULL;
	double t0, t1;
	unsigned int max_data_size = 1 << 29;
	int i;
	int repeat_num = 1000;

	char hostname[64];
	gethostname(hostname, 64);
	printf("Running on host %s\n", hostname);

	/* CPU affinity mask for core 0 (prepared but not applied here). */
	cpu_set_t cpuset;
	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);

	int device = 1;   /* GPU index to test */
	cudaSetDevice(device); CUERR;

	int datasize = 1;
	cudaMallocHost((void**)&hostA, max_data_size); CUERR;
	cudaMalloc((void**)&A, max_data_size); CUERR;
	cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice); CUERR;

	/* Warm-up: the first launch is kept outside the timed loop. */
	empty_kernel<<<1, 1>>>(A, datasize);

	printf("datasize(bytes)  blockdim	  griddim		time(s)\n");

	while (datasize <= max_data_size) {
		int blocksize = (datasize <= 512) ? datasize : 512;
		int gridsize_x = datasize / blocksize;
		int gridsize_y = 1;
		if (gridsize_x == 0) {
			gridsize_x = 1;
		}
#define K64 (32*1024)
		/* Split the grid into 2D once the x dimension exceeds 32K blocks. */
		if (gridsize_x > K64) {
			gridsize_y = gridsize_x / K64;
			gridsize_x = K64;
		}
		dim3 griddim(gridsize_x, gridsize_y);

		/* Time repeat_num launches and report the average per launch.
		   Note: CUERR synchronizes after every launch, so that cost is included. */
		t0 = gettime();
		for (i = 0; i < repeat_num; i++) {
			//cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice); CUERR;
			empty_kernel<<<griddim, blocksize>>>(A, datasize); CUERR;
		}
		cudaThreadSynchronize(); CUERR;
		t1 = gettime();

		double gpu_time = t1 - t0;
		printf("%10d\t(%d,%d,%d) \t(%d,%d,%d) \t%f\n",
			   datasize,
			   blocksize, 1, 1,
			   griddim.x, griddim.y, griddim.z,
			   gpu_time / repeat_num);
		fflush(stdout);

		datasize *= 2;
	}

	cudaFreeHost(hostA); CUERR;
	cudaFree(A); CUERR;
	cudaThreadExit();
	return 0;
}

int main()
{
	kernel_launch_test();
}

Thanks, I had completely forgotten this point…

I cooked the kernel, and now it is close to 20 microseconds! That is still large, isn’t it?!

This also depends on your CPU speed, I would assume. A lower-clocked CPU might take 20 microseconds, a faster one only 15 microseconds, to perform the same task.

Thanks for the code. BTW, what is your platform: Linux or Windows?!

I tried to compile your code, but it cannot find sched.h and unistd.h!
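unistd.h, sys/time.h and sched.h are Linux headers, so that code will only build as-is on Linux. As a rough sketch for a Windows build, you could drop the gethostname and CPU-affinity lines and swap gettime() for a QueryPerformanceCounter-based version, for example:

#include <windows.h>

/* Wall-clock time in seconds; Windows replacement for the gettimeofday() version. */
double gettime(void)
{
	LARGE_INTEGER freq, now;
	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&now);
	return (double)now.QuadPart / (double)freq.QuadPart;
}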

Mine is an AMD Phenom 9850 at 2.5 GHz with 4 GB RAM, and the PCIe bandwidth is 8 GB/s.

I just want to be sure the GPU is running at its best efficiency!